Transcript of Jeremy G. Kahn's PhD dissertation

© Copyright 2010

Jeremy G. Kahn


Parse decoration of the word sequence in the speech-to-text machine-translation pipeline

Jeremy G. Kahn

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

University of Washington

2010

Program Authorized to Offer Degree: Linguistics


University of Washington
Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by

Jeremy G. Kahn

and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final

examining committee have been made.

Chair of the Supervisory Committee:

Mari Ostendorf

Reading Committee:

Mari Ostendorf

Paul Aoki

Emily M. Bender

Fei Xia

Date:


In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”

Signature

Date


University of Washington

Abstract

Parse decoration of the word sequence in the speech-to-text machine-translation pipeline

Jeremy G. Kahn

Chair of the Supervisory Committee:

Professor Mari Ostendorf

Electrical Engineering & Linguistics

Parsing, or the extraction of syntactic structure from text, is appealing to natural lan-

guage processing (NLP) engineers and researchers. Parsing provides an opportunity to

consider information about word sequence and relatedness beyond simple adjacency. This

dissertation uses automatically-derived syntactic structure (parse decoration) to improve

the performance and evaluation of large-scale NLP systems that have (in general) used

only word-sequence level measures to quantify success. In particular, this work focuses on

parse structure in the context of large-vocabulary automatic speech recognition (ASR) and

statistical machine translation (SMT) in English and (in translation) Mandarin Chinese.

The research here explores three characteristics of statistical syntactic parsing: dependency

structure, constituent structure, and parse-uncertainty — making use of the parser’s ability

to generate an M -best list of parse hypotheses.

Parse structure predictions are applied to ASR to improve word-error rate over a baseline

non-syntactic (sequence-only) language model (achieving 6–13% of possible error reduction).

Critical to this success is the joint reranking of an N×M -best list of N ASR hypothesis tran-

scripts and M -best parse hypotheses (for each transcript). Jointly reranking the N×M lists

is also demonstrated to be useful in choosing a high-quality parse from these transcriptions.

In SMT, this work demonstrates expected dependency pair match (EDPM), a new mech-

anism for evaluating the quality of SMT translation hypotheses by comparing them to refer-


ence translations. EDPM, which makes direct use of parse dependency structure in its

measurement, is demonstrated to correlate better with human measurements of translation

quality than the widely-used competitor evaluation metrics BLEU4 and translation edit rate.

Finally, this work explores how syntactic constituents may predict or improve the behav-

ior of unsupervised word-aligners, a core component of SMT systems, over a collection of

Chinese-English parallel text with reference alignment labels. Statistical word-alignment is

improved over several machine-generated alignments by exploiting the coherence of certain

parse constituent structures to identify source-language regions where a high-recall aligner

may be trusted.

These diverse results across ASR and SMT point together to the utility of including

parse information into large-scale (and generally word-sequence oriented) NLP systems and

demonstrate several approaches for doing so.


TABLE OF CONTENTS

Page

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Evaluating the word sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Using parse information within automatic language processing . . . . . . . . 4

1.3 Overview of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Chapter 2: Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1 Statistical parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Reranking n-best lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3 Automatic speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4 Statistical machine translation . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Chapter 3: Parsing Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3 Corpus and experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

Chapter 4: Using grammatical structure to evaluate machine translation . . . . . 61

4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.2 Approach: the DPM family of metrics . . . . . . . . . . . . . . . . . . . . . . 63

4.3 Implementation of the DPM family . . . . . . . . . . . . . . . . . . . . . . . . 66

4.4 Selecting EDPM with human judgements of fluency & adequacy . . . . . . . 68

4.5 Correlating EDPM with HTER . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.6 Combining syntax with edit and semantic knowledge sources . . . . . . . . . 74


4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Chapter 5: Measuring coherence in word alignments for automatic statistical machine translation . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.2 Coherence on bitext spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.3 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.4 Analyzing span coherence among automatic word alignments . . . . . . . . . 88

5.5 Selecting whole candidates with a reranker . . . . . . . . . . . . . . . . . . . . 95

5.6 Creating hybrid candidates by merging alignments . . . . . . . . . . . . . . . 101

5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

Chapter 6: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.1 Summary of key contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.2 Future directions for these applications . . . . . . . . . . . . . . . . . . . . . . 109

6.3 Future challenges for parsing as a decoration on the word sequence . . . . . . 111

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115


LIST OF FIGURES

Figure Number Page

2.1 A lexicalized phrase structure and the corresponding constituent and dependency trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 The models that contribute to ASR. . . . . . . . . . . . . . . . . . . . . . . . 17

2.3 Word alignment between e and f . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.4 The models that make up statistical machine translation systems . . . . . . . 24

3.1 A SParseval example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2 System architecture at test time. . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.3 n-best resegmentation using confusion networks . . . . . . . . . . . . . . . . . 38

3.4 Oracle parse performance contours for different numbers of parses M and recognition hypotheses N on reference segmentations. . . . . . . . . . . . . . . 51

3.5 SParseval performance for different feature and optimization conditions as a function of the size of the N-best list. . . . . . . . . . . . . . . . . . . . . . 56

4.1 Example dependency trees and their dlh decompositions. . . . . . . . . . . . 64

4.2 The dl and lh decompositions of the hypothesis tree in figure 4.1. . . . . . . 64

4.3 An example headed constituent tree and the labeled dependency tree derived from it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.4 Pearson’s r for various feature tunings, with 95% confidence intervals. EDPM, BLEU and TER correlations are provided for comparison. . . . . . . . . . . . 76

5.1 A Chinese sentence and its translation, with reference alignments and alignments generated by unioned GIZA++ . . . . . . . . . . . . . . . . . . . . . . 80

5.2 Examples of the four coherence classes . . . . . . . . . . . . . . . . . . . . . . 83

5.3 Decision trees for VP and IP spans. . . . . . . . . . . . . . . . . . . . . . . . 93

5.4 An example incoherent CP-over-IP. . . . . . . . . . . . . . . . . . . . . . . . . 94

5.5 An example of clause-modifying adverb appearing inside a verb chain . . . . 96

5.6 An example of English ellipsis where Chinese repeats a word. . . . . . . . . . 97

5.7 Example of an NP-guided union. . . . . . . . . . . . . . . . . . . . . . . . . . 103


LIST OF TABLES

Table Number Page

1.1 Two ASR hypotheses with the same WER. . . . . . . . . . . . . . . . . . . . 3

1.2 Word-sequences not considered to match by naïve word-sequence evaluation . 3

3.1 Reranker feature descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.2 Switchboard data partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.3 Segmentation conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.4 Baseline and oracle WER reranking performance from N = 50 word sequence hypotheses and 1-best parse . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.5 Oracle SParseval (WER) reranking performance from N = 50 word sequence hypotheses and M = 1, 10, or 50 parses . . . . . . . . . . . . . . . . . 51

3.6 Reranker feature combinations . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.7 WER on the evaluation set for different sentence segmentations and feature sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.8 Word error rate results comparing γ . . . . . . . . . . . . . . . . . . . . . . . 54

3.9 Results under different segmentation conditions when optimizing for SParseval objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.1 Per-segment correlation with human fluency/adequacy judgements of different combination methods and decompositions. . . . . . . . . . . . . . . . . . 69

4.2 Per-segment correlation with human fluency/adequacy judgements of baselines and different decompositions. N = 1 parses used. . . . . . . . . . . . . . 70

4.3 Considering γ and N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.4 Corpus statistics for the GALE 2.5 translation corpus. . . . . . . . . . . . . . 72

4.5 Per-document correlations of EDPM and others to HTER . . . . . . . . . . . 73

4.6 Per-sentence, length-weighted correlations of EDPM and others to HTER, by genre and by source language. . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.1 Four mutually exclusive coherence classes for a span s and its projected range s′ 83

5.2 GALE Mandarin-English manually-aligned parallel corpora . . . . . . . . . . 84

5.3 The Mandarin-English parallel corpora used for alignment training . . . . . . 86

5.4 Alignment error rate, precision, and recall for automatic aligners . . . . . . . 88

5.5 Coherence statistics over the spans delimited by comma classes . . . . . . . . 89


5.6 Coherence statistics over the spans delimited by certain syntactic non-terminals 91

5.7 Some reasons for IP incoherence . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.8 Reranking the candidates produced by a committee of aligners. . . . . . . . . 99

5.9 Reranking the candidates produced by giza.union.NBEST. . . . . . . . . . . 100

5.10 AER, precision and recall for the bg-precise alignment . . . . . . . . . . . . 101

5.11 AER, precision and recall over the entire test corpus, using various XP -strategies to determine trusted spans . . . . . . . . . . . . . . . . . . . . . . . 104


ACKNOWLEDGMENTS

My advisor, Mari Ostendorf, has been a reliable source of support, encouragement, and

ideas through the process of this work. An amazingly busy and productive engineering

professor, she welcomed me into the Signal Speech and Language Interpretation (SSLI)

laboratory when I was looking only for summer employment — on the condition that I

remain with her for at least another year. It was a good bargain: Mari’s empirical, skeptical,

practical approach to research has served as a model and inspiration, and I am proud

every time I notice myself saying something Mari would have suggested. SSLI’s home in

Electrical Engineering (in a different college, let alone department, from Linguistics) has

been a valuable source of perspective: working in the lab (and with the electrical engineers

and computer scientists there) gives me the unusual privilege of being the “language guy”

among the engineers and the “engineering guy” among the linguists.

My committee of readers was delightfully representative of the intersection between

linguistics and computers. Paul Aoki represented practical translation and the use of com-

puters for language teaching — and provided unstinting positive regard for me and my

work. Emily Bender opened doors for me by opening a master’s program in computational

linguistics at the University of Washington just as I began, creating entire cohorts of pro-

fessional NLP people just across Stevens Way. Fei Xia’s perspectives on Chinese parsing

and on statistical machine translation were welcome on every single revision.

Among my colleagues at SSLI, I would like to acknowledge Becky Bates, who adopted me

as a “big sister” from my first day there, for her clear-eyed, mindful approach to engineering

education and her grounded, open approach to the full experience of the world, even for

those of us who — through practice or predisposition — spend a lot of time in our head

and in the world of words. Dustin Hillard and Kevin Duh shared their enthusiasm and

excitement for engineering and machine learning in application to language problems. Lee


Damon kept the entire lab infrastructure running in the face of thousands of submitted jobs,

many of which were mine. Bin and Wei tolerated both my questions about Chinese and

my eagerly verbose explanations of some of the crookeder corners of the English language.

Alex, Brian, Julie and Amittai were always game for engaging in a discussion about tactics

and strategies for natural-language engineering graduate students, and I am pleased to leave

my role as SSLI morale officer in their hands.

Across the road in Padelford, my colleagues and teachers in the Linguistics department

have also been a pleasure. Beyond my committee members named already, I had the

pleasure of guidance and welcome from Julia Herschensohn, the departmental chair, whose

enthusiasm for an interdisciplinary computational linguist like me spared me a number of

administrative ordeals, some of which I’ll probably never know about (and I am grateful

to Julia for that). Richard Wright and Alicia Beckford-Wassink were happy to let me be

an “engineering guy” in a room full of empirical linguists. Fellow students Bill, David, and

Scott reminded me from the very beginning that having spent time in industry does not

disqualify one from still studying linguistics. Lesley, Darren, Julia, Amy, and Laurie remind

me whenever I see them (which is often online rather than in person!) that linguistics can

be fun, whichever corner of it you live in.

Over the last two years, I have had the privilege of being hosted at the Speech Technology

and Research (STAR) laboratory at SRI International in Menlo Park, California. I began

my study there as part of the DARPA GALE project, on which SSLI and the STAR lab

collaborated. STAR director Kristin Precoda graciously allowed me to use office space

and come to lab meetings, even after that project ended, while I finished my dissertation.

Dimitra, Wen, Fazil, Jing, Murat, Luciana, Martin, Colleen and Harry, support staff Allan

and Debra, and fellow SSLI alumni Arindam Mandal and Xin Lei also hosted and oriented

me during my time at SRI. All of them have been pleasant hosts and supportive colleagues.

I am doubly grateful that they tolerated my poor attempts at playing Colleen’s guitar in

the break room.

I have had fruitful and enjoyable collaborations with students and faculty beyond UW


and SRI in my time in the UW graduate program: I am pleased to have explored interesting

computational linguistics research with Matt Lease, Brian Roark, Mark Johnson (who was

also my first syntax professor!), Mary Harper, and Matt Snover, among many others. I re-

ceived support and software guidance from John DeNero, Chris Dyer and Eugene Charniak,

again, among others. I am indebted to them all.

About a year before completing this dissertation, I began part-time work at Wordnik. I

am grateful to Erin McKean for offering me employment thinking about words and language

even while I finished this dissertation, and for allowing me to work less than half-time while

I finished up the thesis. This was offered with far less grumbling than I deserved. I am

lucky, too, to have intelligent, funny, talented co-workers there: Tony, John, Robert, Angela,

Russ, Kumanan, Krishna and Mark continue to be a pleasure to work with and work for.

Of course, I had little chance to complete this work without support from an amazing

troupe of supportive friends in many locations. Matt, Shannon, Kristina, Maryam, Lauren

Neil, Ben, Trey, Rosie, and others have held out from the wild world of the Internet. In

San Francisco, I am happy to have found community with Nancy, Heather, Jen, Susanna

and Derek, all holding on for Wisdom and for my success. Jim and Fiona, William and Jo,

Eldan and Melinda, Chris and Miriam, Alex and Kirk, Johns L and A, and many others

support me with love and wisdom from Seattle. Finally, I am lucky to have been supported

all along the way by my parents, Mickey and Henry; by my brother Daniel, and, most of

all, by my wife Dorothy Lemoult, whom I met in Seattle in my second year of the program.

Since the day we met, Dorothy has seen me as a better person than even I believed myself

to be; to be the object of that kind of fierce love is the best way to be alive.

I have received funding for my work from the University of Washington, the National

Science Foundation, SRI International, and the Defense Advanced Research Projects Agency.

Finally, a framing comment: I was supported in the process of creating this dissertation

by a community that will undoubtedly be under-represented by any attempt to list everyone,

especially this one. To all of you I’ve overlooked or omitted, please forgive me.


DEDICATION

For the pursuit of a life of love, play, and inquiry;

For my partner, my ally, my friend, my lover;

For what we have already and for what we make together;

For Dorothy.


Chapter 1

INTRODUCTION

Parsing, or extracting syntactic structure from text, is an appealing process to lin-

guists studying the grammatical properties of natural language: parsing is an application

of syntactic theory. For non-linguists, including many natural-language engineers, it is not

necessarily of immediate practical use. Engineers and other users of language technology

have generally found word sequences (as in writing) to be a more tractable input and out-

put, and traditional evaluation measures for their tasks have not considered any linguistic

structure beyond the word sequence in their design.

While some natural language applications have embraced parsing at their core (e.g. infor-

mation extraction, which generally begins from parsed sentence structures), this dissertation

applies parsers to two other domains: automatic speech recognition (ASR) and statistical

machine translation (SMT). In evaluation, both of these natural-language processing tasks

traditionally use measurements that evaluate using only matches of words or adjacent se-

quences of words (N -grams) against a reference (human-generated) output. In ASR, parsing

features and scores have been explored for improved modeling of word sequences, but these

approaches have not been widely adopted. Similarly, although a few SMT systems use a

parse tree in parts of decoding, parse structures are also not widely adopted in SMT. For

example, statistical word-alignment, a core internal technology for SMT, generally uses no

parse information to hypothesize links between source- and target-language words.

This dissertation explores the incorporation of parsing into representations of language

for natural language processing, particularly for components that have traditionally consid-

ered only the word sequence as input and output. This work takes two related approaches:

exploring new opportunities to bring the information provided by a parser to bear within

the traditional (syntactically-uninformed) approaches to these natural-language tasks, and


exploring the construction of new, parser-informed automatic evaluation measures to guide

the behavior of these systems in directions that lead to qualitative improvements in results,

as judged by human assessors.

1.1 Evaluating the word sequence

This work focuses on two natural-language processing applications: speech recognition and

machine translation. The output of speech recognition is a word sequence transcription

hypothesis; the output of a machine translation system is a word sequence translation

hypothesis. In each case, the usual approach to evaluation is to compare the transcript (or

translation) hypothesis to a reference transcript (or translation).

Using the undecorated word-sequence as an interface among natural language systems

may sometimes introduce surprising behaviors in evaluation. A word sequence is a very

shallow representation of the linguistic structure of language. This representation is almost1

completely devoid of theoretical baggage: no theoretical training is required for language

users (or machines) to count over words and compare them for identity.

Speech transcript quality, for example, is ordinarily measured by word error rate

(WER), which is defined over hypothesis transcript h and reference transcript r as:

$$\mathrm{WER}(h;r) = \frac{\mathrm{insertions}(h;r) + \mathrm{deletions}(h;r) + \mathrm{substitutions}(h;r)}{\mathrm{length}(r)} \qquad (1.1)$$

where insertion, deletion and substitution error counts are calculated through a Levenshtein

alignment between reference and hypothesis that minimizes the total number of errors.
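
To make the computation concrete, the following is a minimal sketch of WER in Python; the whitespace tokenization, the function name, and the table-based Levenshtein alignment are illustrative choices rather than the implementation of any particular scoring toolkit.

```python
def wer(hyp, ref):
    """Word error rate: (insertions + deletions + substitutions) / length(ref),
    with error counts taken from a Levenshtein alignment over word tokens."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = minimum number of edits between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                               # delete all of r[:i]
    for j in range(len(h) + 1):
        d[0][j] = j                               # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(r)][len(h)] / len(r)
```

Applied (after case normalization) to the two hypotheses in table 1.1 below, this function returns 2/9 ≈ 0.22 for both.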

Automated methods like WER facilitate the optimization and evaluation of natural

language processing technology, because they can report the quality of a hypothesis without

human intervention, given only a previously-generated reference. For these optimization

and evaluation processes, though, the automatic measures should ideally be consistent with

human judgements of quality.

Word-sequence evaluation measures, however, do not always match human judgements

about quality. For example, they rarely have any notion of centrality: no aspect of WER

1 Chinese and several other written languages do not separate words in text, but there is still high agreement among literate speakers about the character sequence. Character sequences, rather than word sequences, are thus usually used for evaluation in Chinese speech recognition.


Table 1.1: Two ASR hypotheses with the same WER.

             Hypothesis                                                   WER
  Reference  People used to arrange their whole schedules around those     —
  (a)        people easter arrange their whole schedules around those     0.22
  (b)        people used to arrange their whole schedule and those        0.22

Table 1.2: Word-sequences not considered to match by naïve word-sequence evaluation

  The man saw the cat.           The cat was seen by the man.
  The diplomat left the room.    The diplomat went out of the room.
  He quickly left.               He left quickly.
  He warmed the soup.            He heated the soup.
  optimize                       optimise
  don’t                          do not
  because                        cuz

captures the intuition that some words are more important to the sentence than others.

Table 1.1 considers two hypotheses that are projected to the same distance (WER = 0.22)

by the WER metric. In table 1.1, hypothesis (a) and hypothesis (b) have equal WER, but

(a)’s substitution is on a more central sequence (the main verb used to), while (b)’s word

errors are on a grammatical affix (schedule instead of schedules) and an adjunct adverbial

(around those). One indicator of the centrality of used to is that (b)’s substitution causes

little adjustment to the overall structure of the sentence, where (a)’s substitution leaves (a)

with no workable parse structure other than a fragment.

Conversely, table 1.2 presents some example word sequences that a human evaluator

might reasonably consider equivalent (for some evaluation tasks), and which a naïve word-

sequence evaluation would score as different. To capture any of these matches, the evaluation

sequence must be able to find a projection of the word sequences such that they may be

found equivalent. The last two pairs in table 1.2 are usually handled by normalization tools,


but the others are usually ignored: with the exception of contractions, case normalization

and sometimes spelling normalization, most evaluations consider only exact matches over

sections of the word-sequence, and treat all words as equally important.

Evaluation measures like WER (or extensions using N -grams) use only surface word

identity and word adjacency in their measurements. These measures incorporate neither a

notion of centrality nor argument structure, but individual words’ roles in the meaning of a

sentence are determined by their relationship to other, not necessarily adjacent words. It is

the central contention of this work that extending our measurements and evaluations of the

word sequence to include a deeper representation of linguistic structure provides benefits to

both linguistic and engineering approaches to natural language.

1.2 Using parse information within automatic language processing

The core theme of this work is the use of automatically-derived parse structure to improve

the performance and evaluation of language-processing systems that have generally used

only word-sequence level measures.

Parse decorations on the word sequence can provide benefits to these systems in these

two ways:

• parse decoration offers a new source of structural information within the models that

go into these systems, providing features from which the models may derive more

powerful hypothesis-choice criteria, and

• parse decoration enables new target measures, for use in system tuning and/or eval-

uation of the overall performance of a system.

Both of these techniques are used in this dissertation in ASR and SMT applications. For

ASR systems, this work explores using parse structure for optimization towards both WER

and SParseval (an evaluation measure for parses of speech transcription hypotheses). For

SMT systems, this work explores using parse structure towards providing an evaluation

measure that correlates better with human judgement and towards the optimization of an

internal target (word-alignment).


Parse structure is not observable in transcripts or other easily-derived training data (out-

side of the relatively small domain of treebanks), which is one reason that parse-information

has not been widely adopted into some of these systems. Parser accuracy, especially on gen-

res that do not match the parser’s training data, may not be very good. This work adopts

the approach that a parser’s own confidence estimates may be used to avoid egregious

blunders, by using expectations (confidence-weighted averages) over parser predictions. A

common thread among the research directions presented here is thus the use of more than

one parse-decoration hypothesis to provide structural information about the word sequence.

Previous work on applying grammatical structure to ASR systems has focused on either

parsing a single hypothesis transcript (the parsing task) or on using a single hypothesis parse

to select a transcript (the language-modeling task). By exploring the joint optimization of

parse and transcript hypotheses (chapter 3), this work demonstrates the utility of each to

the other. It frames the parse-decoration as a source of structural features of the hypothe-

ses, to be used in reranking hypotheses. In this approach, WER-optimization is improved

by including information from multiple parse hypotheses, and parse-metric optimization

is improved by comparing multiple parse hypotheses over multiple transcript hypotheses.

Because many NLP tasks explicitly use parsing or chunking, or have verb-dependent

processing, the parse metric is often a better choice for word transcription associated with

NLP tasks.

After considering parsing as an ASR objective, we turn to incorporation of parse dec-

oration towards SMT tasks, beginning by considering SMT evaluation (chapter 4). SMT

evaluation measures have traditionally used only word-sequence information (e.g., measur-

ing the precision of n-grams against a reference translation). This work explores the use

of parsing dependency structure to provide a syntactically-sensitive evaluation measure of

the translation hypotheses. Parse structure, here, is represented as an expectation over

dependency structure (using the multiple-parse hypotheses approach suggested above), and

this work demonstrates that evaluations informed by parse-structure correlate more closely

with human judgements of translation quality than the traditional (word-sequence based)

metrics.

Previous work on applying parsers to SMT has focused mostly on parsing for reordering


source language text or within decoders. A third limb of the work presented here (chapter 5)

explores the use of parsers in improving translation word-alignment (an internal component

of SMT). In this approach, parse-decoration is treated as labels on source-language spans,

and this information is applied to selecting better machine translation word-alignments, an

SMT task that generally uses only word-sequence information. In this work, we explore

the coherence properties of the parse-annotated spans, finding some span-classes that tend

to be coherent, in the sense that a contiguous sequence of source language words is not

broken up in translation. This syntactic coherence is used to guide the combination of a

precision-oriented and recall-oriented automatic alignment.

By exploring the application of parse decoration to word sequences, this work offers several pieces

of evidence for new directions in language-processing work. Word sequences are not always

the best way to evaluate the performance of natural language processing systems; gram-

matical structure (from parsing) is in fact a useful source of information to these other

natural-language processing systems, even when used as a component in evaluation (in ma-

chine translation). As part of those results, this work offers new reasons to use and improve

work in syntactic parsers.

1.3 Overview of this work

The dissertation’s structure is as follows: Chapter 2 covers the shared background material:

statistical parsing, and schematic overviews of the operation of ASR and SMT systems.

To accommodate the diversity of corpora and applications, some discussion of background

material and related work is deferred to the appropriate chapter, rather than covering all

background materials in chapter 2. Chapters 3–5 present the prior work, new methods and

experimental results of each of the three applications explored in this thesis.

Chapter 3 applies parsing to automatic speech recognition on English conversational

speech, and shows that information derived from parse structure offers improvements on

WER. In addition, when the ASR/parsing pipeline is directed to target a parse-quality

measure designed for speech transcripts, not only does the pipeline perform better on that

measure but it selects qualitatively different word sequences, reflecting the effect of parse

structure (and its evaluation) on speech recognition.


Chapter 4 proposes a new evaluation measure, Expected Dependency Pair Match (EDPM)

for machine translation evaluation. EDPM is a measure of parse-structure similarity be-

tween hypothesis and reference translations. Experiments in this chapter correlating EDPM

with human and human-derived judgments of translation quality show that EDPM surpasses

popular word-sequence-based evaluation measures and is competitive with other newly-

proposed metrics that rely on external knowledge sources.

Chapter 5 focuses on Chinese-English parallel-text word alignment, an internal com-

ponent of machine translation that also traditionally ignores structural information. This

chapter applies parsing to the Chinese side of the parallel text, and introduces translation

coherence, which is a property of a source span and an alignment. The work in this chap-

ter explores the utility of coherence in selecting good alignments, examines where those

coherence measures break down, and shows that parse structure information is useful in

selecting regions where two alignment candidates may be combined to improve alignment

recall without hurting alignment precision.

Chapter 6 concludes with a summary of the key contributions of this thesis, which include

both application advances and new understanding of general methods for leveraging parse

decorations. It further suggests future directions of research, in which parse-decoration may

be applied in new ways to machine-translation, speech recognition, and evaluation methods.


Chapter 2

BACKGROUND

This chapter provides an overview of the natural-language processing technologies that

this dissertation rests upon. The next section (2.1) provides some background on statistical

syntactic parsing and describes the statistical syntactic parsers in use in this work. The

subsequent section (2.2) explains the framework for n-best list reranking used in several

parts of this work. The following sections (2.3 and 2.4) describe the general framework of

the two applications (speech recognition and statistical machine translation, respectively)

to which this work applies those rerankers and parsers.

2.1 Statistical parsing

Statistical parsing serves as the method of word sequence decoration for all of the research

proposed in this work. This section reviews the key decorations available from a statistical

parser, considers the strengths and weaknesses of the probabilistic context-free grammar

(PCFG) paradigm, and discusses the training and evaluation of such parsers.

2.1.1 Constituent and dependency parse trees

The parse decorations on word sequences used here include both dependency structure

and hierarchical spans over word sequences. Hierarchical spans are known as constituent

structures; in these trees, span labels nest to form a hierarchy (a tree) of constituent spans;

each span is labeled with the phrase class (e.g. np or vp) that describes its content. The

entire segment is labeled with a root span, which is usually coterminous with a single s

spanning the sentence.

A dependency structure, by contrast, labels each word with a single dependency link to

its “head”, with a label representing the nature of the dependency. One word (usually the

main verb of the sentence) is dependent on a notional root node; all the other words in


the sentence depend on other words in the sentence.

These two representations of grammatical structure may be reconciled in a lexicalized

phrase representation, which marks one child subspan as the head child of each span. If head

information is ignored, this representation is equivalent to the span label representation. The

head word of each phrase-constituent φ is recursively defined as the head word of φ’s head

child or, if φ contains only one word, that word. A constituent structure is lexicalized

when each constituent is additionally annotated with its head word; one may read either

constituent spans or dependency structures off of these lexicalized constituent structures.

Figure 2.1 shows a lexicalized constituent structure and the dependency tree and constituent

tree that may be derived from it.

The arc labels on the dependency structure shown in figure 2.1 are derived by extraction

from the headed phrase structure by concatenating two labels A/B: A is the lowest con-

stituent dominating both the dependent and the headword and B the highest constituent

dominating the dependent. This approach for arc-labeling works well for a language like

English (or Chinese) with relatively fixed word order.
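
The conversion from a headed constituent tree to word-to-word dependencies can be sketched in a few lines; the Node representation and the function names below are illustrative rather than the data structures of any parser cited here, and the A/B arc-labeling step is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                      # constituent label, or the word itself at a leaf
    children: list = field(default_factory=list)
    head_child: int = 0             # index of the head child (unused at leaves)

def head_word(node):
    """The head word of a constituent is the head word of its head child;
    a leaf is its own head word."""
    return node.label if not node.children else head_word(node.children[node.head_child])

def dependencies(node, deps=None):
    """Read (dependent word, head word) pairs off a lexicalized constituent tree:
    each non-head child of a constituent depends on that constituent's head word."""
    if deps is None:
        deps = []
    h = head_word(node)
    for i, child in enumerate(node.children):
        if i != node.head_child:
            deps.append((head_word(child), h))
        dependencies(child, deps)
    return deps
```

On the tree in figure 2.1, such a traversal yields pairs like (I, was), (acquainted, was), and (with, acquainted).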

2.1.2 Generating parse hypotheses

To provide parse-decoration, we desire a parser/decorator which generates n-best lists of

parse hypotheses over input sentences.1 Such an n-best list may be useful in reranking the

parse hypotheses (see section 2.2 below) or other applications which benefit from access to

the confidence of the parser. We require that the parsers used to generate these n-best lists

are adaptable to new domains, robust, and probabilistic. Retrainable parsers are desired

because the domain over which this work predicts parses varies widely with the task: parse

structures over speech, as in chapter 3, are qualitatively different than parse structures over

edited text (e.g. the news-text translation in chapter 5). Robustness, the reliable generation

of predictions for any input word sequences, is desirable because the parser is to distinguish

among machine-generated word sequences (the output of ASR and SMT), which are not

1 Packed parse forests, the combined representation of the parser search space used by e.g. Huang [2008], represent a speedy and sometimes elegant alternative to n-best lists, but are constrained by the forest-packing to use only those features that may be computed locally in the tree. This work uses n-best lists instead for their easy combination and for freedom from the tree-locality constraint.


[Figure: a lexicalized phrase structure for the example sentence “I was personally acquainted with the people”, shown together with the constituent tree and the labeled dependency tree (arc labels such as root/s, s/np, vp/adjp, adjp/rb, adjp/pp, pp/np, and np/dt) derived from it.]

Figure 2.1: A lexicalized phrase structure and the corresponding constituent and dependency trees. Dashed arrows indicate the upward propagation of head words to head phrases. The lexicalized constituent tree encodes both the constituency tree and the dependency relations. The dependency tree may be understood as the link to the headword of the governing constituent.


always well-formed either due to recognizer errors or speaker differences. Since we use the

parser to predict fine-grained information to make decisions about the word sequences, the

ability to generate parse structure over all (or nearly all) the candidate inputs is important.

Probabilistic scoring is required not only to predict the order of the n-best list, but to

compute the relative contribution of each parse hypothesis to the n-best list. All else being

equal, preferred parsers are also fast.

While unification grammars, e.g. head-driven phrase structure grammar [HPSG, e.g.

Pollard and Sag, 1994] and lexical functional grammar [LFG, e.g. Bresnan, 2001] produce

complex and linguistically-informed parse structures that also may be interpreted as headed

phrase grammars, existing grammars in these formalisms do not reflect a match to a training

set, nor do they have complete coverage (for out-of-domain or ill-formed word sequences,

they often produce no structure at all). Most problematic for the research explored here

is that state-of-the-art unification grammars [e.g., Flickinger, 2002, Cahill et al., 2004] do

not provide parse N -best lists with the probability of each parse in the list, which is used

in some of our work for taking expectations over parse alternatives.

Instead of a unification grammar like the ones above, this work uses statistical proba-

bilistic context-free grammar (PCFG) parsers. These sorts of parsers (e.g. Collins [2003],

Charniak [2000], and Petrov and Klein [2007]) use lexical and span-label information from

a training set of hand-labeled trees known as a treebank, e.g. the Penn Treebank of English

[Marcus et al., 1993], and construct syntactic structures on new sentences (in the same

language) consistent with the grammar inferred from these training sentences. Because

they are probabilistic, these parsers may return not only a “best” parse analysis according

to its model, but also a list of n analyses reflecting the n-best parse structures that this

parser (and its grammar) assign to the input sentence. Each carries a probabilistic weight

p(t, w) of the likelihood of a tree t with leaves w. The PCFG estimation makes the context-

free assumption: that the probability of generating the tree is composed of a combination

of probability estimates from tree-local decisions. By constraining the model to use only

tree-local decisions, PCFG models may use dynamic-programming techniques to efficiently

search a very large space of possible tree structures.
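
The context-free decomposition may be illustrated with a small sketch; the tuple encoding of trees and the rule_logprob table are assumptions made for the example, not the internals of the parsers cited above.

```python
def tree_log_prob(tree, rule_logprob):
    """Score log p(t, w) under the context-free assumption: the log-probability of a
    tree is the sum of the log-probabilities of the local rewrite rules it uses."""
    label, children = tree          # a tree is (label, [subtrees]); a leaf is (word, [])
    if not children:
        return 0.0                  # a word is scored by the rule that emits it
    rule = (label, tuple(child[0] for child in children))
    return rule_logprob[rule] + sum(tree_log_prob(c, rule_logprob) for c in children)
```

Because each term depends only on a node and its immediate children, the same decomposition is what allows dynamic-programming decoders to share work across the very large space of candidate trees.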


2.1.3 Treebanks for the PCFG-derived parser

Parsers of this nature are constrained by the availability and structure of treebanks (from

which to learn a grammar). The Penn treebank [Marcus et al., 1993], for example, encodes

span labels over a collection of edited English text (mostly the Wall Street Journal); the

availability of this labeled set has enabled the development of the statistically-trained parsers

for English mentioned above. Recent work to construct treebanks in languages other than

English, e.g. in Chinese [Xue et al., 2002], and in domains other than edited English text

[e.g., Switchboard telephone speech: Godfrey et al., 1992] has made these parsers much

more broadly accessible for use in applications with broader focus than parsing itself. In

particular, Huang and Harper [2009] have built a parser tuned for certain genres of Mandarin

Chinese. Certain aspects of this research depend on the power of this parser to handle

Mandarin news text, despite the relative lack of data (compared to English).

Though these parsers do not explicitly include head structure in their output (to match

the treebanks on which they were trained), all of the state-of-the-art PCFG parsers in-

fer head structure internally, most using Magerman [1995] style context-free headfinding

rules. Recovering the head structure from their output (also using Magerman [1995] style

headfinding) is fast and deterministic, and allows for an easy conversion, when dependency

structure is called for, from treebank-style span trees to headed span trees and thence to

dependency structure.

2.1.4 Intrinsic evaluation of statistical parsing

Statistical parsers are usually evaluated by comparing the hypothesized parse t_hyp to a

reference parse t_ref. The standard test uses parseval [Black et al., 1991], an F-measure

over span-precision and span-recall, which was developed for comparison to the Wall Street

Journal treebank [Marcus et al., 1993].

The parseval technique assumes that the hypothesis proposed shares the same word

sequence; that is, parseval is only well-defined when w_hyp = w_ref and the basic (sen-

tence) segmentation agrees. If the division of those word sequences into segments differs,

parseval is not well-defined. Kahn et al. [2004] addressed this on reference transcripts of


conversational speech by concatenating all reference segment transcriptions from a single

conversation side and computing an F -measure based on error counts at the level of the

conversation side.

In speech applications, however, it is not reasonable to assume that the reference tran-

script is available to the parser, so scoring must compare (t_hyp, w_hyp) to (t_ref, w_ref) instead.

In this situation, when comparing parses over hypothesized speech sequences, parse qual-

ity may instead be measured using SParseval [Roark et al., 2006], which computes an

F-measure over syntactic dependency pairs derived from the trees.
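
A minimal sketch of this style of scoring follows, assuming the dependency tuples have already been extracted from the hypothesis and reference trees (for example, by a head-finding conversion); it illustrates the F-measure computation and is not the SParseval tool itself.

```python
from collections import Counter

def dependency_f1(hyp_deps, ref_deps):
    """F-measure over dependency tuples, e.g. (dependent, head, label):
    the harmonic mean of precision and recall of matched tuples."""
    hyp, ref = Counter(hyp_deps), Counter(ref_deps)
    matched = sum((hyp & ref).values())                    # multiset intersection
    precision = matched / sum(hyp.values()) if hyp else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```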

Research on parse quality over transcription hypotheses, however, has been very limited.

It has largely been restricted to parsing only the ASR engine’s best hypothesis, e.g., Harper

et al. [2005], which sought to improve the automatic segmentation of ASR transcripts into

utterance-level segments. Approaches like this one that use only one ASR hypothesis ignore

the potential of more parse-compatible alternative transcription hypotheses available from

the ASR engine. Further discussion of the SParseval measure over speech is included in

the background section of chapter 3, which explores parsing speech and speech transcripts.

2.2 Reranking n-best lists

As discussed above, PCFG parsers make strong assumptions about locality in order to

efficiently explore the very large space of possible trees. However, these independence as-

sumptions also prevent the use of feature extraction that crosses that locality boundary. For

example, the relative proportion of noun phrases to verb phrases may be a useful discrimi-

nator among good and bad trees, but this statistic is not computable within the context-free

locality assumptions that go into the parser itself.

An approach to dealing with this challenge is to first generate an n-best list of top-

ranking candidate hypotheses, and then apply discriminative reranking [Collins, 2000]

to re-score the set of candidates (incorporating the original scores as one of the features).

The features available to n-best reranking need not obey the locality assumptions that were

used in generating the candidate list in the first place: rather, the features may be holistic

because they are computed exhaustively (against every member of the n-best list) since n

is much smaller than the original search space. Collins and Koo [2005] and Charniak and


Johnson [2005] use this approach to achieve roughly 13% improvements in parseval F

performance on parsing Wall Street Journal text.

2.2.1 Reranking as a general tool

Reranking is of general use, and has been applied elsewhere before being applied to parsing.

In ASR, for example, it was applied to transcription n-best lists to lower word error rate

long before its use in parsing [e.g., Kannan et al., 1992], and Shen et al. [2004] introduce

the use of discriminative reranking in SMT work. Discriminative reranking is a form of

discriminative learning, which seeks to minimize the cost function of the top hypotheses.

Unlike generative models, which learn their parameters from counting occurrences in train-

ing data, n-best rerankers must be trained on hypotheses with explicit evaluation metrics

attached. Reranking has one important extension from the general case of discriminative

learning: in reranking, the ranker must learn which features separate the optimal candidates

from the suboptimal ones by comparing elements only within an n-best list, rather than

pooling all positive and negative examples to seek a margin. One way to do this (discussed

below in section 2.2.2) is to divide a candidate pool into ranks and attempt to separate

each rank from the other ranks. In parsing, for example, it is the relative difference in

(e.g.) prepositional phrase count among candidate parses that is used in reranking, not the

absolute count; candidate parse trees must be compared to other candidates derived from

the same n-best list. The generative component produces overly optimistic n-best lists over

its training data, so in order to provide reranker training with realistic N -best lists from

which to learn weights, the reranker needs to be trained using candidate parses from a data

set that is independent of both the generative component’s training and the evaluation test

set. Because of the limited amount of hand-annotated training and evaluation data, it is

not always preferable to sequester a separate training partition just for this model. Instead,

one may adopt the round-robin procedure described in Collins and Koo [2005]: build N

leave-n-out generative models, each trained on (N−1)/N of the partitioned training set, and run

each on the subset that it has not been trained on. The resulting candidate sets are passed

to the feature-extraction component and the resulting vectors (and their objective function


values) are used to train the reranker models.
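
A sketch of that round-robin procedure is given below; train_model and generate_nbest stand in for the generative model's training and n-best decoding, and the fold construction is illustrative.

```python
def round_robin_candidates(data, n_folds, train_model, generate_nbest):
    """Leave-one-fold-out candidate generation: each fold is decoded by a model
    that never saw it, so the reranker trains on realistically imperfect n-best lists."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    candidates = []
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_model(train)            # generative model trained without fold i
        candidates.extend(generate_nbest(model, x) for x in held_out)
    return candidates
```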

As already indicated, n-best reranking need not be applied only to parsing. As the

following sections will show, it is useful in other complex natural-language processing tasks,

where the n-best list generator (the generative stage) must obey strong independence as-

sumptions for the sake of efficiency, but the final result may be re-evaluated with new

features (classes of information) applied to the discriminative stage. In addition to using

reranking to improve parse quality, this work also uses n-best reranking as a framework for

applying syntactic information to other tasks.

2.2.2 Reranker strategy used in this work

Within this work, n-best list reranking is treated as a rank-learning margin problem: within

each segment, the task is to separate the best candidate from the other candidate hypotheses.

We adopt the svm-rank toolkit [Joachims, 2006] as our reranking tool. To prepare data for

training this toolkit, the approach adopted here selects the oracle best score on the objective

function φ∗ from the n-best list and converts the objective function into an objective loss

with regard to the oracle for all hypotheses t_i, e.g., φ_l(t_i) = |φ∗_p − φ_p(t_i)| for the

parseval objective φ_p. To interpret φ_l as a rank function, we assign ranks to training candidates that

focus on those distinctions near the optimal candidate, as follows:

$$\operatorname{rank}(t_i) = \begin{cases} 1 & \phi_l(t_i) \le \varepsilon \\ 2 & \varepsilon < \phi_l(t_i) \le 2\varepsilon \\ 3 & 2\varepsilon < \phi_l(t_i) \end{cases} \qquad (2.1)$$

where ε is a small value tuned empirically so that ranks 1 and 2 have a small proportion

of the total number of members in the candidate set. Since svm-rank uses all pairwise

comparisons between candidates of different rank, and ranks 1 and 2 have very few members,

this approach reduces the number of comparisons from a square in |C| to linear in |C|,

(where |C| represents the number of candidates in the set) while still focusing the margin

costs towards the best candidates.
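
A small sketch of the rank assignment in equation 2.1 follows; the function name and the ε value shown in the usage comment are illustrative only.

```python
def assign_ranks(losses, eps):
    """Map oracle-relative losses phi_l(t_i) to the three ranks of equation (2.1):
    1 if within eps of the oracle, 2 if within 2*eps, 3 otherwise."""
    return [1 if loss <= eps else 2 if loss <= 2 * eps else 3 for loss in losses]

# e.g. losses = [abs(oracle_score - s) for s in candidate_scores]; assign_ranks(losses, eps=0.005)
```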


Figure 2.2: The models that contribute to ASR.

2.3 Automatic speech recognition

Automatic speech recognition (ASR) is the process of automatically creating a transcript

word sequence w_1 . . . w_n from recorded speech waveform α. The literature in this discipline

is enormous, and the survey here skims only the surface, to orient the reader to the basic

models in play in state-of-the-art systems in ASR and to provide context for the contribu-

tions of this work.

2.3.1 A schematic summary of ASR

Speech recognition systems are constructed from multiple models. As illustrated in fig-

ure 2.2, the usual expression of these models (e.g. in the SRI large-vocabulary speech recog-

nizer [Stolcke et al., 2006]) is as a combination of multiple generative models, which operate

together to score possible hypotheses that are pruned down to a list of the top n word

sequence hypotheseses. In large vocabulary systems, the resulting list is typically re-scored

by discriminative components that reorder that list.

Among the generative models, acoustic models p_am(α|φ) provide a score of acoustic

features of speech α (typically cepstral vectors) given pronunciation φ; pronunciation models


provide a score p_pm(φ|w) of pronunciation-representation φ given word w; and language

models (LMs, e.g. Stolcke [2002]; see Goodman [2001]) give a score p_lm(w_1, · · · , w_n) of the

word sequence w_1, · · · , w_n. In decoding, all three of the models described above operate

on a relatively small local window: p_am(·) uses phone-level contexts, p_pm(·) uses the word

in isolation or with its immediate neighbors, and p_lm(·) most often uses n-gram Markov

assumptions, computing word sequence likelihoods from only the most-recent n − 1 words.

The most typical value for n is three, also known as a “trigram” model, and n rarely exceeds

four or five, due to the computational explosion in storage costs required.
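
To illustrate the n-gram Markov assumption, here is a minimal sketch of trigram scoring; the logprob table and the sentence-boundary padding are assumptions made for the example, and real language models add smoothing and backoff for unseen n-grams.

```python
def trigram_log_prob(words, logprob):
    """Trigram language model score: log p(w_1 .. w_n) approximated as the sum of
    log p(w_i | w_{i-2}, w_{i-1}), so each word conditions only on the two preceding words."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    return sum(logprob[(padded[i - 2], padded[i - 1], padded[i])]
               for i in range(2, len(padded)))
```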

The rescoring component F(α, φ, w_1, · · · , w_n), by contrast, may use all of the above

scores and also extracts additional features of an utterance- or sentence-length hypothesis

from any of the values mentioned above for use in re-ordering the n-best list. Even with-

out the feature-extraction F (·), the rescoring component may change the relative weight

of the contribution of the upstream models, but F (·) is often used to extract long-distance

(non-local) features that would be expensive or impossible to extract in the local-context

decoding that the other models provide. An exhaustive survey of prior work using rerank-

ing to capture non-local information in ASR is impractical, but the sorts of long-distance

information exploited include topic information, as in Iyer et al. [1994] or more recently

Naptali et al. [2010], or trigger information [Singh-Miller and Collins, 2007]. These model

long-distance effects from as far away as other sentences (or speakers!) in the same dis-

course, not with a syntactic model but with various approaches that cue the activation of

a different vocabulary subset. Another application of reranking operates by adjusting the

output of the generative model to focus on the specific error measure, as in e.g. Roark et al.

[2007]. Further discussion of the use of syntactic information in language-model rescoring

may be found in section 2.3.3.
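A common shape for such a rescoring component is a weighted combination of the upstream log scores with additional hypothesis-level features; the following sketch assumes that form (the dictionary layout and feature names are illustrative, not any cited system's interface):

    def rescore(hypotheses, weights, extract_features):
        # Re-order an n-best list by a weighted sum of log-domain scores.
        # Each hypothesis is assumed to be a dict carrying its upstream log
        # scores (acoustic, pronunciation, language model); extract_features
        # adds long-distance feature values keyed the same way as weights.
        def combined_score(hyp):
            feats = dict(hyp["log_scores"])      # e.g. {"am": ..., "pm": ..., "lm": ...}
            feats.update(extract_features(hyp))  # e.g. topic or trigger features
            return sum(weights.get(name, 0.0) * value
                       for name, value in feats.items())
        return sorted(hypotheses, key=combined_score, reverse=True)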

2.3.2 Evaluation of ASR

Evaluation — and optimization — of speech recognition and its components are carried out

with word error rate (WER), a measure that treats words (or characters) equally, regardless

of their potential impact on a downstream application, as discussed in section 1.1; for


example, function words are given equal weight with content words. One exception is that

filled-pauses are, in some evaluations, e.g. GALE [DARPA, 2008], optionally inserted or

deleted without cost when evaluating speech.
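For reference, WER is the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch (a textbook edit-distance dynamic program, not the NIST scoring tool used later in this work; a non-empty reference is assumed):

    def word_error_rate(reference, hypothesis):
        # WER = (substitutions + deletions + insertions) / reference length,
        # computed with a standard Levenshtein dynamic program over words.
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                      # deleting all reference words
        for j in range(len(hyp) + 1):
            d[0][j] = j                      # inserting all hypothesis words
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitute = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(substitute, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)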

A few larger projects that include ASR as a component have suggested extrinsic evalua-

tion methods: in dialog systems, for example, ASR performance is evaluated along with the

other components with a measure of action accuracy (e.g. in Walker et al. [1997] and Lamel

et al. [2000]). In the 2005 Summer Workshop on Parsing Speech [Harper et al., 2005], speech

recognition was evaluated in the extrinsic context of a downstream parser, but only a sin-

gle transcription hypothesis was used. Al-Onaizan and Mangu [2007] explored adjustments

to ASR hypothesis selection in an ASR-to-MT pipeline to allow relatively more insertions

(keeping the WER constant), but found that this made little difference in automatically-

evaluated MT performance.

As an alternative to evaluating ASR with WER or evaluating it directly in the context

of a downstream task, one may instead choose to optimize the ASR towards an improved

form of some intermediate representation (neither the immediate word sequence nor a fully-

extrinsic representation). Hillard et al. [2008], for example, experimented with selecting for

high-SParseval Chinese character-sequences for a downstream Chinese-to-English SMT

system (instead of selecting low character error rate (CER) hypotheses). In follow-up work,

Hillard [2008] found improvement on the automatic SMT measures for unstructured (broad-

cast conversation) genres of speech, though not for structured speech (broadcast news).

Additionally, they found that SParseval measurements of source-language transcription

were better correlated with human assessment of MT performance in the target language

than CER measurements. Intrinsic measures for ASR, however, are almost entirely limited

to WER or its simpler alternative for Chinese, CER.

Chapter 3, which uses parse decoration to rerank ASR transcription hypotheses, evalu-

ates ASR with WER and also with the SParseval parse-quality measure.


2.3.3 Parsing in ASR

Efforts to include parsing information in ASR systems have used the parser as an extra in-

formation source for selecting word sequences in speech recognition. This section highlights

a few parser-based language-models and reranking models that have been used in ASR,

demonstrating improvements in both perplexity and WER over n-gram LM baselines.

The structured language model [Chelba and Jelinek, 2000] is a shift-reduce parser that

conditions probabilities on preceding headwords. When interpolated with an n-gram model,

it achieved small improvements in WER on read speech from the Wall Street Journal corpus

and on conversational telephone speech from the Switchboard [Godfrey et al., 1992] corpus.

The top-down PCFG parser used by Roark [2001] achieved somewhat larger improvements

over a trigram on the same set (though the baseline it was compared to was worse than the

baseline in Chelba and Jelinek [2000]). Charniak [2001] implemented a top-down PCFG

parser that conditions probabilities on the labels and lexical heads of a constituent and

its parent. In this model, the probability of a parse is modeled as the product of the

conditional distributions of various structural factors. In contrast to both the models in

Chelba and Jelinek [2000] and Roark [2001], most of these factors are conditioned on the

identity of at most one other lexical item in the tree. This relative reliance on structure

(over lexical identity) makes this model distinctly un-trigram-like. This model gets a lower

perplexity than both the Structured Language Model and Roark’s model on Wall Street

Journal treebank text.

While the details of the parsing algorithms and probability models of the above models

vary, all are fundamentally some kind of PCFG. A non-CFG syntactic language model that

has been used for speech recognition is the SuperARV model, or “almost parsing language

model” [Wang and Harper, 2002], which calculates the joint probability of a string of words

and their corresponding super abstract role values. These values are tags containing part of

speech, semantic, and syntactic information. The SuperARV achieved better perplexity and WER

results than both a baseline trigram and the Chelba and Jelinek [2000] and Roark [2001]

language models, for a variety of read Wall Street Journal corpora. It also out-performed a

state-of-the-art 4-gram interpolated word- and class-based language model on the DARPA


RT-02 conversational telephone speech evaluation data [Wang et al., 2004].

Filimonov and Harper [2009] introduce a generalization and extension of the SuperARV tagging model in a joint language modeling framework that uses very large sets of “tags”; when automatically-induced syntactic information is included in the tag set, this model is competitive with SuperARV performance on both perplexity and WER measures, but requires less complex linguistic knowledge.

One challenge for combining parsing with ASR is that parsing is ordinarily performed

over well-formed, complete sentences, while automatic segmentation of ASR output is difficult,

especially in conversational speech (where even a correct segmentation may not be a syn-

tactically well-formed sentence). Parse models of language do not perform as well on poorly-

segmented text [Kahn et al., 2004, Kahn, 2005, Harper et al., 2005]. In chapter 3, this work

goes into more depth regarding the impact of different methods of automatic segmentation

on the utility of parse decorations and success of parsing.

2.4 Statistical machine translation

Statistical machine translation (SMT) is the process of automatically creating a target-

language word sequence e1 . . . eE from a source-language word sequence f1 . . . fF . There are

non-statistical approaches to this task, e.g. the LOGON Norwegian-English MT project

[Lønning et al., 2004], but these are not the subject of this research. This section offers

an overview of the state-of-the-art in statistical machine translation, identifying the core

models and techniques that are used, the mechanisms for automatic evaluation, and where

syntactic structures are already in use.

2.4.1 A schematic summary of SMT

In SMT based on the IBM models [Brown et al., 1990] and their successors, candidate

translations are understood to be made up of the source words f , the target words e, and also

the alignment a between source and target words. The contributing components are broken

down in a noisy-channel model: a language model plm(e) scores the quality of the target

word sequence; a reordering model prm(a|e) assigns a penalty for the “reordering” performed


[Figure 2.3 diagram: the English sentence “I don’t like blue cheese .” (words e1 . . . eE) aligned word-by-word with the French sentence “Je n’ aime pas fromage bleu .” (words f1 . . . fF), with alignment links a connecting them.]

Figure 2.3: Word alignment between e and f . Each alignment link in a represents a corre-

spondence between one word in e and one word in f . There is no guarantee that e and f

are the same length (E = F ).

by the alignment, and the translation model ptm(f |e, a) provides a score for pairing source-

language word (or word-group) f with target-language word (or word-groups) e according

to alignment a. This approach formulates the translation decoding process as a search over

words (e) and alignments (a), which is typically approximated as:

\arg\max_{e} \; p(e \mid a, f) \sim \arg\max_{e} \; p_{tm}(f \mid e, a)\, p_{rm}(a \mid e)\, p_{lm}(e) \qquad (2.2)

Most current approaches to decoding do not actually use this generative model, but instead use a weighted combination of multiple translation models, including both ptm(f |e, a) and the reverse-direction ptm(e|f, a); this combination lacks the well-formed noisy-channel generative structure of equation 2.2 above but seems to work better in practice [Och, 2003].
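A sketch of that weighted (log-linear) combination follows, under the assumption that each candidate carries a dictionary of log-domain feature scores (the names and weights are illustrative; in practice the weights are tuned, e.g. by MERT as discussed in section 2.4.2):

    def translation_score(log_features, weights):
        # Log-linear candidate score: sum_k lambda_k * h_k(e, f, a), where the
        # h_k are log model scores such as log p_tm(f|e,a), log p_tm(e|f,a),
        # log p_rm(a|e), and log p_lm(e). Feature names are illustrative.
        return sum(weights[name] * value for name, value in log_features.items())

    def best_translation(candidates, weights):
        # Pick the highest-scoring candidate; each candidate is assumed to be
        # a dict with a "log_features" entry.
        return max(candidates,
                   key=lambda c: translation_score(c["log_features"], weights))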

Training the ptm(·) and prm(·) models requires many parallel sentences with alignments

between source and target words, of the form suggested in figure 2.3. Alignments, like parse

structure, are rarely annotated over large amounts of parallel text. The approach offered

by the IBM models and their descendants is to bootstrap alignment and translation models

from bitexts (corpora of parallel sentences). In general, the training of these alignment

and translation models is iterated in a bootstrap process. This bootstrap process, as im-

plemented in popular word-alignment tools [e.g. GIZA++: Och and Ney, 2003], begins

with simple, tractable models for prm(·) and ptm(·) and, as the models improve, trains more

sophisticated reordering and translation models. Later models are initialized from the align-


ments hypothesized by earlier iterations. The language model plm(e) does not participate in

this phase of the training: in a bitext, predicting plm(e) is not helpful; language models are

usually trained separately, using monolingual text. As a byproduct of the parameter-search

to improve these models, the GIZA++ toolkit produces a best alignment linking each word

in e to words in f .

Other tools exist for generating alignments (such as the Berkeley aligner [DeNero and

Klein, 2007]) and there is substantial discussion over how to evaluate and improve the

quality of these alignments. Review of this discussion is passed over here; we will return to

this literature in chapter 5.

Typical independence assumptions in the word-alignment models constrain them to word

sequence and adjacency, applying a penalty for moving words into a different order in

translation. These models for reordering penalties are usually very simple, and do not

incorporate any notion of parse decoration — instead, they assign monotonically-increasing

penalties for moving words in translation. For example, Vogel et al. [1996] uses a hidden

Markov model (HMM, derived from only sequence information) to assign a prm(·) reordering

model. Language-models in translation are also generally sequence-driven: ASR’s basic n-

gram language-modeling approach serves as an excellent baseline to model plm(e) in MT

work. Early stages in the training bootstrapping sometimes ignore even word sequence

information: GIZA++’s “Model 1” treats prm(·) as uniform and ptm(·) as independent of

adjacency information (dependent only on the alignment links themselves).
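A textbook sketch of this simplest case, EM training of IBM Model 1 lexical translation probabilities, is given below (this is not GIZA++ itself; the bitext representation and the NULL token are assumptions):

    from collections import defaultdict

    def train_model1(bitext, iterations=5):
        # EM training of IBM Model 1 lexical translation probabilities t(f|e).
        # bitext is assumed to be a list of (source_words, target_words) pairs;
        # a NULL token on the target side absorbs otherwise-unaligned source
        # words. Word order plays no role beyond the alignment links themselves.
        t = defaultdict(lambda: 1.0)          # first iteration acts like a uniform model
        for _ in range(iterations):
            count = defaultdict(float)        # expected counts c(f, e)
            total = defaultdict(float)        # expected counts c(e)
            for f_sent, e_sent in bitext:
                e_sent = ["<NULL>"] + list(e_sent)
                for f in f_sent:
                    norm = sum(t[(f, e)] for e in e_sent)
                    for e in e_sent:
                        posterior = t[(f, e)] / norm
                        count[(f, e)] += posterior
                        total[e] += posterior
            for (f, e), c in count.items():
                t[(f, e)] = c / total[e]
        return t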

For language-pairs like French-English, where word-order is largely similar, the local-

movement penalties of these simple prm(·) models usefully constrain the search space of

possible translations to those without large re-ordering: the language- and translation-model

scores will correctly handle any necessary small, local reorderings. For other language-pairs

(e.g., Chinese-English or Arabic-English), though, long-distance re-orderings are necessary,

and these models must assign a small penalty to long-distance movement, which leads to

an explosion in the search space (and a corresponding loss in translation quality).

Having bootstrapped from bitext to word-based alignments, many SMT systems (e.g.

Pharaoh [Koehn et al., 2003] and its open-source successor Moses [Koehn et al., 2007]) take

the bootstrapping farther by automatically extracting a “phrase table” from the aligned


Figure 2.4: The models that make up statistical machine translation systems

text. These “phrase-based”2 systems treat aligned chunks of adjacent words as a sort of

translation memory (the “phrase table”) which incorporates local reordering and context-

aware translations into the translation model. Entries in the phrase table are discovered

from the aligned bitext by heuristic selection of observed aligned spans. For some “phrase-

based” systems, such as Hiero [Chiang, 2005], the span-discovery (and decoding) may even

allow nested “phrases” with discontinuities.

Statistical machine translation systems thus, like ASR, use multiple models which con-

tribute together to generate (or “decode”) a scored list of possible hypotheses, as suggested

in the top half of figure 2.4. The “phrase table” incorporates some aspects of alignment

and translation models, but even when phrase tables are quite sophisticated, choosing and

assembling these phrases at run-time usually requires additional translation and alignment

models, even if only to assign appropriate penalties to the assembly of phrase-table entries.

The n-best list generated by the decoder is typically re-scored using a discriminative re-

2 “Phrase-based” SMT systems use the term “phrase” to refer to a sequence of adjacent words; these do not have any guarantee of relating to a syntactic or otherwise linguistically-recognizable phrase. Xia and McCord [2004] use “chunk-based” to refer to these systems but this expression has not been widely adopted. This work uses “phrase-based” for consistency with the literature (which describes “phrase-based” SMT and “phrase tables”), despite the infelicity of the expression.


ranking component, as outlined in figure 2.4, that takes into account the language-model,

translation-model, alignment-model and phrase-table scores already mentioned, and may

also incorporate additional features F (a, e, f) that are difficult to include in the decoding

process that generates the original n translation hypotheses. The re-ranking component

relies on an automatic measure of translation quality which is computable without human

intervention for a given hypothesis translation and one or more reference translations.

2.4.2 Evaluation measures for MT

The development of reliable automatic measures for optimization has changed the field of

statistical machine translation, by allowing the discriminative training of rescoring and re-

weighting models, such as minimum error rate training [MERT: Och, 2003], and by providing

a shared measure of success.

In MT, evaluation is a complex process, in large part because two (human) translators

asked to perform the same translation task may quite ordinarily produce very different re-

sulting strings. The challenge of accounting for allowable variability is not shared with ASR;

in ASR, two human transcribers will usually agree on most of the transcription. Instead of a

string match to a reference translation, human-assessed measures of translation quality are

traditionally broken into separate scales of fluency and adequacy to assess system quality

(whether translations are performed by human or machine) [King, 1996, LDC, 2005]. Of

course, fluency and adequacy judgements cannot be performed without a human evalua-

tor.3 Comparing system translations to reference translations allows monolingual assessors,

which reduces the cost by increasing the available pool of assessors. In many evaluations,

automatic measures compare automatic translations to these reference translations; these

automatic measures have the virtue of removing annotator variability from the evaluation

and further reducing the labor costs of assessing the system translations. For optimization

purposes (such as the MERT models and discriminative re-ranking described above), a mea-

sure that operates without human intervention is required, because the rescoring models

3 One might think that fluency and adequacy judgements require a bilingual evaluator as well, but for evaluating MT quality, a monolingual (in the target language) evaluator can compare machine and reference translations of the same text to report these judgements.


operate over hundreds (or thousands!) of sample translations of the same sentence.

The two most popular automatic metrics are BLEU [Papineni et al., 2002], a measure of n-gram precision, and the TER [Snover et al., 2006] edit distance. BLEU remains the most popular and widely-reported measure of translation quality against a reference translation (or set of reference translations). BLEU is a geometric mean of precisions over varying n-gram lengths:

\mathrm{BLEU}_n(h; r) = \sqrt[n]{\prod_{i=1}^{n} \pi_i(h; r)} \cdot \mathrm{BP}(h, r) \qquad (2.3)

where πi(h; r) reflects the precision of the i-grams in hypothesis h with respect to reference

r, and the term BP(h, r) is a “brevity penalty” to discourage the production of extremely

short (low-recall, high-precision) translations:

\mathrm{BP}(h, r) = \begin{cases} \exp\!\left(1 - \frac{|r|}{|h|}\right) & \text{if } |h| < |r| \\ 1 & \text{if } |h| \ge |r| \end{cases}

Most results are reported with BLEU4.
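A minimal single-reference BLEU sketch follows (illustrative only: real implementations handle multiple references and smooth zero n-gram counts, which a small floor crudely stands in for here; a non-empty hypothesis is assumed):

    import math
    from collections import Counter

    def bleu(hypothesis, reference, max_n=4):
        # Single-reference BLEU_n of equation 2.3: geometric mean of clipped
        # i-gram precisions for i = 1..max_n, times the brevity penalty.
        def ngram_counts(tokens, n):
            return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))
        hyp, ref = hypothesis.split(), reference.split()
        log_precision = 0.0
        for i in range(1, max_n + 1):
            h, r = ngram_counts(hyp, i), ngram_counts(ref, i)
            matched = sum(min(c, r[g]) for g, c in h.items())
            log_precision += math.log(max(matched, 1e-9) / max(sum(h.values()), 1))
        brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
        return brevity * math.exp(log_precision / max_n)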

Translation Edit Rate (TER) is an error measure like WER, which measures the oper-

ations required to transform hypothesis h into reference r:

\mathrm{TER}(h; r) = \frac{\mathrm{insertions}(h;r) + \mathrm{deletions}(h;r) + \mathrm{substitutions}(h;r) + \mathrm{shifts}(h;r)}{\mathrm{length}(r)} \qquad (2.4)

where insertions, deletions and substitutions count one per word, while shift operations

move any adjacent sequence of words from one position in h to another. Insertion, deletion,

substitution and shift error counts are calculated through an alignment between reference

and hypothesis that heuristically minimizes the total number of operations needed.

When working with multiple references, BLEU4 is defined so that its n-grams may match

those in any of the references, allowing translation variability across the multiple references,

but TER’s approach to multiple references is just to return the minimum edit ratio over

the set of references, which is less forgiving to the candidate translation.

Like word error rate for ASR, the BLEU and TER metrics use no syntactic or argument-

structure modeling to determine which words matter more: all words are treated equally.

In TER, substituting or shifting a single word incurs the same cost regardless of where the


substitution or shift happens; in BLEU, all hypothesis n-grams contribute equally to the

score of the sentence. Because of the emphasis on these automatic measures, innovations in

MT have often focused on the innovations’ effects on these measures directly, sometimes to

the point of reporting only on one of these entirely automatic measures.

Some have raised skepticism towards the focus on the BLEU and TER automatic mea-

sures on theoretical [Callison-Burch, 2006] and empirical [Charniak et al., 2003] grounds, in

that they do not always accurately track translation quality as judged by a human annota-

tor, and they may not even reliably separate professional from machine translations [Culy

and Riehemann, 2003]. Other automatic MT measures have been proposed, some of which

use parse decorations. Chapter 4 describes some of these alternatives in more detail.

An ideal automatic measure would correlate well with human judgements of translation

quality. However, judgements of fluency and adequacy themselves are highly variable across

annotators. Rather than correlate with these measurements, one may instead examine the

correlation with a different human-derived measure of translation quality: Snover et al.

[2006] propose Human-targeted Translation Edit Rate (HTER), a measurement of the work

performed by a human editor to correct the translation until it is equivalent to the reference

translation. They show that a single HTER score is very well-correlated to fluency/adequacy

judgements, and has lower variance: they find that a single HTER score is more predictive of

a held-out fluency/adequacy judgement than a single fluency/adequacy judgement. HTER

still requires human intervention, but, probably because of its consistency in evaluation,

it has been adopted as the evaluation standard for the DARPA GALE project [DARPA,

2008].

2.4.3 Parsing in MT

Early applications of syntactic structure to SMT were explored as an

alternative to the phrase-table approach. Yamada and Knight [2001] and Gildea [2003] in-

corporate operations on a treebank-trained target-language parse tree to represent p(f |a, e)

and p(a|e), but have no “phrase” component; Charniak et al. [2003] apply grammatical

structure to the p(e) language-model component. These approaches met with only moder-


ate success.

Rather than building a syntactic model into the decoder or language model, others pro-

posed automatically [Xia and McCord, 2004, Costa-jussa and Fonollosa, 2006] and manually

coded [Collins et al., 2005a, Popovic and Ney, 2006] transformations on source-language

trees, to reorder source sentences from f to f ′ before training or decoding (translation

models are trained on bitexts with f ′ and e). Zhang et al. [2007] extend this approach

by inserting an explicit source-to-source “pre-reordering” model pr0(f′|f) to provide lattice

input alternatives to the main translation.

The phrase-table models described in section 2.4.1 capture some local syntactic struc-

ture — even when the phrases are simply reliably-adjacent word-sequences — by virtue

of recording actually-observed n-grams in the source- and target-language sequences, but

these models offer additional power when they are made syntactically aware. Syntactically-

aware decoders are united with the phrase-table approach in such approaches as the ISI

systems [Galley et al., 2004, 2006, Marcu et al., 2006], the systems built by Zollmann et al.

[2007], and recently the Joshua open-source project [Li et al., 2009]. Each of these builds

syntactic trees over the target side of the bitext in training and learns phrase-table entries with syntactically-labeled spans. Conversely, Quirk et al. [2005] and Xiong et al. [2007] construct phrase-table entries using source-language dependency structure, while Liu et al. [2006a] apply a similar technique using constituent structure instead of dependency.

Rather than pursue these phrase-table based decoder models directly, chapter 5 of this

work explores mechanisms to use parsers to improve the word-to-word alignments that are

the material from which the phrases are learned.

2.5 Summary

This chapter has provided an overview of four key technologies for the remainder of this

work: statistical parsing, n-best list reranking, automatic speech recognition, and statistical

machine translation. Special attention is paid to the interaction of parsers with speech

recognition, the evaluation of speech recognition and machine translation, and the existing

roles of syntactic structure in statistical machine translation. The next three chapters

use parsers (and rerankers) in various combinations for conversational speech recognition (chapter 3), for machine translation evaluation (chapter 4), and for improving word alignment quality for machine translation (chapter 5). Further details on work more directly related to this thesis are provided in each chapter.


Chapter 3

PARSING SPEECH

Parse-decoration on the word sequence has a strong potential for application in the

domain of automatic speech recognition (ASR). Extracting syntactic structure from speech

is more challenging than ASR or parsing alone, because the combination of these two stages

introduces the potential for cascading error, and most parsing systems assume that the leaves

(words) of the syntactic tree are fixed. This chapter1 applies parse structure as an additional

knowledge source, even when the evaluation targets do not include parse structure explicitly.

It also considers the benefits to parsing of considering alternative speech transcripts (when

the evaluation targets are parse measures themselves).

We thus consider recognition and parsing as a joint reranking problem, with uncertainty

(in the form of multiple hypotheses) in both the recognizer and parser components. In this

joint problem, there are two possible targets: word sequence quality, measured by word

error rate (WER), and parse quality, measured over speech transcripts by SParseval. For

both these targets, sentence boundary concerns have largely been ignored in prior work:

speech recognition research has generally assumed that sentence boundaries do not have

a major impact, since the placement of segment boundaries in a string does not affect

WER on that string. Parsing research, on the other hand, has generally assumed that

sentence boundaries are given (usually by punctuation), since most parsing research has

been on text. Spoken language, unlike written language, does not have explicit markers for

sentence and paragraph breaks; i.e., punctuation is not verbalized. Sentence boundaries in

spoken corpora must therefore be automatically recognized, introducing another source of

difficulty for the joint recognition-and-parsing problem, regardless of the target: sentence

segmentation.

1 The work presented in this chapter is included in a paper that has been accepted to Computer Speech and Language.


Although there has been a substantial amount of research on speech recognition, seg-

mentation of spoken language, and parsing (as described in the next section), there has

been little work exploring automation of all three together. Most research has incorporated

only one or two of these areas, typically treating recognition and parsing as separable pro-

cesses. In this chapter, we combine recognition and parsing using discriminative reranking:

selecting optimal word sequences from the N -best word sequences generated from a speech

recognizer given cues from M parses for each, and selecting optimal parse structure from the

N ×M -best parse structures associated with these word sequences. At the same time, we

explore the impact of automatic segmentation. We ask the following inter-related questions:

• In the task of extracting parse structure from conversational speech, how much can

we improve performance by exploiting the uncertainty of the speech recognizer?

• In the word recognition task, does a discriminative syntactic language model benefit

from incorporating parse uncertainty in parse feature extraction?

• How does segmentation affect the usefulness of parse information for improving speech

recognition, and what is its impact on parsing accuracy, given alternative word se-

quences and alternative parse hypotheses?

Section 3.1 discusses the relevant background for this research integrating speech segmen-

tation, parsing, and speech recognition. Section 3.2 outlines the experimental framework in

which this chapter explores those questions, while section 3.3 describes the corpus and the

configuration of the various components of this system. Section 3.4 describes the results of

those experiments, and section 3.5 discusses these results in the context of the dissertation

as a whole.

3.1 Background

Our approach to parsing conversational speech builds on several active research areas in

speech and natural language processing. This section extends the review from chapter 2 to

highlight the prior work most related to the work in this chapter.


3.1.1 Parsing on speech and its evaluation

As discussed in section 2.1.4, most parsing research has been developed with the parseval

metric [Black et al., 1991], which was initially developed for parse measurement on text.

It was used in initial studies of speech based on reference transcripts (without considering

speech recognizer errors). The grammatical structures of speech are different than those of

text: for example, Charniak and Johnson [2001] demonstrated the usefulness (as measured

by parseval) of explicit modeling of edit regions in parsing transcripts of conversational

speech.

Unfortunately, parseval is not well-suited to evaluating parses of automatically-recognized

speech. In particular, when the words (leaves) are different between reference and hypoth-

esized trees (as will be the case when there are recognition errors), it is difficult to say

whether a particular span is included in both, and the parseval measure is not well de-

fined. Roark et al. [2006] introduce alternative scoring methods to address this problem

with SParseval, a parse evaluation toolkit. The SParseval method used here takes into

account dependency relationships among words instead of spans. Specifically, CFG trees

are converted into dependency trees using a head-finding algorithm and head percolation of

the words at the leaves. Each dependency tree is treated as a bag of triples 〈d, r, h〉 where

d is the dependent word, r is a symbol describing the relation, and h is the dominating

lexical headword (central content word in the phrase). Arc-labels r are determined from

the highest constituent label in the dependent and the lowest constituent label dominating

the dependent and the head. SParseval describes the overlap between the “gold” and

hypothesized bags-of-triples in terms of precision, recall and F measure.

Overall, SParseval allows a principled incorporation of both word accuracy and accu-

racy of parse relationships. Since every triple (the dependency-pair and its link label, as

in figure 3.1) involves two words, this measure depends heavily on word accuracy, but in a

more complex way than word error rate, the standard speech recognition evaluation met-

ric. Figure 3.1 demonstrates a number of properties of the SParseval measure. Although

both (b) and (c) have the same word error (one substitution each), they have very different

precision and recall behavior. As the figure suggests, the SParseval measure over-weights


[Figure 3.1 content: (a) the reference parse of “I really think so”, whose dependency triples are (I, S/NP, think), (really, VP/AdvP, think), (think, <s>/S, <s>), and (so, VP/AdvP, think); (b) a hypothesized parse of “I really think yeah” — Precision = 3/4, Recall = 3/4, Word Error Rate = 1/4 — whose triples match (a) except that (yeah, S/DM, think) replaces (so, VP/AdvP, think); (c) a hypothesized parse of “I really sink so” — Precision = 0/4, Recall = 0/4, Word Error Rate = 1/4 — whose triples all name sink as the head instead of think.]

Figure 3.1: A SParseval example that includes a reference tree (a) and two hypothesized

trees (b,c) with alternative word sequences. Each tree lists the dependency triples that

it contains; bold triples in the hypothesized trees indicate triples that overlap with the

reference tree. Although all have the same parse structure, tree (c) is penalized more

heavily (no triples right) because it gets the head word think wrong.


“key” words, making SParseval a joint measure of word sequence and parse quality. All

words appear exactly once in the left (dependent) side of the triple, but only the heads of

phrases appear on the right. Thus, those words that are the lexical heads of many other

words (such as think in the figure) are multiply-weighted by this measure. Head words are

multiply weighted because getting head words wrong impacts not only the triples where

that head word is dependent on some other token, but also the triples where some other

word depends on that head word. Non-head words are not involved in so many triples.

In this work, we use SParseval as our measure of parse quality for parses produced over

speech recognition transcription hypotheses.
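A minimal sketch of this bag-of-triples scoring follows (not the SParseval toolkit itself; triples are assumed to be hashable (dependent, relation, head) tuples):

    from collections import Counter

    def sparseval_scores(gold_triples, hyp_triples):
        # Dependency-based SParseval sketch: precision, recall, and F over the
        # multiset overlap of <dependent, relation, head> triples.
        gold, hyp = Counter(gold_triples), Counter(hyp_triples)
        matched = sum(min(count, gold[triple]) for triple, count in hyp.items())
        precision = matched / max(sum(hyp.values()), 1)
        recall = matched / max(sum(gold.values()), 1)
        f = 2 * precision * recall / max(precision + recall, 1e-9)
        return precision, recall, f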

3.1.2 Speech segmentation

Speech transcripts offer another challenge for parsing, whether used as an evaluation or

a knowledge source. We showed that parser performance (as measured by an adapted

parseval) degrades significantly when using automatically-detected (rather than reference)

sentence boundaries [Kahn et al., 2004, Kahn, 2005], even when the speech transcripts are

entirely accurate.

Extending the same lines of research, Harper et al. [2005] used SParseval to assess the

impact of automatic segmentation on parse quality, but using automatic word transcrip-

tions as well. Their work focuses on selecting segmentations from a fixed word sequence and

providing a top-choice parse for each of those segments. As we [Kahn et al., 2004] previ-

ously found on reference transcripts, they show a negative impact of segmentation error on

ASR hypothesis transcripts, and further show that optimizing for minimum segmentation

error does not lead to the best parsing performance. Parser performance, rather, benefits

more from higher segment-boundary recall (i.e., shorter segments). They do not, however,

consider alternative speech recognition hypotheses, which is an important focus of the work

in this chapter.

Though choosing a different segmentation does not affect the WER measure (as it would affect SParseval), choosing alternate segmentations affects ASR language modeling, even in the absence of a parsing language model, because even n-gram language models


assume that segment-boundary conditions are matched between LM training and test: Stol-

cke [1997] demonstrated that adjusting pause-based ASR n-best lists to take into account

segment boundaries matched to language model training data gave reductions in word error

rate.

3.1.3 Parse features in reranking

Section 2.3.3 discussed general approaches to using parsing as a language model, including

parsing language-models like Chelba and Jelinek [2000] and Roark [2001]. Reranking, as

discussed in section 2.2, is applied to parsers [Collins and Koo, 2005] but also to language-

modeling for ASR, with [e.g., Collins et al., 2005b] and without [Roark et al., 2007] parse

features.

Collins et al. [2005b] perform discriminative reranking using features of the parse structure

extracted from a single-best parse of the English ASR hypothesis. Arisoy et al. [2010] used

a similar strategy for Turkish language modeling. In both cases, the objective was the

minimization of WER. Harper et al. [2005] and others, as mentioned above, use reranking

with the parsing objective over automatic speech transcripts. However, neither the syntactic language-modeling work nor the parsing work over automatic speech transcripts considers the variable hypotheses of both the speech recognizer and the parser in a reranking context. Using both sources of variation together is the approach pursued in this chapter.

3.2 Architecture

The system for handling conversational speech presented in this chapter is illustrated

schematically in figure 3.2 and involves the following steps:

1. a speech recognizer, which generates speech recognition lattices with associated

probabilities from an audio segment (here, a conversation side);

2. a segmenter which detects sentence-like segment boundaries E, given the top word

hypothesis from the recognizer and prosodic features from the audio;


Figure 3.2: System architecture at test time.

3. a resegmenter which applies the segment boundaries E to confusion networks de-

rived from the lattices and generates an N -best word hypothesis cohort W s for each

segment s, made up of word sequences wi with associated recognizer posteriors pw(wi)

for each of the N sequences wi ∈W s;

4. a parser component which generates an M -best list of parses ti,j , j = 1, . . . ,M ,

for each wi ∈ W s, along with confidences pp(ti,j , wi) for each parse over each word

sequence (all the ti,j for a given segment s make up the parse cohort T s)

5. a feature extractor which extracts a vector of descriptive features fi,j over each

member of the parse structure cohort which together make up the feature cohort F s;

and

6. a reranker component which selects an optimal vector of features (and thus a preferred candidate) from the cohort, effectively choosing an optimal 〈w, t〉 that maximizes performance with respect to some objective function on the selected candidate and the reference word transcripts and parse-tree.

Figure 3.3: n-best resegmentation using confusion networks

In the remainder of this section we describe the components created for this joint-problem

architecture: the resegmenter (step 3), the features chosen in the feature extractor (step 5),

and the re-ranker itself (step 6). We describe the details of each component’s configuration

in section 3.3.2.

3.2.1 Resegmentation

This chapter compares multiple segmentations of the word stream, including the ASR-

standard pause-based segmentation, reference sentence boundaries, and two cases of auto-

matically detected sentence-like units. Since the recognizer output is based on pause-based

segmentation, a resegmenter (step 3) is needed to generate N-best hypotheses for the al-

ternative segmentations, taking recognizer word lattices and a hypothesized segmentation

as input. The resegmentation strategy is depicted in Figure 3.3. First, the lattices from

step 1 are converted into confusion networks, a compact version of lattices which consist


of a sequence of word slots where each slot contains a list of word hypotheses

with associated posterior probabilities [Mangu et al., 2000]. Because the slots are linearly

ordered, they can be cut and rejoined at any inter-slot boundary. All the confusion net-

works for a single conversation side are concatenated. Speaker diarization (the relationship

between this conversation side and the transcription of the interlocutor) is not varied. The

concatenated confusion network is then cut at locations corresponding to the hypothesized

segment boundaries, producing a segmented confusion network. Each candidate segmenta-

tion produces a different re-cut confusion network.

These re-cut confusion networks are used to generate W s, an N -best list of transcription

hypotheses, for each hypothesized segment s from the target segmentation. Each transcrip-

tion wi of W s has a recognizer confidence pr(wi), calculated as

p_r(w_i) = \prod_{k=1}^{\mathrm{len}(w_i)} p_r(w_{ik}) \qquad (3.1)

where pr(wik) is the confusion network confidence of the word selected for wi from the k-th

slot in the confusion net. This posterior probability pr(wik) is derived from the recognizer’s

forward-backward decoding where the acoustic model, language model, and posterior scaling

weights are tuned to minimize WER on a development set.
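A minimal sketch of equation 3.1 (illustrative; slot_posteriors is assumed to be the list of confusion-network posteriors, all positive, of the words chosen for wi):

    import math

    def recognizer_confidence(slot_posteriors):
        # Equation 3.1: the confidence p_r(w_i) of a resegmented hypothesis is
        # the product of the confusion-network posteriors of the words chosen
        # in each slot, accumulated in log space for numerical stability.
        return math.exp(sum(math.log(p) for p in slot_posteriors))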

3.2.2 Feature extraction

After creating the parse cohort T s from the word-sequence cohort W s, each member of the

parse cohort is a word-sequence hypothesis wi with a parse tree ti,j projected over it, along

with two confidences: the ASR system posterior pr(wi) and the parse posterior pp(ti,j , wi).

The feature-extraction step (step 5) extracts additional features and organizes all of these

features into a vector fi,j to pass to the reranker. The feature extraction is organized to

allow us to vary fi,j to include different subsets of those extracted.

In this subsection, we present three classes of features extracted from our joint recognizer-

parser architecture: per-word-sequence features, generated directly from the output of

the recognizer and resegmenter and shared by all parse candidates associated with a tran-

scription hypothesis wi; per-parse features, generated from the output of the parser,


Table 3.1: Reranker feature descriptions for parse ti,j of word sequence wi

Feature          Description                                  Feature class
pr(wi)           Recognizer probability                       per-word-sequence features
Ci               Word count of wi                             per-word-sequence features
Bi               True if wi is empty                          per-word-sequence features
pp(ti,j, wi)     Parse probability                            per-parse features
ψ(ti,j)          Non-local syntactic features                 per-parse features
pplm(wi)         Parser language model                        aggregated-parse features
E[ψi]            Non-local syntactic feature expectations     aggregated-parse features

which are different for each parse hypothesis ti,j ; and aggregated-parse features, con-

structed from the parse candidates but which aggregate across all ti,j that belong to the

same wi. The features are listed in Table 3.1. All of the probability features p(·) are

presented to the reranker in logarithmic form (values −∞ to 0).

Per-word-sequence features

Two recognizer outputs are read directly from the N -best lists produced in step 3 and

reflect non-parse information. The first is the recognizer language-model score pr(wi), which

is calculated from the resegmenter’s confusion networks as described in equation 3.1. A

second recognizer feature is the number of words Ci in the word hypothesis, which allows

the reranker to explicitly model sequence length. Lastly, an empty-hypothesis indicator Bi

(where Bi = 1 when Ci = 0) allows the reranker to learn a score that compensates for the lack

of a useful parse score. (It is possible that a segment will have some hypothesized word

sequences wi that have valid words and some that contain only noise, silence or laughter,

i.e., an empty hypothesis, which would have no meaningful parse.)


Per-parse features

Each parse ti,j has an associated lexicalized-PCFG probability pp(ti,j , wi) returned by the

parser. For the parse quality objective, our system needs to compare parses generated

from different word hypotheses. The joint probability p(t, w) contains information about

the word sequence (the marginal parsing language model probability p(w) = \sum_t p(t, w))

and the parse for that word sequence p(t|w). For the two objectives of parsing and word

transcriptions, it is useful to factor these. Parse-specific features are described here, and in

the next section we consider features that are aggregated over the M-best parses.

For parsing, we compute the probabilities

p_p(t_{i,j} \mid w_i) = \frac{p_p(t_{i,j}, w_i)}{\sum_{k=1}^{M} p_p(t_{i,k}, w_i)} \qquad (3.2)

that represent the proportion of the M -best parser probability mass for sequence wi assigned

to tree ti,j .

The score pp(·) described above models the parser’s confidence in the quality of the entire

parse. Following the parse-reranking schemes sketched in section 3.1.3, we also extract

non-local parse features: a vector of integer counts ψ(ti,j) extracted from parse ti,j and

reflecting various aspects of the parse topology, using the feature-extractor from Charniak

and Johnson [2005]. These features are non-local in the sense that they make reference to

topology outside the usual context-free condition. For example, one element of this vector

might count, in a given parse, the number of VPs of length 5 and headed by the word think.

Further examples of the sorts of components in ψ(·) may be found in Charniak and Johnson

[2005]. Because these features are often counts of the configurations of specific words or non-

terminal labels, ψ(ti,j) is a very high-dimensional vector, which is pruned at training time

for the sake of computational tractability. For each segmentation condition, we construct a

different definition of ψ, keeping only those features whose values vary between candidates

for more than k segments of the training corpus.

When Ci = 0, we assign exactly one dummy tree [S [-NONE- null]] to the empty word

sequence, set pp(ti,1, wi) to a value very close to zero, and derive ψ(ti,1) from the dummy

tree using the same feature extractor. pp(ti,1|wi) is set to unity since there is only one

(dummy) parse available.


Aggregated-parse features

For the WER objective, the details of specific parses are not of interest, but rather their ex-

pected behavior given the distribution over possible trees {p(t|w)}. We calculate the “parser

language model” feature pplm(wi) by summing the probabilities of all parse candidates for

wi:

p_{plm}(w_i) = \sum_{k=1}^{M} p_p(t_{i,k}, w_i). \qquad (3.3)

We also aggregate our non-local syntactic feature vectors ψ(ti,j) across the multiple parses

ti,j associated with a single word sequence wi by taking the (possibly flattened) expectation

over the conditional parse probabilities:

E[\psi_i] = \sum_{j=1}^{M} p_p(t_{i,j} \mid w_i)\, \psi(t_{i,j}) \qquad (3.4)

We further investigated flattening the parse probabilities (i.e., replacing pp(ti,k, wi) with pp(ti,k, wi)^γ for 0 < γ ≤ 1) under the hypothesis that they were “over-confident”, which is

useful in chapter 4 (also published as Kahn et al. [2009]).
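A minimal sketch of this aggregation (illustrative; the parses argument is assumed to be the M-best list of (joint probability, non-local feature dictionary) pairs for a single word sequence wi):

    def aggregate_parse_features(parses, gamma=1.0):
        # Aggregate an M-best parse list for one word sequence w_i.
        # parses: list of (joint probability p_p(t, w_i), feature dict psi(t));
        # gamma:  optional flattening exponent (0 < gamma <= 1).
        # Returns the parser LM feature p_plm(w_i) (equation 3.3) and the
        # expected non-local feature vector E[psi_i] (equation 3.4).
        p_plm = sum(p for p, _ in parses)               # equation 3.3
        flattened = [p ** gamma for p, _ in parses]     # optional flattening
        norm = sum(flattened)
        expected = {}
        for p_flat, (_, psi) in zip(flattened, parses):
            conditional = p_flat / norm                 # equation 3.2, on flattened scores
            for name, value in psi.items():
                expected[name] = expected.get(name, 0.0) + conditional * value
        return p_plm, expected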

3.2.3 Reranker

The reranker (step 6) takes as input the feature vector fi,j for each candidate and applies

a discriminative model θ to sort the cohort candidates. θ is learned by pairing feature

vectors fi,j with a value assigned by an external objective function φ, and finding a θ

that optimizes the cumulative objective function of the top-ranked hypotheses over all the

training segments s. In this work, we consider two alternative objective functions: word

error rate (WER), for targeting the word sequence (φw(wi)), and SParseval for evaluating

parse structure (φp(ti,j)). For the SParseval objective, the optimization problem is given

by:

\theta = \arg\max_{\theta} \sum_{s} \phi_p\!\left( \arg\min_{t^s_{i,j}} \; \theta \cdot f^s_{i,j} \right) \qquad (3.5)

A similar equation results for the word error rate objective, but the minimization is only

over word hypotheses wi.


For training the re-ranker component of our system, we need segment-level scores, since

we apply the re-ranking per-segment. Ideally, scoring operates on the concatenated result

for a whole conversation side, to avoid artifacts of mapping word sequences associated

with hypothesized segments that differ from the reference segments. However, in training,

where scoring is needed for all M × N hypotheses, it is prohibitively complex to score

all combinations at the conversation-side level. We therefore approximate per-segment

SParseval scores at training time by computing precision and recall against all those

reference dependency pairs whose child dependent aligns within the segment boundaries

available to the parses being ranked.

We treat reranker training as a rank-learning margin problem, using the svm-rank

toolkit [Joachims, 2006] as described in section 2.2.2.

3.3 Corpus and experimental setup

These experiments used the Switchboard corpus [Godfrey et al., 1992], a collection of English

conversational speech. The Switchboard corpus consists of five-minute telephone conversa-

tions between strangers on a randomly-assigned topic. The audio is recorded on different

channels for each speaker. The data from a single channel is referred to as a “conversation

side.”

All the data used in this experiment was taken from a subset of Switchboard conversa-

tions which the Linguistic Data Consortium (LDC) has annotated with parse trees. These

experiments use the original LDC transcriptions of the Switchboard audio rather than the

subsequent Mississippi State transcriptions [ISIP, 1997], because hand-annotated reference

parses only exist for the former. The data also has manual annotations of disfluencies and

sentence-like units [Meteer et al., 1995], labeled with reference to the audio. Because the

treebanking effort used only transcripts (no audio), there are occasional differences in the

definition of a “sentence”; because the audio-based annotation was likely to be more faithful

to the speaker’s intent, and because the automatic segmenter was trained from data anno-

tated with a related LDC convention [Strassel, 2003], we used the audio-based definition of

sentence-like units (referred to henceforth as SUs).

The Switchboard parses were preprocessed for use in this system following methods


Table 3.2: Switchboard data partitions

Partition Sides Words

Train 1042 654271

Dev 116 76189

Eval 128 58494

described in Kahn [2005], which are summarized here. Various aspects of the syntactic an-

notation beyond the scope of this task—for example, empty categories—were removed. The

parses were also resegmented to match the SU segments, with some additional rule-based

changes performed to make these annotations more closely match the LDC SU conventions.

In the resegmented trees, constituents spanning manually-annotated segment boundaries

were discarded, and multiple trees within a single manually annotated segment were sub-

sumed beneath a top-level SUGROUP constituent. To match the speech recognizer output,

punctuation is removed, and contractions are retokenized (e.g., can + n’t ⇒ can’t).

The corpus was partitioned into training, development and evaluation sets whose sizes

are shown in Table 3.2. Results are reported on the evaluation set; the development set was

used during debugging and for exploring new feature-sets for f , but no results from it are

reported here.

3.3.1 Evaluation measures

Word recognition performance is evaluated using word-error rate measurements generated

by the NIST sclite scoring tool [NIST, 2005] with the words in the reference parses taken

as the reference transcription. Because we want to compare performance across different

segmentations, WER is calculated on a per-conversation side basis, concatenating all the

top-ranked word sequence hypotheses in a given conversation side together. When com-

paring the statistical significance of different results between configurations, the Wilcoxon

Signed Rank test provided by sclite is used.

For parse-quality evaluation, we use the SParseval toolkit [Roark et al., 2006], again


calculated on a per-conversation side basis, concatenating all the top-ranked parse hypothe-

ses in a given conversation. We use the setting that invokes Charniak’s implementation

of the head-finding algorithm and consider performance over both closed- and open-class

words. When comparing the statistical significance of SParseval results, we use a per-

segment randomization [Yeh, 2000].

3.3.2 Component configurations

Speech recognizer

The recognizer is the SRI Decipher conversational speech recognition system [Stolcke et al.,

2006], a state-of-the-art large-vocabulary speech recognizer that uses various acoustic and

language models to perform multiple recognition and adaptation passes. The full system has

multiple front-ends, each of which produces n-best lists containing up to 2000 word sequence

hypotheses per audio segment, which are then combined into a single set of word sequence

hypotheses using a confusion network. This system has a WER of 18.6% on the standard

NIST RT-04 evaluation test set.

Human-annotated reference parses are required for all the data involved in these exper-

iments. Unfortunately, because they are difficult to create, reference parses are in short

supply, and all the Switchboard conversations used in the evaluation of this system are

already part of the training data for the SRI recognizer. Although it represents only a very

small part of the training data (Switchboard is only a small part of the corpus, and the

data here are restricted to the hand-parsed fraction of Switchboard), there is the danger

that this will lead to unrealistically good recognizer performance. This work compensates

for this potential danger by using a less powerful version of the full recognizer, which has

fewer stages of rescoring and adaptation than the full system and a WER of 20.2% on the

RT-04 test set. On our evaluation set from Switchboard, this system has a 22.9% WER.

Segmenter

Our automatic segmenter [Liu et al., 2006b] frames the sentence-segmentation problem as a

binary classification problem in which each boundary between words can be labeled as either


a sentence boundary or not. Given a word sequence and prosodic features, it estimates the

posterior probability of a boundary after each word. The particular version of the system

used here is based on the hidden-event model (HEM) from Stolcke and Shriberg [1996],

with features that include n-gram probabilities, part of speech, and automatically-induced

semantic classes, and combines the lexical and prosodic information sources. The HEM is

an HMM with a higher-order Markov process on the state sequence (the word-boundary

label pair) and observation probabilities given by the prosodic information using bagged

decision trees. Segment boundaries were hypothesized for all word boundaries where the

posterior probability of a sentence boundary was above a certain threshold.

We explore four segmentation conditions in our experiments:

Pause-based segmentation uses the recognizer’s “automatic” segments, which are deter-

mined based on speech/non-speech detection by the recognizer (i.e., pause detection);

it serves as a baseline.

Min-SER segmentation is based on the automatic system using a posterior threshold of

0.5, which minimizes the word-level slot error rate (SER).

Over-segmented segmentation is based on the automatic system using a posterior thresh-

old of 0.35, the value suggested by Harper et al. [2005] for obtaining better parse

quality.

Reference segmentation is mapped to the hypothesized word sequence by performing a

dynamic-programming alignment between the confusion networks and the reference;

it provides an oracle upper bound.

Table 3.3 summarizes the segmentation conditions, including the performance (measured

as SU boundary F and SER), the number of segments and the average segment length

in words for each segmentation condition on the evaluation set. Note that the automatic

segmentation with the lower threshold results in more boundaries, so that the average

“sentence” length is shorter and recall is favored over precision.


Table 3.3: Segmentation conditions. F and SER report the SU boundary performance over

the evaluation section of the corpus.

Segmentation condition   Threshold   F       SER     # Segments (Train)   # Segments (Eval)   Average length
Pause-based              NA          0.62    0.61    54943                5693                10.3
Min-SER                  0.5         0.77    0.45    86681                8417                6.9
Over-segmented           0.35        0.78    0.46    96627                9369                6.2
Reference                NA          (1.00)  (0.00)  91254                8779                6.7

Resegmenter

Given the confusion network representation of the speech recognition output, the main

task of resegmentation is generating N -best lists given a new segmentation condition for

the confusion networks. For a given segment, the lattice-tool program from the SRI

Language Modeling Toolkit [Stolcke, 2002] is used to find paths through the confusion

network ranked in order of probability, so the N most probable paths are emitted as an

N -best list 〈w1 . . . wN 〉, where each wi is a sequence of words. For these experiments, the

N -best lists are limited to at most N = 50 word sequence hypotheses.

Parser

Our system uses an updated release of the Charniak generative parser [Charniak, 2001] (the

first stage of the November 2009 updated release of [Charniak and Johnson, 2005], without

the discriminative second-stage component) to do the M -best parse-list (and parse-score)

generation. As in Kahn [2005], we do not implement a separate “edit detection” stage

but treat edits as part of the syntactic structure. The parser is trained on the entire

training set’s reference parses; no parse trees from other sources are included in the training

set. We generate M = 10 parses for each word sequence hypothesis, based on analyses

(presented later) that showed little benefit from additional parses and much more benefit

from increasing the number of sentence hypotheses. If the parser generates fewer than M


hypotheses, we take as many as are available. For the full system, we train a single parser

on the entire training set; for providing training cohorts to the reranker, the parser is trained

on round-robin subsets of the training set, as discussed in section 3.3.2.

Feature extractor

The extraction of non-local syntactic features ψ(ti,j) uses the software and feature definitions

from Charniak and Johnson [2005]. For tractability, we prune the set of features to those

with non-zero (and non-uniform) values within a single segment’s hypothesis set for more

than 2000 segments, which is approximately 2% of the total number of training segments

(as in the parse-reranking experiments in Kahn et al. [2005]). Pruning is done separately

for each segmentation of the training set, yielding about 40,000 non-local syntactic features

under most segmentation conditions.2

The aggregate parse features pplm(wi) and E[ψi] are calculated by sums across the M

parses generated for each wi. We assume that this approximation (instructing the parser to

return no parses after the M -th) has no important impact on the value of these features.
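
A minimal sketch of this aggregation (with hypothetical data structures; treating pplm as the sum of the M joint parse probabilities is an assumption made for illustration):

    from collections import Counter

    def aggregate_parse_features(parse_probs, parse_feature_vectors, gamma=1.0):
        """Compute the aggregate parse features for one word sequence w_i from
        its M-best parses: pplm(w_i), approximated here as the sum of the M
        joint parse probabilities, and E[psi_i], the non-local syntactic
        feature vector averaged under the gamma-flattened parse posteriors."""
        flattened = [p ** gamma for p in parse_probs]
        z = sum(flattened)
        posteriors = [f / z for f in flattened]

        expected_psi = Counter()
        for posterior, psi in zip(posteriors, parse_feature_vectors):
            for feature, value in psi.items():
                expected_psi[feature] += posterior * value

        pplm = sum(parse_probs)
        return pplm, expected_psi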

Reranker

As discussed in section 2.2, the reranker component of our system is the svm-rank tool from

Joachims [2006]. The reranker needs to be trained using candidate parses from a data set

that is independent of the parser training and the evaluation test set. Because of the limited

amount of hand-annotated parse tree data, we did not want to create a separate training

partition just for this model. Instead, we adopt the round-robin procedure described in

Collins and Koo [2005]: we build 10 leave-n-out parser models, each trained on 9/10 of the

training set, and run each on the tenth that it has not been exposed to. The resulting parse

candidate sets are passed to the feature-extraction component and the resulting vectors

(and their objective function values) are used to train the reranker models.

2Our non-local syntactic feature set is thus slightly different for each segmentation, since the number and content of the set of segments vary among segmentations. The pause-based segmentation, with substantially longer segments, selects about 28,000 features under this pruning condition; others have about 40,000.


To avoid memory constraints, we assign each segment to one of 10 separate bins and

train 10 svm-rank models.3 For each experimental combination of segmentation and features

in fi,j , we re-train all 10 rerankers. At evaluation time, the cohort candidates are ranked

by all 10 models and their scores are averaged. The parse (or word-sequence) of the top-

ranked candidate is taken to be the system’s hypothesis for a given segment, and evaluated

according to either the WER or SParseval objective.
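
At evaluation time, the combination of the ten models reduces to averaging their scores per candidate and taking the arg-max; a small sketch (the scoring callables stand in for the trained svm-rank models):

    def rerank_cohort(candidates, feature_vectors, models):
        """Score one segment's candidates with every reranker model, average
        the scores, and return the top-ranked candidate."""
        def averaged(fv):
            return sum(model(fv) for model in models) / len(models)

        scores = [averaged(fv) for fv in feature_vectors]
        best = max(range(len(candidates)), key=scores.__getitem__)
        return candidates[best]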

3.4 Results

This section describes the results of experiments designed to assess the potential for per-

formance improvement associated with increasing the number of word-sequence vs. parse

candidates, as well as the actual gains achieved by reranking under both WER and SParse-

val objectives and different segmentation conditions. We also include a qualitative analysis

of improvements.

3.4.1 Baseline and Oracle Results

To provide a baseline, we sequentially apply the recognizer, segmenter, and parser, choosing

the top scoring word-sequence and then the top parse choice. We establish upper bounds

for each objective by selecting the candidate from the M × N parse-and-word-sequence

cohort that scores the best on each objective function. The results of these experiments are

reported in tables 3.4 (optimizing for WER with M = 1 and different N) and 3.5 (optimizing

for SParseval with N = 50 and different M). The number in parentheses corresponds to

the mismatched condition — picking a candidate based on one criterion and scoring it with

another. Both sets of results show that improving one objective leads to improvements in

the other, since word errors are incorporated into the SParseval score.
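
The oracle bounds themselves are computed by exhaustively scoring every candidate in a cohort; a sketch of that selection (the objective and secondary metric are passed in as callables, e.g. WER and 1 − SParseval F, both "lower is better"):

    def oracle_select(cohort, objective, other_metric):
        """Pick the candidate in an M x N cohort that minimizes `objective`,
        and also report `other_metric` for that same candidate -- the
        "mismatched" value shown in parentheses in tables 3.4 and 3.5."""
        best = min(cohort, key=objective)
        return best, objective(best), other_metric(best)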

Table 3.4 shows that the N -best cohorts contain a potential WER error reduction of

32%. Larger gains are possible for the shorter-segment segmentation conditions, due to the

increase in the number of available alternatives when generating N -best lists from more

3Each candidate set is generated by a single leave-n-out parser (populated by conversation-side), but each svm-rank bin (populated by segments, not by conversation sides) includes some cohorts from each of the leave-n-out tenths.


Table 3.4: Baseline (1-best serial processing) and oracle WER reranking performance from

N = 50 word sequence hypotheses and 1-best parse. Parenthesized values indicate (unop-

timized) SParseval scores of the selected hypothesis.

                 Serial baseline        1×N WER oracle
Segmenter        WER (SParseval)        WER (SParseval)

Pause 23.7 (68.2) 17.6 (70.7)

Min-SER 23.7 (70.7) 16.7 (73.7)

Over-seg 23.7 (70.9) 16.2 (74.1)

Reference 23.7 (72.5) 16.2 (77.0)

(and shorter) confusion networks.

Table 3.5 shows that there is a potential 39% reduction in parse error (1 − F) between

the serial baseline (F = 72.5) and the joint M ×N optimization (F = 83.2) with the oracle

segmentation. The potential benefit is smaller for the pause-based segmentation (F = 68.2

vs. 75.2), both in terms of the relative improvement (22%) and the absolute F score. The

possible benefit of automatic segmentation falls between these ranges, with slightly better

results for the over-segmented case. We observe smaller gains in going from M = 10 to

M = 50 parses (and no gains in the automatic segmentation cases), so only M = 10 parses

are used in subsequent experiments, to reduce memory requirements in training.

We can also compare the benefit of increasing N vs. M . Figure 3.4 illustrates the

trade-off for reference segmentations, showing that there is a bigger benefit from increasing

N than M . However, a comparison of the results in the two tables shows that there is a

significant gain in SParseval parse performance from increasing both N and M: if only M is increased (from 1×1 to 1×50), the potential benefit is a 25% error reduction; if both are increased to 10×50, the possible reduction is 36% (39% for 50×50).


Table 3.5: Oracle SParseval (WER) reranking performance from N = 50 word sequence

hypotheses and M = 1, 10, or 50 parses. Parenthesized values indicate (unoptimized) WER

of the selected hypotheses.

Parse Oracle (N = 50)

M = 1 M = 10 M = 50

Segments (WER) SParseval (WER) SParseval (WER) SParseval

Pause (20.8) 72.7 (20.3) 74.4 (20.0) 75.2

Min-SER (20.3) 75.8 (19.7) 78.0 (19.7) 78.0

Over-seg (20.0) 76.2 (19.3) 78.5 (19.3) 78.5

Reference (19.1) 79.4 (18.3) 82.3 (18.1) 83.2

[Figure 3.4 plot: contours of oracle SParseval performance, ranging from 0.765 to 0.825, over the number of recognition hypotheses N (0 to 50, x-axis) and the number of parses M (0 to 50, y-axis).]

Figure 3.4: Oracle parse performance contours for different numbers of parses M and recog-

nition hypotheses N on reference segmentations.


Table 3.6: Reranker feature combinations. All feature sets additionally contain the per-word-sequence features pr(wi), Ci, and Bi.

Feature set                Additional features                Computed per

ASR (No additional features) word sequence

ParseP pp(ti,j , wi) parse

ParseLM pplm(wi) word sequence

ParseP+NLSF pp(ti,j , wi), ψ(ti,j) parse

ParseLM+E[NLSF] pplm(wi), E[ψi] word sequence

3.4.2 Optimizing for WER

We also investigate whether providing multiple M -best parses to the reranker augments

the parsing knowledge source when optimizing for WER (compared to using only one parse

annotation, or to using no parse annotation at all). To examine this, we explore different

alternatives for creating the feature-vector representation fi,j of a word-sequence candidate,

as summarized in Table 3.6. All experiments include recognizer confidences pr(wi), word count Ci, empty-hypothesis flag Bi, and parser posteriors pp(ti,j, wi) in the feature vector.

Table 3.6 shows all the feature combinations investigated with the feature names used here.

Table 3.7 shows the WER results of all the segmentation conditions and feature sets,

which can be compared to the baseline serial result of 23.7%. Reranking with the ASR

features alone does not improve performance, since there is little that the reranker can

learn (acoustic and language model scores are combined in the process of generating N -best

lists from confusion networks). The WER performance is worse than baseline on the Min-

SER and Ref segmentations, possibly because these segments are relatively longer than the

Over-seg condition, making word length differences a less useful feature. Other results in

table 3.7 show that non-local syntactic features ψ(ti,j) (NLSF here) are useful for word

recognition, confirming the results from Collins et al. [2005b]. In addition, there are some

new findings. First, SU segmentation impacts the utility of the parser for word transcription

(as well as for parsing). There is no benefit to using the parse probabilities alone except


Table 3.7: WER on the evaluation set for different sentence segmentations and feature sets.

Baseline WER for all segmentations is 23.7%.

Segmentation

Features Pause-based Min-SER Over-seg Ref seg

ASR (M = 1) 23.6 24.2 23.7 24.1

ParseP (M = 1) 23.6 23.7 23.7 23.1

ParseLM 23.7 23.7 23.7 23.1

ParseP+NLSF (M = 1) 23.3 23.4 23.4 22.8

ParseLM+E[NLSF] 23.3 23.3 23.4 22.7

Oracle-WER 17.6 16.7 16.2 16.2

in the case of reference segmentation,4 and the benefit of parse features is greater with the

reference segmentation than with the automatic segmentations (22.7% vs. 23.3% WER).

Second, the use of more than one parse with parse posteriors does not lead to significant

performance gains for any feature set.

While there is no significant benefit from using M = 10 parses with the parse-probability plus non-local syntactic features, this configuration does give the best result, and we will use it in comparisons to the M = 10

SParseval optimization. For all segmentations, the ParseLM+E[NLSF] features provide

a significant reduction (p < 0.001 using the Wilcoxon test) in WER from the baseline,

but only 4–6% of the possible improvement within the N-best cohort is obtained with the

automatic segmentation. When using reference segmentation, reranking with any of the

feature sets provides significant (p < 0.001) WER reductions compared to baseline.

Table 3.8 explores the effect of lowering the parse-flattening γ below 1.0 for those WER-

optimized models that use more than one parse candidate (γ has no effect on expectation

weighting when there is only one parse candidate). The differences introduced by γ = 0.5

or γ = 0.1 are not significantly different from γ = 1.0, and systems trained with γ ≠ 1 are

in general slightly worse than those with γ = 1.0. In all further experiments, γ is set to the

4Since the n-gram language model is trained on much more data than the parser, it may be difficult for the parsing language model to provide added benefit.


Table 3.8: Word error rate results for different sentence segmentations and feature sets,

comparing γ parse-flattening for WER optimization with M = 10 parses. The baseline WER for

all segmentations is 23.7%.

Segmentation

Features γ Pause-based Min-SER Over-seg Ref seg

ParseLM γ = 0.1 23.7 23.7 23.7 23.4

ParseLM γ = 0.5 23.6 23.7 23.7 23.2

ParseLM γ = 1.0 23.6 23.7 23.7 23.1

ParseLM+E[NLSF] γ = 0.1 23.3 23.4 23.4 22.9

ParseLM+E[NLSF] γ = 0.5 23.3 23.4 23.4 22.7

ParseLM+E[NLSF] γ = 1.0 23.3 23.3 23.4 22.7

default (1.0).

3.4.3 Optimizing for SParseval

When optimizing for SParseval, we train and evaluate with the feature set that includes

parse-specific features: pr(wi), Ci, Bi, pplm(wi), pp(ti,j , wi), and ψ(ti,j). Table 3.9 summa-

rizes the results for the different segmentation conditions in comparison to the serial baseline

and M × N -best oracle result. WER numbers, reported in parentheses, are the WER of

the leaves of the selected parse. As expected from prior work [Kahn et al., 2004, Harper

et al., 2005], we find an impact on parsing from the segmentation. The best results for all

feature sets are obtained with reference segmentations, and the over-segmented threshold

in automatic segmentation is slightly better than the min-SER case. For reference segmen-

tations, higher parse scores correspond to lower WER, but this is not always the case for the automatic segmentations.

ParseP feature set is better than the baseline (p < 0.01 using per-segment randomization

[Yeh, 2000]).

The non-local syntactic features did not lead to improved parse performance over the


Table 3.9: Results under different segmentation conditions when optimizing for SParseval

objective; the associated WER results are reported in parentheses.

Segmentation

Features Pause-based Min-SER Over-seg Ref seg

Baseline (23.7) 68.2 (23.7) 70.7 (23.7) 70.9 (23.7) 72.5

ParseP (24.1) 68.8 (24.0) 71.1 (24.0) 71.3 (23.2) 73.4

ParseP+NLSF (24.3) 69.1 (25.5) 70.4 (25.8) 70.4 (23.5) 73.1

oracle (20.3) 74.4 (19.7) 78.0 (19.3) 78.5 (18.3) 82.3

parse probability alone, and in some cases hurt performance, which seems to contradict

prior results in parse reranking. However, as shown in Figure 3.5, there is an improvement

due to use of features for the case where there is only N = 1 recognition hypothesis, but

that improvement is small compared to gains from increasing N . Figure 3.5 also shows

that optimizing for WER with non-local syntactic features actually leads to better parsing

performance than when optimizing directly for parse performance. We conjecture that this

result is due to overtraining the reranker when the feature dimensionality is high and the

training samples are biased to have many poorly-scoring candidates. The parsing problem

involves many more candidates to rank than WER (300 vs. 30 on average) because parse-

reranking has M ×N candidates while transcription-reranking has at most N candidates.

Since the pool of M × N is much larger, it contains more poorly-ranking candidates and

thus the learning may be dominated by the many pairwise cases involving poor-quality

candidates.

3.4.4 Qualitative observations

We examined the recognizer outputs for the WER optimization with the ParseLM and

expected NLSF features to understand the types of improvements resulting from using a

parse-based language-model for re-ranking. Under this WER optimization on reference

segmentation, of the 8,726 segments in the test set, 985 had WER improvements and 462


Figure 3.5: SParseval performance for different feature and optimization conditions as a

function of the size of the N-best list.

had WER degradations. We examined a sample (about 100 each) of these improvements

and degradations. Some improvements (about 15%) are simple determiner recoveries, e.g.

“a” in “have a happy thanksgiving.” Other examples involve short main verbs (also a bit

more than 15%; above 20% if contractions are included), as in:

some evenings are [or] worse than others

that is [as] a pretty big change

used [nice] to live in colorado

well they don’t [—] always

where the corrected word is in boldface and the incorrect word (substitution) output by the

baseline recognizer is italicized and in brackets.

More significant from a language processing perspective are the corrections involving

pronouns (about 5%), which would impact coreference and entity analysis. The parsing LM

recovers lost pronouns and eliminates incorrectly recognized ones, particularly in contrac-

tions with short main verbs, as in the following examples:


she was there [they’re] like all winter semester

they’re [there] going to school

we’re [where] the old folks now

(Contraction corrections like these are not included in the count for short main verb cor-

rections.) Further improvements are found in complementizers and prepositions (about 5%

each), while only about 10% of the improvements changed content words. The remaining

45% of improvements are miscellaneous.

Another pronoun example illustrates how the parse features can overcome the bias of

frequent n-grams in conversational speech:

Improved: they *** really get uh into it

Baseline: yeah yeah really get uh into it

Reference: they uh really get uh into it

with substitution errors in italics and deletions indicated by “***.” (The bigram “yeah

yeah” is very frequent in the Switchboard corpus.)

Of the segments that suffered WER degradation under ParseLM+E[NLSF] WER op-

timization, a little more than 15% were errors on a word involved in a repetition or self-

correction, e.g. the omission of the boldface the in:

. . . that’s not the not the way that the society is going

Another 7–10% of the candidates that had WER degradation were more grammatically plausible than the reference transcription, e.g. the substitution of the determiner a for an

unusually-placed pronoun (probably a correction):

Reference: but i lot of times i don’t remember the names

Optimized: but a lot of times i do not remember the names

Most importantly, these last two classes of WER degradation do not have an impact on the

meaning of the sentence. The remaining roughly 75% of the WER-degraded segments are

difficult to characterize, but the errors in these segments also largely involve function words.


Most of these types of corrections are observed whether the optimization is for WER or

SParseval. Many cases where they give different results have a higher or equal WER for

the SParseval-optimized case, but the result is arguably better, as in:

WER obj.: i know that people *** to arrange their whole schedules . . . (1 error)

SParseval obj.: i know that people used to arrange their whole schedule . . . (1 error)

Baseline: i know that people easter arrange their whole schedules . . . (2 errors)

Reference: i know that people used to arrange their whole schedules . . .

We compared the WER-optimized segments to the SParseval-optimized segments, and found that about 100 segments had better SParseval and worse WER in the SParseval-optimized segment, and better WER and worse SParseval in the WER-optimized segment. Of these

cases, about 15% seem to be ones where the SParseval-optimized output is more grammati-

cally plausible than the reference, e.g.:

Reference: i’ve i’ve probably talked maybe to five people

SParseval opt.: i’ve i’ve probably talked to maybe just five people

WER opt: i’ve i’ve probably talked maybe just five people

Reference: now it’s like you know tough and dirty team

SParseval opt.: now it’s like you know a tough and dirty team

WER opt.: now it’s like you know tough and dirty team

Note that it is important that the parser is trained on conversational speech in order to make

useful predictions on conversational phenomena such as the hedging “like, you know” and

the prescriptively proscribed double-adverb “maybe just”. The remaining improvements in

this analysis may be categorized as a variety of other cases.

3.5 Discussion

In this chapter, we have presented a discriminative framework for jointly modeling speech

recognition and parsing, with which we improve both word sequence quality (as measured


by WER) and parse quality (as measured by SParseval). We confirm and extend previous

work in using parse structure for language-modeling [Collins et al., 2005b] and in parsing

conversational speech [Kahn et al., 2004, Kahn, 2005, Harper et al., 2005].

Experiments using this framework provide some answers to the questions posed at the

beginning of the chapter. First, we find that parsing performance can be improved sub-

stantially by incorporating parser uncertainty via N -best list rescoring, particularly with

high quality sentence segmentation, although the automatic reranking systems achieve only

a small fraction of the potential gain. Further, allowing for word uncertainty is much more

important than considering parse alternatives. In optimizing for WER, however, no signif-

icant gains are obtained from modeling parse uncertainty in a statistical parser, either in a

language model or in non-local syntactic features. Of course, these findings may depend on

the particular parser used. Finally, we find that sentence segmentation quality is important

for parse information to have a significant impact on speech recognition WER, and that

a good segmentation can increase the potential gains in parsing from considering multiple

word-sequence hypotheses. A conclusion of these findings is that improvements to auto-

matic segmentation algorithms would substantially extend the utility of parsers in speech

processing.

One surprising result was that non-local syntactic features in reranking were of more

benefit to speech recognition than to parsing and, in fact, sometimes hurt parsing perfor-

mance. We conjecture that this result is due to the fact that the joint parsing problem

involves many more poor candidate pairs among reranker training samples, which seems to

be problematic for the learner when the features are high-dimensional. It may be that other

types of rerankers are better suited to handling such problems.


Chapter 4

USING GRAMMATICAL STRUCTURE TO EVALUATE MACHINE TRANSLATION

This chapter1 explores a different use of grammatical structure prediction: its use in

predicting the quality of machine translation. As suggested in chapters 1 and 2, a key

challenge in automatic machine translation evaluation is to account for allowable variability,

since two equally good translations may be quite different in surface form. This is especially

challenging when the evaluation measures used consider only the word-sequence.

We motivate the use of dependencies for SMT evaluation with two example machine

translations (and a human-translated reference):

Ref: Authorities have also closed southern Basra’s airport and seaport.

S1: The authorities also closed the airport and seaport in the southern port of Basra.

S2: Authorities closed the airport and the port of.

(4.1)

A human evaluator judged the system 1 result (S1) as equivalent to the reference, but

indicated that the system 2 (S2) result had problematic errors. BLEU4 (a popular automatic

metric for SMT) gives S1 and S2 similar scores (0.199 vs. 0.203). TER (another popular

metric) prefers S2 (with an error of 0.7 vs. 0.9 for S1), since a deletion requires fewer edits

than rephrasing. EDPM (the new metric described later in this chapter) provides a score for

S1 (0.414) that is preferred to S2 (0.356), reflecting EDPM’s ability to match dependency

structure. The two phrases “southern Basra’s airport and seaport” and “the airport and

seaport in the southern port of Basra” have more similar dependency structure than word

order.

The next section (4.1) reviews some relevant research in the evaluation of machine trans-

lation. In section 4.2, this chapter describes a family of dependency pair match (DPM) au-

tomatic machine-translation metrics, and section 4.3 describes the infrastructure and tools

1Matthew Snover provided invaluable assistance in a version of this work, which has been published as Kahn et al. [2009].


used to implement that family. Sections 4.4 and 4.5 explore two ways to compare members

of this family with human judgements. Section 4.6 explores the potential to adapt the

EDPM component measures by combining them with another state-of-the-art MT metric’s

use of synonym tables and other word-sequence and sub-word features. Section 4.7, finally,

discusses the broader implications and future directions for these findings.

4.1 Background

Currently, the most popular approaches for automatic MT evaluation are BLEU [Papineni

et al., 2002], based on n-gram precision, and Translation Edit Rate (TER), an edit distance

[Snover et al., 2006]. These measures can only account for variability when given multiple

translations, and studies have shown that they may not accurately track translation quality

[Charniak et al., 2003, Callison-Burch, 2006]. Both BLEU and TER are word-sequence

measures: they use exclusively features of the word-sequence and no knowledge of language

similarity or structure beyond that sequence.

Some alternative measures have proposed using external knowledge sources to explore

mappings within the words themselves, such as synonym tables and morphological stem-

ming, e.g. METEOR [Banerjee and Lavie, 2005] and the ATEC measure [Wong and Kit,

2009]. TER Plus (TERp) [Snover et al., 2009], which is an extension of the previously-

mentioned TER, also incorporates synonym sets and stemming, along with automatically-

derived paraphrase tables. Still other systems attempt to map language similarity measures

into a high-level semantic entailment abstraction, e.g. [Pado et al., 2009].

By contrast, this chapter’s research proposes a technique for comparing syntactic decom-

positions of the reference and hypothesis translations. Other metrics modeling syntactically-

local (rather than string-local) word-sequences include tree-local n-gram precision in various

configurations of constituency and dependency trees [Liu and Gildea, 2005] and the d and

d var measures proposed by Owczarzak et al. [2007a,b], which compare relational tuples

derived from a lexical functional grammar (LFG) over reference and hypothesis transla-

tions.2

2 Owczarzak et al. [2007a] extend their previous line of research [Owczarzak et al., 2007b] by variably-weighting dependencies and by including synonym matching, two directions not pursued here. Hence,


Any syntactic-dependency-oriented measure requires a system for proposing dependency

structure over the reference and hypothesis translations. Liu and Gildea [2005] use a PCFG

parser with deterministic head-finding, while Owczarzak et al. [2007a] extract the seman-

tic dependency relations from an LFG parser [Cahill et al., 2004]. This chapter’s work

extends the dependency-scoring strategies of Owczarzak et al. [2007a], which reported sub-

stantial improvement in correlation with human judgement relative to BLEU and TER,

by using a publicly-available probabilistic context-free grammar (PCFG) parser and deter-

ministic head-finding rules, rather than an LFG parser. In addition, this chapter considers

alternative syntactic decompositions and alternative mechanisms for computing score com-

binations. Finally, the work presented here explores combination of syntax with synonym-

and paraphrase-matching scoring metrics.

Evaluation of automatic MT measures requires correlation with MT evaluation mea-

sures performed by human beings. Some [Banerjee and Lavie, 2005, Liu and Gildea, 2005,

Owczarzak et al., 2007a] compare the measure to human judgements of fluency and ade-

quacy. Other work, e.g. Snover et al. [2006], compares measures' correlation with human-

targeted TER (HTER), an edit-distance to a human-revised reference. The metrics de-

veloped here are evaluated in terms of their correlation against both fluency/adequacy

judgement and against HTER scores.

4.2 Approach: the DPM family of metrics

The specific family of dependency pair match (DPM) measures described here combines

precision and recall scores of various decompositions of a syntactic dependency tree. Rather

than comparing string sequences, as BLEU does with its n-gram precision, this approach

defers to a parser for an indication of the relevant word tuples associated with meaning — in

these implementations, the head on which that word depends. Each sentence (both reference

and hypothesis) is converted to a labeled syntactic dependency tree and then relations from

each tree are extracted and compared. These measures may be seen as generalizations of

the earlier paper is cited in comparisons. Section 4.6 includes synonym matching, but over data which are not directly comparable with either Owczarzak paper and using an entirely different mechanism for combination.


Reference tree: "The red cat ate 〈root〉", with arcs det (the → cat), mod (red → cat), subj (cat → ate), and root (ate → 〈root〉).
Hypothesis tree: "The cat stumbled 〈root〉", with arcs det (the → cat), subj (cat → stumbled), and root (stumbled → 〈root〉).

dlh list (reference): 〈the, det→, cat〉, 〈red, mod→, cat〉, 〈cat, subj→, ate〉, 〈ate, root→, 〈root〉〉
dlh list (hypothesis): 〈the, det→, cat〉, 〈cat, subj→, stumbled〉, 〈stumbled, root→, 〈root〉〉

Figure 4.1: Example dependency trees and their dlh decompositions.

dl list (hypothesis): 〈the, det→〉, 〈cat, subj→〉, 〈stumbled, root→〉
lh list (hypothesis): 〈det→, cat〉, 〈subj→, stumbled〉, 〈root→, 〈root〉〉

Figure 4.2: The dl and lh decompositions of the hypothesis tree in figure 4.1.

the dependency-pair F measures found in Owczarzak et al. [2007b].

The particular relations that are extracted from the dependency tree are referred to

here as decompositions. Figure 4.1 illustrates the dependency-link-head decomposition of a

toy dependency tree into a list of 〈d, l, h〉 tuples. Some members of the DPM family may

apply more than one decomposition; other good examples are the dl decomposition, which

generates a bag of dependent words with outbound links, and the lh decomposition, which

generates a bag of inbound link labels, with the head word for each included. Figure 4.2

shows the dl and lh decompositions for the same hypothesis tree.

The decompositions explored in various configurations in this chapter include:

dlh 〈Dependent, arc Label,Head〉 – full triple

dl 〈Dependent, arc Label〉 – marks how the word fits into its syntactic context

lh 〈arc Label,Head〉 – implicitly marks how key the word is to the sentence


dh 〈Dependent,Head〉 – drops syntactic-role information.

1g,2g – simple measures of unigram (bigram) counts

Various members of the family may choose to include more than one of these decomposi-

tions.3
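
To make the decompositions concrete, the following sketch (illustrative code, not the implementation described in section 4.3) extracts the bags of tuples for one sentence from its labeled dependency triples and its word sequence:

    from collections import Counter

    def decompose(triples, words, kinds=("dlh", "dl", "lh", "dh", "1g", "2g")):
        """Return a Counter over decomposition tuples; `triples` is a list of
        (dependent, label, head) dependency triples and `words` is the surface
        word sequence. Each tuple is tagged with its decomposition type so
        that different decompositions never collide."""
        bag = Counter()
        for d, l, h in triples:
            if "dlh" in kinds: bag[("dlh", d, l, h)] += 1
            if "dl" in kinds:  bag[("dl", d, l)] += 1
            if "lh" in kinds:  bag[("lh", l, h)] += 1
            if "dh" in kinds:  bag[("dh", d, h)] += 1
        if "1g" in kinds:
            for w in words:
                bag[("1g", w)] += 1
        if "2g" in kinds:
            for w1, w2 in zip(words, words[1:]):
                bag[("2g", w1, w2)] += 1
        return bag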

It is worth noting here that the dlh and lh decompositions (but not the dl decomposition)

“overweight” the headwords, in that there are n elements in the resulting bag, but if a word

has no dependents it is found in the resulting bag exactly one time (in the dlh case) or

not at all (in the lh case). Conversely, syntactically “key” words, those on which many

other words in the tree depend, are included multiple times in the decomposition (once for

each inbound link). This “overweighting” effectively allows the grammatical structure of

the sentence to indicate which words are more important to translate correctly, e.g. “Basra”

in example (4.1), or head verbs (which participate in multiple dependencies).

A statistical parser provides confidences associated with parses in a probabilistically-

weighted N -best list, which we use to compute expected (probability-weighted) counts for

each decomposition in both reference and hypothesized translations. By using expected

counts, we may count partial matches in computing precision and recall. This approach

addresses both the potential for parser error and for syntactic ambiguity in the translations

(both reference and hypothesis).

When multiple decomposition types are used together, we may combine these subscores

in a variety of ways. Here, we experiment with using two variations of a harmonic mean:

computing precision and recall over all decompositions as a group (giving a single precision

and recall number) vs. computing precision and recall separately for each decomposition.

We distinguish between these using the notation in (4.2) and (4.3):

F [dl, lh] = µh (Prec (dl ∪ lh) ,Recall (dl ∪ lh)) (4.2)

µPR[dl, lh] = µh (Prec (dl) ,Recall (dl) ,Prec (lh) ,Recall (lh)) (4.3)

where µh represents a harmonic mean. (Note that when there is only one decomposition,

3No d decomposition is included: this would be equivalent to a 1g decomposition. h decomposition might capture the syntactic weighting without the syntactic role that lh captures, but we find that lh has the same effect.


as in F [dlh], F [·] ≡ µPR[·].) Dependency-based SParseval [Roark et al., 2006] and the

d approach from Owczarzak et al. [2007a] may each be understood as F [dlh] (although

SParseval focuses on the accuracy of the parse, and Owczarzak et al. use a different

mechanism for generating trees for decomposition). The latter’s d var method may be

understood as something close to F [dl, lh]. BLEU4 is effectively µP (1g . . . 4g) with the

addition of a brevity penalty. Both the combination methods F and µPR are “naive” in that

they treat each component score as equivalent. When we introduce syntactic/paraphrasing

features in section 4.6, we will consider a weighted combination.
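
The difference between the two combination methods can be sketched directly over such bags of tuples (precision and recall here use the clipped bag intersection, in the spirit of n-gram precision; the code is illustrative only):

    from collections import Counter

    def prec_recall(hyp_bag, ref_bag):
        """Bag precision and recall via the clipped (multiset) intersection."""
        overlap = sum(min(count, ref_bag[t]) for t, count in hyp_bag.items())
        precision = overlap / max(sum(hyp_bag.values()), 1)
        recall = overlap / max(sum(ref_bag.values()), 1)
        return precision, recall

    def harmonic_mean(values):
        return len(values) / sum(1.0 / max(v, 1e-9) for v in values)

    def F_combined(hyp_bags, ref_bags):
        """F[...]: pool every decomposition into one bag, then take the
        harmonic mean of the single precision and recall (equation 4.2)."""
        return harmonic_mean(prec_recall(sum(hyp_bags, Counter()),
                                         sum(ref_bags, Counter())))

    def muPR_combined(hyp_bags, ref_bags):
        """muPR[...]: harmonic mean of the per-decomposition precisions and
        recalls (equation 4.3)."""
        parts = []
        for hyp_bag, ref_bag in zip(hyp_bags, ref_bags):
            parts.extend(prec_recall(hyp_bag, ref_bag))
        return harmonic_mean(parts)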

4.3 Implementation of the DPM family

The entire family of DPM measures may be implemented with any parser that generates

a dependency graph (a single labeled arc for each word, pointing to its head-word). Prior

work [Owczarzak et al., 2007a] on related measures has used an LFG parser [Cahill et al.,

2004] or an unlabelled dependency tree [Liu and Gildea, 2005].

In this work, we use a state-of-the-art PCFG (the first stage of Charniak and Johnson

[2005]) and context-free head-finding rules [Magerman, 1995] to generate an N -best list of

dependency trees for each hypothesis and reference translation. We use the parser’s default

(English) Wall Street Journal training parameters. Head-finding uses the Charniak parser’s

rules, with three modifications to make the semantic (rather than syntactic) relations more

dominant in the dependency tree: prepositional and complementizer phrases choose nom-

inal and verbal heads respectively (rather than functional heads) and auxiliary verbs are

dependents of main verbs (rather than the converse). These changes capture the idea that

main verbs are more important for adequacy in translation, as illustrated by the functional

equivalence of “have also closed” vs. “also closed” in the introductory example.

Having constructed the dependency tree, we label the arc between dependent d and

its head h as A/B, where A is the lowest constituent label headed by h and dominating d and B is the highest constituent label headed by d. For illustration, in figure 4.3, the

s node is the lowest node headed by stumbled that dominates cat, and the np node is

the highest constituent label headed by cat, so the arc linking cat to stumbled is labelled

s/np. This strategy is very similar to one adopted in the reference implementation of


Headed constituent tree for "The cat stumbled": root/stumbled [ s/stumbled [ np/cat [ dt/the "the", nn/cat "cat" ], vp/stumbled [ vbd/stumbled "stumbled" ] ] ]
Derived labeled dependency tree: the → cat (np/dt), cat → stumbled (s/np), stumbled → 〈root〉 (root/s)

Figure 4.3: An example headed constituent tree and the labeled dependency tree derived

from it.

labelled-dependency SParseval [Roark et al., 2006], and may be considered as a shallow

approximation of the rich semantics generated by LFG parsers [Cahill et al., 2004]. The

A/B labels are not as descriptive as the LFG semantics, but they have a similar resolution

in English (with its relatively fixed word order), e.g. the s/np arc label usually represents

a subject dependent of a sentential verb.
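
A compact sketch of this labeling rule over a head-annotated constituency tree (the tree encoding is my own, and the virtual root arc is left out for simplicity):

    def spanned(node):
        """Word indices covered by a subtree; a leaf is (pos, index), an
        internal node is (label, head_index, children)."""
        if len(node) == 2:
            return {node[1]}
        return set().union(*(spanned(child) for child in node[2]))

    def arc_label(root, dep, head):
        """Label the dependency arc dep -> head as A/B: A is the lowest
        constituent headed by `head` that dominates `dep`; B is the highest
        constituent headed by `dep`."""
        A = B = None

        def walk(node):
            nonlocal A, B
            label, h = node[0], node[1]
            if h == head and dep in spanned(node):
                A = label        # pre-order walk: deeper matches overwrite, so A ends up lowest
            if h == dep and B is None:
                B = label        # first match in pre-order is the highest such constituent
            if len(node) == 3:
                for child in node[2]:
                    walk(child)

        walk(root)
        return "%s/%s" % (A, B)

    # The headed tree from figure 4.3 (word indices: the=0, cat=1, stumbled=2):
    tree = ("s", 2, [("np", 1, [("dt", 0), ("nn", 1)]),
                     ("vp", 2, [("vbd", 2)])])
    assert arc_label(tree, dep=1, head=2) == "s/np"     # cat -> stumbled
    assert arc_label(tree, dep=0, head=1) == "np/dt"    # the -> cat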

For the cases where we have N -best parse hypotheses, we use the associated parse prob-

abilities (or confidences) to compute expected counts. The sentence will then be represented

with more tuples, corresponding to alternative analyses. For example, if the N -best parses

include two different roles for dependent “Basra”, then two different dl tuples are included,

each with the weighted count that is the sum of the confidences of all parses having the

respective role.4

The parse confidence p is normalized so that the N -best confidences sum to one. Because

the parser is overconfident, we explore a flattened estimate: p̂(k) = p(k)^γ / ∑_i p(i)^γ, where k and i index the parses and γ is a free parameter.

4 The use of expectations with N-best parses is different from d 50 and d 50 pm in Owczarzak et al. [2007a], in that the latter uses the best-matching pair of trees rather than an aggregate over the tree sets and they do not use parse confidences.
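
A sketch combining the flattening above with the expected decomposition counts (parse_bags[k] would be the bag of tuples for the k-th parse, e.g. from a decompose() helper like the one sketched in section 4.2):

    from collections import Counter

    def flatten_confidences(parse_probs, gamma=0.25):
        """Renormalize N-best parse probabilities with flattening exponent gamma;
        gamma = 0 gives a uniform distribution, gamma = 1 leaves the normalized
        confidences unchanged."""
        powered = [p ** gamma for p in parse_probs]
        z = sum(powered)
        return [p / z for p in powered]

    def expected_counts(parse_bags, parse_probs, gamma=0.25):
        """Expected (confidence-weighted) decomposition counts over the N-best
        parses of one sentence."""
        expected = Counter()
        for confidence, bag in zip(flatten_confidences(parse_probs, gamma), parse_bags):
            for tup, count in bag.items():
                expected[tup] += confidence * count
        return expected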


4.4 Selecting EDPM with human judgements of fluency & adequacy

We explore various configurations of the DPM by assessing the results against a corpus

of human judgements of fluency and adequacy, specifically the LDC Multiple Translation

Chinese corpus parts 2 [LDC, 2003] and 4 [LDC, 2006], which are composed of English

translations (by machine and human translators) of written (and edited) Chinese newswire

articles. For each article in these corpora, multiple human evaluators provided judgements

of fluency and adequacy for each sentence (assigned on a five-point scale), with each judge-

ment using a different human judge and a different reference translation. For a rough5

comparison with Owczarzak et al. [2007a], we treat each judgement as a separate segment,

which yields 16,815 tuples of 〈hypothesis, reference, fluency, adequacy〉. We compute per-

segment correlations.6 The baselines for comparison are case-sensitive BLEU (4-grams, with

add-one smoothing) and TER.

The specific dimensions of DPM explored include:

Decompositions. We compute precision and recall of several different decompositions:

d,dl,dlh increasing n-grams, directed up through the tree, as inspired by BLEU4 and

Liu and Gildea [2005].

dl,lh partial decomposition, to match d var

dlh all labeled dependency link pairs, as suggested by SParseval and d

1g,2g surface unigrams and bigrams only

Parser variations. When using more than one parse, we explore:

Size of N-best list. 1 (adopting only the best parse) or 50 (as in Owczarzak et al.

[2007a])

5Our segment count differs slightly from Owczarzak et al. [2007a] for the same corpus: 16,807 vs. 16,815. As a result, the baseline per-segment correlations differ slightly (BLEU4 is higher here, while TER here is lower), but the trends in gains over those baselines are very similar.

6The use of the same hypothesis translations in multiple comparisons in the Multiple Translation Corpus means that scored segments are not strictly independent, but for methodological comparison with prior work, this strategy is preserved.


Table 4.1: Per-segment correlation with human fluency/adequacy judgements of different

combination methods and decompositions.

metric r

BLEU4 0.218

F [1g, 2g, dl, lh] 0.237

µPR[1g, 2g, dl, lh] 0.217

F [1g, 2g] 0.227

µPR[1g, 2g] 0.215

F [1g, dl, dlh] 0.227

F [dl, lh] 0.226

µPR[dl, lh] 0.208

Parse confidence. The distribution flattening parameter is varied from γ = 0 (uni-

form distribution) to γ = 1 (no flattening).

Score combination. Global F vs. component harmonic mean µPR.

4.4.1 Choosing a combination method: F vs. µPR

In table 4.1, we compare combination methods for a variety of decompositions. These

results demonstrate that F consistently outperforms µPR as well as the BLEU4 baseline

(see table 4.2). µPR measures are never better than BLEU; µPR combinations are thus not

considered further in this work.

4.4.2 Choosing a set of decompositions

Considering only the 1-best parse, we compare DPM with different decompositions to the

baseline measures. Table 4.2 shows that all decompositions except [dlh] have a better

per-segment correlation with the fluency/adequacy scores than TER or BLEU4. Includ-

ing progressively larger chunks of the dependency graph with F [1g, dl, dlh], inspired by the


Table 4.2: Per-segment correlation with human fluency/adequacy judgements of baselines

and different decompositions. N = 1 parses used.

metric |r|

F [1g, 2g, dl, lh] 0.237

F [1g, 2g] 0.227

F [dl, lh] 0.226

BLEU4 0.218

F [dlh] 0.185

TER 0.173

BLEUk idea of progressively larger n-grams, did not give an improvement over [dl, lh]. De-

pendencies [dl, lh] and string-local n-grams [1g, 2g] give similar results, but the combination

of all four decompositions [1g, 2g, dl, lh] gives further improvement in correlation over their

use in isolation. The results also confirm, with a PCFG, what Owczarzak et al. [2007a]

found with an LFG parser: that partial-dependency matches are better correlated with hu-

man judgements than full-dependency links. We speculate that this improvement is because

partial-dependency matches are more forgiving: they allow the system to detect that a word

is used in the proper context without requiring its syntactic neighbors to also be translated

in the same way.

4.4.3 Choosing a parse-flattening γ

Since the parser in our implementation provides a confidence in each parse, we explore the

use of that confidence with the γ free parameter and N = 50 parses. Table 4.3 explores

various “flattenings” (values of γ) of the parse confidence in the F [·] measure. γ = 1 is

not always the best, suggesting that the parse probabilities p(tree|words) are overconfident.

The differences are small, but the trends are consistent across all the decompositions tested

here. We find that γ = 0.25 is generally the best flattening of the parse confidence for

the variants of this measure that we have tested: it is nearest the maximum r for both


Table 4.3: Considering values of γ,N = 50 (and one N = 1 case) for two different sub-graph

lists (dl, lh and 1g, 2g, dl, lh).

γ F [1g, 2g, dl, lh] F [dl, lh]

1 0.239 0.232

0.75 0.240 0.233

0.5 0.240 0.234

0.25 0.240 0.234

0 0.239 0.234

[N = 1] 0.237 0.226

decompositions in table 4.3, though rounding hides the exact maxima.

Table 4.3 also shows the effect of using N -best parses for different decompositions. The

N = 50 cases are uniformly better than N = 1. While not all of these differences are

significant, there is a consistent trend of correlation r improving with 50 vs. 1 parse.

In summary, exploring a number of variants of the DPM metric against an average

fluency/adequacy judgement leads to a best-case of:

EDPM = F [1g, 2g, dl, lh], N = 50, γ = 0.25

We use this configuration in experiments assessing correlations with HTER.

4.5 Correlating EDPM with HTER

In this section, we compare the EDPM metric selected in the previous section to baseline

metrics in terms of document- and segment-level correlation with HTER scores using the

GALE 2.5 translation corpus [LDC, 2008]. The corpus includes system translations into

English from three SMT research sites, all of which use system combination to integrate re-

sults from several systems, some phrase-based and some that use syntax on either the source

or target side. No system provided system-generated parses; the EDPM measure’s parse

structures are generated entirely at evaluation time. The source data includes Arabic and

Chinese in four genres: bc (broadcast conversation), bn (broadcast news), nw (newswire),


Table 4.4: Corpus statistics for the GALE 2.5 translation corpus.

Arabic Chinese Total

doc sent doc sent doc sent

bc 59 750 56 1061 115 1811

bn 63 666 63 620 126 1286

nw 68 494 70 440 138 934

wb 69 683 68 588 137 1271

Total 259 2593 257 2709 516 5302

and wb (web text), with corpus sizes shown in table 4.4. This data may thus be broken

down in several ways: as one large corpus, by language into two corpora (one

derived from Arabic and one from Chinese), or by genre (into four) or by language×genre

(eight subcorpora). The corpus includes one English reference translation [LDC, 2008] for

each sentence and a system translation for each of the three systems. Additionally, each

of the system translations has a corresponding “human-targeted” reference aligned at the

sentence level, so we may compute the HTER score at both the sentence and document

level.

HTER and automatic scores all degrade, on average, for more difficult sentences. Since

there are multiple system translations in this corpus, it is possible to roughly factor out this

source of variability by correlating mean-normalized scores,7 m̄(ti) = m(ti) − (1/I) ∑j=1..I m(tj), where m can be HTER, TER, BLEU4 or EDPM, and ti represents the i-th translation of

segment t. Mean-removal ensures that the reported correlations are among differences in

the translations rather than among differences in the underlying segments.

7Previous work Kahn et al. [2008] reported HTER correlations against pairwise differences among translations derived from the same source to factor out sentence difficulty, but this violates independence assumptions used in the Pearson's r tests.
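
Mean-removal itself is a one-pass computation; a minimal sketch (assuming a mapping from each source segment to its I system scores):

    def mean_remove(scores_by_segment):
        """Subtract, from each system's score on a segment, the mean score of
        all I systems on that same segment, so that correlations reflect
        differences among translations rather than segment difficulty."""
        normalized = {}
        for segment_id, scores in scores_by_segment.items():
            mean = sum(scores) / len(scores)
            normalized[segment_id] = [s - mean for s in scores]
        return normalized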


Table 4.5: Per-document correlations of EDPM and others to HTER, by genre and by

source language. Bold numbers are within 95% significance of the best per column; italics

indicate that the sign of the r value has less than 95% confidence (that is, the value r = 0

falls within the 95% confidence interval).

r vs. HTER bc bn nw wb all Arabic all Chinese all

TER 0.59 0.35 0.47 0.17 0.54 0.32 0.44

−BLEU4 0.42 0.32 0.46 0.27 0.42 0.33 0.37

−EDPM 0.69 0.39 0.47 0.27 0.60 0.39 0.50

Table 4.6: Per-sentence, length-weighted correlations of EDPM and others to HTER, by

genre and by source language. Bold numbers indicate significance as above.

r vs. HTER bc bn nw wb all Arabic all Chinese all

TER 0.44 0.29 0.33 0.25 0.44 0.25 0.36

−BLEU4 0.31 0.24 0.29 0.25 0.31 0.24 0.28

−EDPM 0.46 0.31 0.34 0.30 0.44 0.30 0.37

4.5.1 Per-document correlation with HTER

Table 4.5 shows per-document Pearson’s r between −EDPM and HTER, as well as the

TER and −BLEU4 baselines’ Pearson’s r with HTER. (We correlate with negative BLEU4

and EDPM to keep the sign of a good correlation positive.) EDPM has the best correlation

overall, as well as in each of the subcorpora created by dividing by genre or by source

language. In structured data (bn and nw), these differences are not significant, but in the

unstructured domains (wb and bc), EDPM is always significantly better than at least one

of the comparison baselines.


4.5.2 Per-sentence correlation with HTER

Table 4.6 presents per-sentence (rather than per-document) correlations based on scores,

weighted by sentence length in order to get a per-word measure of correlation which reduces

variance across sentences. (Even with length weighting, the r values have smaller magnitude

due to the higher variability at the sentence level.) EDPM again has the largest correlation

in each category, but TER has r values within 95% confidence of EDPM scores on nearly

every breakdown.
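
One standard way to compute such a length-weighted correlation is the weighted form of Pearson's r sketched below (an illustration of the weighting idea, not necessarily the exact procedure used to produce table 4.6):

    def weighted_pearson(x, y, w):
        """Pearson correlation of x and y under non-negative weights w
        (here, sentence lengths), using weighted means, variances, and covariance."""
        total = sum(w)
        mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
        mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
        cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
        var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
        var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
        return cov / (var_x ** 0.5 * var_y ** 0.5)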

4.6 Combining syntax with edit and semantic knowledge sources

While the results in the previous section show that EDPM is as good or better than base-

line measures TER and BLEU4, the correlation is still low. This result is consistent with

intuitions derived from the example in section 4.2, where the EDPM score is much less

than 1 for the good translation. For that reason, we investigated combining the alternative

wording features (synonymy and paraphrase) of TERp [Snover et al., 2009] with the EDPM

syntactic features.

The TERp tools take an entirely different approach from EDPM. Rather than intro-

duce grammatical structure, the TERp (“TER plus”) model extracts counts of multiple

classes of edit operations and linearly combines the costs of those operations. These op-

erations extend the TER operations (insert, delete, substitute, and shift) to include also

“substitute-stem”, “substitute-synonym” and “substitute-paraphrase” operations that rely

on external knowledge sources (stemmers, synonym tables, and paraphrase tables respec-

tively). TERp’s approach thus exploits a knowledge source that is relatively well-separated

from the grammatical-structure information provided by EDPM.

To determine the relative cost of each class of edit operation, TERp provides an optimizer

for weighting multiple simple subscores. The TERp optimizer performs a hill-climbing

search, with randomized restarts, to maximize the correlation of a linear combination of the

subscores with a set of human judgements. Within the TERp framework, the subscores are

the counts of the various edit types, normalized for the length of the reference, where the

counts are determined after aligning the MT output to the reference using default (uniform)


edit costs.

The experiments here use the TERp optimizer but extend the set of subscores by includ-

ing the syntactic and n-gram overlap features (modified to reflect false and missed detection

rates for the TERp format rather than precision and recall). The subscores explored include:

E : the 8 fully syntactic subscores from the DPM family, including false/miss error rates

for the expected values of dl, lh, dlh, and dh decompositions.

N : the 4 n-gram subscores from the DPM family; specifically, error rates for the 1g and

2g decompositions.

T : the 11 subscores from TERp, which include matches, insertions, deletions, substitu-

tions, shifts, synonym and stem matches, and four paraphrase edit scores.

For these experiments, we again use the GALE 2.5 data, but with 2-fold cross-validation

in order to have independent tuning and test data. Documents are partitioned randomly,

such that each subset has the same document distribution across source-language and genre.

As in section 4.5.2, the objective is length-normalized per-sentence correlation with HTER,

using mean-removed scores as before. In figure 4.4, we plot the Pearson’s r (with 95%

confidence interval) for the results on the two test sets combined, after linearly normalizing

the predicted scores to account for magnitude differences in the learned weight vectors. The

baseline scores, which involve no tuning, are not normalized.

The left side of figure 4.4 shows that TER and EDPM are significantly more correlated

with HTER than BLEU when measured in this dataset, which is consistent with the overall

results of the previous section. It is also worth noting that the N+E combination is not

equivalent to EDPM (though it has the same decompositions of the syntactic tree), but

EDPM’s combination strategy yields a more robust r correlation with HTER. The N+E

combination outperforms E alone (i.e. it is helpful to use both n-gram and dependency

overlap) but gives lower performance than EDPM because of the particular combination

technique. Both findings are consistent with the fluency/adequacy experiments in sec-

tion 4.4. The TERp features (T in figure 4.4), which account for synonym/paraphrase


Figure 4.4: Pearson’s r for various feature tunings, with 95% confidence intervals. EDPM,

BLEU and TER correlations are provided for comparison.

differences, have much higher correlation with HTER than the syntactic E+N subscores.

However, a significant additional improvement is obtained by adding syntactic features to

TERp (T+E). Adding the n-gram features to TERp (T+N) gives almost as much improve-

ment, probably because most dependencies are local. There is no further gain from using

all three subscore types.

4.7 Discussion

In summary, this chapter introduces the DPM family of dependency pair match measures.

Through a corpus of human fluency and adequacy judgements, we select EDPM, a member

of that family with promising predictive power. We find that EDPM is superior to BLEU4

and TER in terms of correlation with human fluency/adequacy judgements and as a per-

document and per-sentence predictor of mean-normalized HTER. We also experiment with

including syntactic (EDPM-style) features and synonym/paraphrase features in a TERp-

style linear combination, and find that the combination improves correlation with HTER


over either method alone. EDPM’s approach is shown to be useful even beyond TERp’s

own state-of-the-art use of external knowledge sources.

One difference with respect to the work of Owczarzak et al. [2007a] is the use of a PCFG

vs. an LFG parser. The PCFG has the advantage that it is publicly available and easily

adaptable to new domains. However, the performance varies depending on the amount of

labeled data for the domain, which raises the question of how sensitive EDPM and related

measures are to parser quality.

A limitation of this method for MT system tuning is the computational cost of parsing

compared to word-based measures such as BLEU or TER. Parsing every sentence with the

full-blown PCFG parser, as done here, is hundreds of times slower than these simple n-gram

methods. Two alternative low-cost use scenarios include late-pass evaluation, for choosing

between different system architectures, or system diagnostics, looking at relative quality of

these component scores compared to those of an alternative configuration.


Chapter 5

MEASURING COHERENCE IN WORD ALIGNMENTS FOR AUTOMATIC STATISTICAL MACHINE TRANSLATION

Syntactic trees (of the type described in section 2.1) fundamentally capture two kinds

of information: dependency and span. Chapters 3 and 4 primarily use dependency links

in their evaluation (from word to word within the same sentence). This chapter, by con-

trast, explores the utility of span information in natural language processing, specifically in

the analysis of automatically-generated word-alignments in statistical machine translation

bitexts.

Statistical machine translation (introduced and briefly sketched in section 2.4) uses word-

to-word alignment as a core component in its model training, perhaps most critically as a

source of aligned bitexts for the construction of the phrase table. For the creation of

the phrase tables, a key concern is that bitext alignments of low quality will induce poor

phrase tables. For example, a single stray alignment link can greatly reduce the number

of useful phrases that may be extracted, as in figure 5.1. In hierarchical or syntactic sta-

tistical MT systems, too, incorrect alignments may lead to lower-quality phrasal structure;

higher-quality alignments offer more opportunities for any of these systems to learn correct

translations by example.

The machine alignments in figure 5.1, for example, prevent the alignment of the noun

phrase “唯一遗憾的” to “the only regret”. They still allow larger clusters to be mutually

aligned (e.g. “唯一 遗憾 的 是” with “the only regret was in the”) and a few of the smaller

alignments are still possible (e.g., “唯一” may still be aligned straightforwardly to “only”)

but the extra alignment links in the lower alignment force the Chinese span NP1 to be

incoherent: its projection in the English side of the lower alignment surrounds the projection

of words (e.g. 是) that do not belong to NP1.
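
One way to make this precise, in the spirit of the phrase-extraction consistency condition (the exact definition used in section 5.2 may differ in details), is sketched below: a source-side span is coherent under an alignment if the target positions it projects to form a range containing no target word linked to a source word outside the span.

    def span_is_coherent(span, alignment):
        """`span` is a (start, end) pair of 0-based source indices (end exclusive);
        `alignment` is a set of (source_index, target_index) links."""
        inside = {t for s, t in alignment if span[0] <= s < span[1]}
        if not inside:
            return True                  # an unaligned span is trivially coherent
        lo, hi = min(inside), max(inside)
        outside = {t for s, t in alignment if not (span[0] <= s < span[1])}
        return all(t < lo or t > hi for t in outside)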

This chapter makes explicit this mechanism for describing the coherence of a monolingual


Chinese: 唯一 遗憾 的 是 单杠 。 (the span NP1 covers 唯一 遗憾 的)
English: The only regret was in the horizontal bar .

Figure 5.1: A Chinese sentence (about the 2008 Olympic Games) and its translation, with

reference alignments (above) and alignments generated by unioned GIZA++ (below). Bold

dashed links in the lower link-set indicate alignments that force NP1 to be incoherent.

span in an aligned bitext, and explores the coherence of syntactically-motivated spans over

alignments generated by human and machine. Further exploration uses this measure of

coherence to choose among alignment candidates derived from multiple machine alignments,

and a following approach uses coherent regions to assemble a new, improved alignment from

two automatic alignments.

Section 5.1 describes the relevant background for this chapter. Section 5.2 outlines the

notion of coherence used here, and describes how it is computed on a given span. Sec-

tion 5.3 outlines the preparation of data for the explorations performed here: the corpora of

Chinese-English bitexts and manual alignments, and the construction of several automatic

alignments for comparison with these coherence metrics. Section 5.4 examines the per-

formance of the various alignment systems in terms of alignment quality (against manual

alignments) and the coherence of certain linguistically-motivated categories, and demon-

strates that the coherence measures correspond to the alignment quality of those systems.

In section 5.5, we explore using the coherence measures to select a better alignment from


a pool of alignment candidates, and section 5.6 explores the creation of hybrid alignments

by combining members from the varied system-alignments assembled here. Section 5.7

discusses the implications (linguistic and practical) of these findings.

5.1 Background

Word alignments, as discussed in section 2.4.1, are an important part of the preparation of

a parallel corpus for the training of statistical machine translation engines. A wide variety

of statistical systems build their models from aligned parallel corpora – whether to extract word-by-word translation parameters, as in the IBM models [Brown et al., 1990], “phrase” tables, as in Moses [Koehn et al., 2007], or the rules of more syntactically-involved systems such as the Galley et al. [2006] syntactic translation models. As a tool for building and evaluating these

aligned parallel corpora, Och and Ney [2003] proposed an alignment evaluation scheme

“alignment error rate” (AER), in the hope that an intrinsic measure for evaluating alignments could shorten the development cycle for new statistical machine translation systems

(eliminating the need to try the entire pipeline).

AER is based on an F -measure over reference alignments (“sure”, S) and proposed (A)

alignment links:

AER(S, A) = 1 − (2 × |S ∩ A|) / (|S| + |A|)        (5.1)

This formulation1 measures individual links rather than groups of links. A variety of other

systems have explored using supervised learning over manually-aligned corpora to improve

AER, with some success, including Ayan et al. [2005a,b], Lacoste-Julien et al. [2006] and

Moore et al. [2006], who mostly focused on improving the AER over English-French parallel

corpora.
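To make the formulation in equation 5.1 concrete, here is a minimal sketch (in Python; not code from the dissertation) of computing the simplified AER over two sets of links. The representation of a link as an (English index, Chinese index) pair is an assumption for illustration only.

    def aer(sure, proposed):
        """Simplified AER (equation 5.1): 1 - 2|S ∩ A| / (|S| + |A|).

        sure, proposed: sets of (english_index, chinese_index) alignment links.
        """
        overlap = len(sure & proposed)
        return 1.0 - (2.0 * overlap) / (len(sure) + len(proposed))

    # Toy example: 3 of the 4 proposed links match the 4 reference links.
    S = {(0, 0), (1, 1), (2, 3), (3, 2)}
    A = {(0, 0), (1, 1), (2, 3), (3, 4)}
    print(round(aer(S, A), 3))  # 1 - 6/8 = 0.25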

Other metrics exist, e.g. CPER [Ayan and Dorr, 2006], which measures the F of possible

aligned phrases for inclusion in the phrase table, but in pilot experiments we found that per-

sentence oracle CPER was not consistent with an improvement in global CPER performance.

Fraser and Marcu [2006, 2007] find that optimizing alignments towards an AER variant

1The original formulation of AER was defined with both “sure” and “possible” reference alignment links. No reference data available for this task uses “possible” alignment links, so only the simplified version is presented here.


that weights recall more heavily than precision improves BLEU performance on

the language pairs they explored (English-Romanian, English-Arabic, and English-French).

However, Fossum et al. [2008] find that they can improve a syntactically-oriented statistical

machine translation engine by improving precision; their work focuses on deleting individual

links from existing GIZA++ alignments, using syntactic features derived directly from the

syntactic translation engine for Chinese-English and Arabic-English translation pairs.

Another approach to improving alignments with grammatical structure is to do simulta-

neous parsing and use the parse information to (tightly or loosely) constrain the alignment,

as in Lin and Cherry [2003], Cherry [2008], Haghighi et al. [2009] and Burkett et al. [2010],

who constrain parsers of one (or both) languages to engage in the parallel alignment pro-

cess. Rather than combine parse or span constraint information into a machine translation

or alignment decoder, this chapter explores span coherence measures (with spans derived

from a syntactic parser of Chinese) to select from multiple machine translation alignment

candidates over a corpus of manually-labeled Chinese-English alignments. Since evidence

for preferring an alignment error measure that over-weights precision or recall seems to be

ambiguous (and possibly dependent on the choice of translation engine), we retain AER as

the measure of alignment quality, and we explore the coherence measures’ ability to help

select alignments to reduce AER.

5.2 Coherence on bitext spans

We define a span s to be any region of adjacent words f_i · · · f_k on one side (here the source language) of a bitext. Given a set of links a of the form 〈e_m, f_n〉, we define the projection of a span to be all nodes e such that a link exists between e and some element within s. We further define the projected range s′ of the span s to be:

s′ = e_{min{i : e_i ∈ proj(s)}} · · · e_{max{i : e_i ∈ proj(s)}}

and we define the reprojection of the span s to be the projected range of s′ (identifying a

range of nodes in the same sequence as s).

We may thus describe a span s as coherent when the reprojection of s is entirely

within s. However, we find it useful to categorize spans into four categories, characterized


Table 5.1: Four mutually exclusive coherence classes for a span s and its projected range s′

coherent       The reprojection of s is entirely within s
null           No link includes any term in f_i . . . f_k
subcoherent    s is not coherent, but s′ is coherent
incoherent     Neither s nor s′ is coherent

[Figure content: source words f_i . . . f_{i+4} aligned to target words e_j . . . e_{j+4}, with example spans s_0, s_1, s_2, s_3 on the source side and projected ranges s′_1, s′_2 on the target side.]

Figure 5.2: Examples of the four coherence classes. s1 is coherent (because it is its own

reprojection); s0 is null; s2 is incoherent (because its reprojection is s1 rather than a subset

of s2); and s3 is subcoherent (because its projection span s′1 is coherent).

in table 5.1. Figure 5.2 also includes examples of each of the coherence classes. While

the coherent, incoherent, and null coherence classes are fairly easily explained, subcoherent

spans are worth a brief digression: these spans often appear in alignments of two corre-

sponding phrases with non-compositional meanings. Such phrases often form a complete

bipartite subgraph, in that every source word in the phrase is linked to every target word

in the phrase. Any span that includes less than the entire phrase (on one side or the other)

will be subcoherent.

Unlike AER, coherence is not a measure against the reference alignment; it is instead

a measure of a particular span’s behavior in an alignment. It is not necessarily a sign of a

high-quality alignment, but section 5.4 explores how coherence corresponds with AER over

a pool of automatic alignment candidates.
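To make these definitions concrete, the following minimal sketch (an illustration under assumptions, not the dissertation's code) classifies a source-side span into the four classes of table 5.1. Alignments are represented as sets of (e_index, f_index) links, and spans as inclusive (lo, hi) index ranges.

    def proj_range(links, span, source_side=True):
        """Projected range of a span: (min, max) index linked to it, or None.

        links are (e_idx, f_idx) pairs; when source_side is True the span is on
        the f (Chinese) side and the projection is onto the e (English) side.
        """
        lo, hi = span
        hit = ([e for (e, f) in links if lo <= f <= hi] if source_side
               else [f for (e, f) in links if lo <= e <= hi])
        return (min(hit), max(hit)) if hit else None

    def is_coherent(links, span, source_side=True):
        """A span is coherent when its reprojection lies entirely within it."""
        p = proj_range(links, span, source_side)
        if p is None:
            return False
        rp = proj_range(links, p, not source_side)   # the reprojection of the span
        return span[0] <= rp[0] and rp[1] <= span[1]

    def coherence_class(links, span):
        """Classify a source-side span into the four classes of table 5.1."""
        p = proj_range(links, span)
        if p is None:
            return "null"
        if is_coherent(links, span):
            return "coherent"
        if is_coherent(links, p, source_side=False):   # s' is coherent on its own side
            return "subcoherent"
        return "incoherent"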


Table 5.2: GALE Mandarin-English manually-aligned parallel corpora used for alignment

evaluation and learning. Numbers here reflect the size of the newswire data available from

each corpus. Note that Phase 5 parts 1 and 2 (LDC2010E05 and LDC2010E13) had no

newswire data included.

LDC-ID    Name                                                          Sentences   English words   Chinese words
2009E54   Chinese Word Alignment Pilot                                        290           8,818           6,329
2009E83   Phase 4 Chinese Alignment Tagging Part 1                          2,092          76,487          55,145
2009E89   Phase 4 DevTest Chinese Word Alignment Tagging                    2,829         101,484          73,794
2010E37   Phase 5 Chinese Parallel Word Alignment and Tagging Part 3          962          33,537          22,018
Total                                                                       6,173         220,327         157,286

5.3 Corpus

These experiments focus on the alignment of Chinese-English parallel corpora. They make

use of both unaligned (sentence-aligned but not word-aligned) corpora and manually-aligned

corpora (aligned by both sentence and word). The key corpora for the experiments in this

chapter are the manual alignments generated by the GALE [DARPA, 2008] project. These

alignments take text and transcripts of spoken Mandarin Chinese and translations of both

into English and provide manual annotation of alignment links between the English and

Chinese words. Since Chinese word segmentation is not given by the text, the manual

alignments link English words to Chinese characters, even when more than one Chinese

character is required to form a word. Table 5.2 lists the sets of manual corpora used to

evaluate the aligners (and to train the rerankers). Chinese word counts are listed using

the number of words provided by automatic segmentation, and the alignment links (which

were manually aligned to individual characters) are collapsed to link the English words to

segmented Chinese words (rather than characters). The experiments in this chapter use only

the newswire segments of these corpora, so (although other genres of text and transcript

are available) only those numbers and sizes are reported here.


5.3.1 Corpus preparation

The analyses in this chapter are based on a comparison among these manual alignments and

those generated by automatic systems. The most popular automatic aligners, GIZA++ [Och and Ney, 2003] and the Berkeley Aligner [DeNero and Klein, 2007], are unsupervised, but require training on very large bodies of parallel text; here we perform such training to avoid overly pessimistic automatic alignment results. Table 5.3 lists the component corpora used to train the unsupervised aligners. As in table 5.2, the Chinese word count reflects the number of word tokens returned by automatic segmentation.

State-of-the-art SMT systems for Chinese-to-English translation do word segmentation

and text normalization (the replacement, for example, of numbers and dates by $number

and $date tokens) before providing parallel text to the unsupervised aligner. In order to

provide automatic alignments for the corpora in table 5.2, all the corpora (both aligned

and unaligned, though alignments were discarded at this stage) were passed through the

Stanford word segmenter [Chang et al., 2008] and the SRI/UW GALE text normalization

system (on the Chinese side) and the RWTH text normalization system (on the English

side). Three aligners were trained on the resulting segmented and normalized parallel text:

• the Berkeley aligner [DeNero and Klein, 2007], referred to hereafter as berkeley, which uses a symmetric alignment strategy;

• the GIZA++ aligner [Och and Ney, 2003], projecting from source-to-target (f -e),

which we refer to as giza.f-e; and

• the GIZA++ aligner, projecting from target-to-source (e-f), referred to as giza.e-f.

For each of the giza trainings, we further generate multiple additional alignment candidates:

the giza.e-f.NBEST and giza.f-e.NBEST lists retrieve the N = 10 best alignments from

each of the GIZA++ trainings. The berkeley system does not support N -best generation.

The parallel corpora from table 5.3 are then discarded: their role is only to improve

the unsupervised aligners trained above. Over the parallel text corpora in table 5.2, all of


Table 5.3: The Mandarin-English parallel corpora used for alignment training

Name (ID)                                                         Sentences   English words   Chinese words
ADSO Translation Lexicon                                            179,284         265,705         267,466
Chinese English News Magazine Parallel Text                         269,479       9,233,773       8,826,377
Chinese English Parallel Text Project Syndicate                      45,767       1,069,021       1,129,198
Chinese English Translation Lexicon (v3.0)                           81,521         135,261          93,073
Chinese News Translation Text Part 1                                 10,264         314,377         279,512
Chinese Treebank English Parallel Corpus                              4,064         123,825          92,996
CU Web Data (Oct 07)                                                 34,811         883,886         894,809
FBIS Multilingual Texts                                             123,950       4,037,811       3,011,172
Found Parallel Text                                                 180,222       5,345,040       4,713,169
GALE Phase 1 Chinese Blog Parallel Text                               8,620         185,637         166,508
GALE Phase 2r1 Translations                                          14,768         347,480         286,093
GALE Phase 2r2 Translations                                           4,794         128,111         104,330
GALE Phase 2r3 Translations                                          21,360         387,458         322,524
GALE Phase 3 OSC Alignment (v1.0.FOUO)                                4,915         183,812         134,975
GALE Phase 3r1 Translations                                          40,503         643,745         595,411
GALE Phase 3r2 Translations                                           5,786         177,357         149,250
GALE Y1 Interim Release Translations                                 20,926         446,367         398,043
GALE Y1Q1 Translations                                                6,618         147,574         128,740
GALE Y1Q2 FBIS NVTC Parallel Text (v2.0)                            404,368      14,729,700      12,070,648
GALE Y1 Q2 Translations (v2.0)                                        9,382         194,171         172,106
GALE Y1 Q3 Translations                                              11,879         283,354         247,961
GALE Y1 Q4 Translations                                              30,496         572,210         506,563
Hong Kong Parallel Text                                             699,665      16,154,447      14,650,516
MITRE 1997 Mandarin Broadcast News Speech Translations (HUB4NE)      19,672         414,762         365,157
UMD CMU Wikipedia translation                                        77,162         181,592         145,069
Xinhua Chinese English Parallel News Text (v1β)                     103,415       3,455,994       3,411,085
Total                                                             2,413,691      60,042,470      53,162,751


the resulting alignments (including the reference alignments) were reconciled with the pre-

normalized text (using dynamic programming to synchronize the English side and retrieving

the original text from the SRI/UW GALE text-normalization system), but the Chinese word

segmentation was retained (for compatibility with later parsing).

Finally, we generate still more alignment candidates by performing union and intersec-

tion on the giza candidates and their corresponding N -best lists:

• The giza.union alignment is the union of giza.e-f and giza.f-e.

• Correspondingly, the giza.intersect alignment is the intersection of giza.e-f and

giza.f-e.

• The giza.union.NBEST and giza.intersect.NBEST alignments are alignments that

choose alignments from the giza.e-f.NBEST and giza.f-e.NBEST lists and union (or

intersect) them. In these experiments, the ranks of the e-f and f-e elements in these

unions or intersections are constrained to sum to no more than N + 1 = 11 (the union of the rank-2 e-f candidate with the rank-9 f-e candidate is acceptable because 2 + 9 = 11 does not exceed 11, but rank 5 with rank 7 is not included because 5 + 7 = 12 > 11).
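This rank constraint can be stated in a few lines; the sketch below (the function name and data layout are illustrative assumptions, not the dissertation's code) enumerates the unioned candidates whose component ranks sum to at most N + 1.

    def nbest_unions(ef_nbest, fe_nbest, n=10):
        """Union candidates from two directional N-best lists (lists of link sets),
        keeping only pairs whose 1-based ranks sum to at most n + 1."""
        out = []
        for i, ef in enumerate(ef_nbest, start=1):
            for j, fe in enumerate(fe_nbest, start=1):
                if i + j <= n + 1:          # e.g. ranks (2, 9) are kept; (5, 7) are not
                    out.append(((i, j), ef | fe))
        return out

The same loop with `ef & fe` in place of `ef | fe` would generate the intersection candidates.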

5.3.2 Alignment performance of automatic aligners

Table 5.4 reports the AER as computed against the manual labeling of the corpus for the

five automatic alignments (excepting the N -best lists). It also includes the precision, recall,

and link density (in proportion to reference ldc link density). The Berkeley aligner has the

best AER, and its nearest competitor (the giza.union alignment) has the best alignment

recall. As one might expect, the giza.intersect alignments have the highest precision,

but this high precision comes at a high cost in recall (and AER). This table also includes

the per-sentence AER oracle alignment, which reflects the AER (as well as precision and

recall) of choosing, for each segment, the single alignment from the pool of five with the

best AER.2 The Berkeley system, we may note, seems to be strongly precision-heavy, while

2The per-sentence oracle is not necessarily the best possible overall AER from this candidate pool, since under some circumstances (especially when the precision/recall proportions are very imbalanced) one may minimize sentence-level AER and increase global AER.


Table 5.4: Alignment error rate, precision, and recall for automatic aligners. Link density

is in percentage of ldc links.

System                   AER   Precision   Recall   Link density (%)
berkeley               32.87       84.21    55.81              68.52
giza.e-f               36.46       70.14    58.08              88.22
giza.f-e               40.31       76.78    48.82              67.06
giza.intersect         42.37       96.78    41.04              32.90
giza.union             35.42       63.34    65.87             122.38
(per-sentence) oracle  30.12       80.01    62.02              75.78

its competitor giza.union is more balanced, but stronger on recall.

As we might expect, the giza.intersect and giza.union alignments have the lowest

and highest link density respectively. Precision seems to rise as link density drops (again,

not unexpectedly), but berkeley is more precise than either of the directional giza systems

while having a higher link density. Even the oracle selection has a lower link density than

100%, because most of the candidates from which the per-sentence oracle selects are lower-density

than the reference ldc alignments.

5.4 Analyzing span coherence among automatic word alignments

This section poses two questions:

• what kinds of spans are reliably coherent in reference alignments?

• what varieties of coherent spans are not captured well by current alignment algo-

rithms?

We examine the coherence of reference alignments to answer the first question, and compare

those coherences to those generated by the unsupervised automatic alignment systems. We

explore both syntactic and orthographic (not explicitly syntactic) techniques for identifying

spans over the Chinese source sentences.


Table 5.5: Coherence statistics over the spans delimited by comma classes

Span (count)       Coherence   ldc (ref)   giza.union   berkeley   giza.intersect
comma (16,932)     yes              77.9         23.4       58.9             83.1
                   no               19.2         60.6       35.3             13.7
                   sub               2.8         16.0        5.3              0.0
                   null              0.1          0.0        0.5              3.2
tadpole (14,682)   yes              83.5         22.8       62.1             88.0
                   no               14.0         59.5       32.6             10.0
                   sub               2.3         17.7        5.0              0.0
                   null              0.1          0.0        0.3              2.0
(All values are percentages of spans under each alignment.)

5.4.1 Orthographic spans

The first class of spans we consider are spans that may be extracted from orthographic

“segment” choices, namely, spans that are delimited by commas on the Chinese side of the

bitext. This delimitation is made more complicated by a property of Chinese orthography:

in many Chinese texts, a special3 “enumeration comma” is used to delimit items in a list.

If this standard were used uniformly, it would actually be a useful distinction, but the

enumeration comma is only used sometimes: on many occasions, when an enumeration

comma would be correct, Chinese writers or typesetters will use U+002C COMMA, which

we dub a “tadpole” comma to distinguish it from the conjoined class that includes the

enumeration comma. Nevertheless, the converse error (using enumeration commas when

tadpole commas are appropriate) does not seem to occur in the corpus, so there is still

information available in using only the tadpole commas as delimiters.
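As an illustration of the two delimitation schemes, the following sketch (an illustrative assumption, not the dissertation's code; in particular, the exact set of "tadpole" comma characters is assumed here) splits a segmented Chinese word sequence into spans, either at every comma-class token or only at the tadpole commas, excluding the enumeration comma 、 (U+3001).

    ENUM_COMMA = "\u3001"                 # 、 the enumeration comma
    TADPOLE_COMMAS = {",", "\uff0c"}      # assumed set of ordinary comma tokens
    ALL_COMMAS = TADPOLE_COMMAS | {ENUM_COMMA}

    def delimited_spans(words, delimiters):
        """Inclusive (start, end) word-index pairs between delimiter tokens."""
        spans, start = [], 0
        for i, w in enumerate(words):
            if w in delimiters:
                if i > start:
                    spans.append((start, i - 1))
                start = i + 1
        if start < len(words):
            spans.append((start, len(words) - 1))
        return spans

    toy = ["甲", "\uff0c", "乙", "\u3001", "丙"]     # toy token sequence
    print(delimited_spans(toy, ALL_COMMAS))         # comma spans:   [(0, 0), (2, 2), (4, 4)]
    print(delimited_spans(toy, TADPOLE_COMMAS))     # tadpole spans: [(0, 0), (2, 4)]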

3Unicode uses code U+3001 for this symbol (、) and unfortunately dubs this character IDEOGRAPHIC COMMA, which is a misnomer: its Chinese name is “顿号”, which may be glossed as pause symbol.

We explore dividing sentences into orthographic regions using the orthographic dividers of comma-delimited spans and “tadpole”-delimited spans. These delimiters — when

present — divide the sentence into non-overlapping regions. Table 5.5 shows the distribu-

tion of coherence values for comma-delimited spans and for tadpole-delimited spans over

the manually- and automatically-generated alignments in table 5.2. Sentences without a

comma-delimiter are omitted from the counts here, or the proportion of coherence would

be (trivially) higher in all alignments (since a span covering the entire sentence will al-

ways be coherent). From the differences in the reference (ldc) alignments, we can see that using

tadpole spans instead of commas improves the proportion of coherent spans to 83.5% and

reduces the proportion of non-coherent spans; this result speaks to the utility of excluding

the enumeration comma from use as a delimiter.

Also, we may observe that the berkeley alignments, which have the best AER of the

three automatic systems compared here, also consistently have intermediate values (be-

tween the low-recall giza.intersect system and the low-precision giza.union) in all four

of the coherence classes. Among these orthographic spans, the high-precision, low-recall

giza.intersect system performs the closest to ldc manually-annotated alignments, but

overpredicts both coherence and null-links, probably because its link density is too low

overall.

By comparing the coherence measures over these orthographic spans, we find supporting evidence that the berkeley alignments are the best (because they are the most

similar to the reference). It is difficult to say whether the orthographic (comma- and

tadpole-delimited) spans are useful constraints on alignment regions for evaluating align-

ment accuracy, however: the giza.union and giza.intersect results confirm that the

over-linking and under-linking (respectively) even cross these comma delimiters. However,

orthographic delimitation represents a mixture of different linguistic phenomena, so we turn

instead to grammatical span exploration.

5.4.2 Syntactic spans

To explore the information available from the parser, we parse each of the source sentences in

the aligned corpus with a parser [Harper and Huang, 2009] tuned to produce Penn Chinese


Table 5.6: Coherence statistics over the spans delimited by certain syntactic non-terminals

Span (count)    Coherence   ldc    giza.union   berkeley   giza.intersect
NP (59,635)     yes         72.4         44.5       74.9             81.0
                no          16.4         44.6       17.0              4.1
                sub         10.0         10.9        3.9              0.0
                null         1.2          0.0        4.2             14.9
VP (37,167)     yes         64.4         26.3       64.0             78.5
                no          20.7         58.1       28.1              9.6
                sub         14.4         15.7        5.3              0.0
                null         0.6          0.0        2.6             11.9
IP (14,738)     yes         65.7         22.7       59.8             84.2
                no          17.9         60.6       33.9             10.4
                sub         16.2         16.7        5.5              0.0
                null         0.2          0.0        0.7              5.4
(All values are percentages of spans under each alignment.)

Treebank [Xue et al., 2002] parse trees. The Chinese word segmentation from the alignment

steps in section 5.3 is retained.

Table 5.6 shows the same systems as the previous section, but using instead the spans

labeled4 by the parser as NP, VP, or IP (noun, verb, or inflectional phrase; IP is the Chinese

Treebank equivalent to a sentence or small clause). We choose these categories because they

are three core non-terminal categories of the treebank, each with a strong and relatively

theory-agnostic linguistic basis. Furthermore, these three categories together make up more

than 70% of the non-terminals in the parse trees produced by the automatic parsers used

in these experiments. It is reasonable to expect that most of these phrases are coherent in

the reference alignment, and indeed they are (72.4% coherent NPs, 64.4% coherent VPs,

4Spans that cover the entire sentence are not included in these counts; by definition such spans are always coherent, but this is not informative.


and 65.7% coherent IPs).

Again, we may observe in table 5.6 that the berkeley system’s coherence is interme-

diate between the giza.union and giza.intersect systems’ coherence values. For these

syntactic spans, the berkeley alignments are much closer to the human labels than the

giza.intersect, which substantially overpredicts coherence of these smaller units. How-

ever, berkeley alignments overpredict incoherent spans on VPs and IPs, and giza.union

also overpredicts incoherence on NPs. Together, these results suggest that the union align-

ments are too link-dense, the intersect alignments too sparse, and the berkeley align-

ments just about right — although berkeley seems to still make syntactically-unaware

errors, inducing incoherent spans.

It is interesting to note that the giza.intersect results are actually over-coherent,

due to their low link density, but that alignment also has a worse AER (due to its low recall).

Accordingly, high coherence is not necessarily neatly correlated with improvements to AER.

5.4.3 Syntactic cues to coherence

We find in the previous two subsections that (for example) though roughly 83.5% of tadpole

spans are coherent under the LDC (reference) alignments, only about 65% of IP and VP

spans are coherent in the reference alignments. These proportions are low enough, for the

syntactic classes, to suggest inquiry into what characteristics indicate that a span of given

syntactic label XP is likely to be coherent. To explore this question, we build a binary

decision tree using the WEKA toolkit [Hall et al., 2009] over each collection of XP spans

(where XP ∈ {NP,VP, IP}), where the decision tree is binary over the following syntactic

features from that XP ’s structure:

• whether that span is also a tadpole-delimited span,

• the syntactic tags of that XP ’s syntactic children,

• the syntactic tags of that XP ’s syntactic parents, and

• the length of the XP in question.
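To make this setup concrete, here is a minimal sketch of building such a classifier over the features just listed; it substitutes scikit-learn's DecisionTreeClassifier for the WEKA toolkit actually used here, and the span and feature representation is an illustrative assumption.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.tree import DecisionTreeClassifier

    def span_features(xp, tadpole_spans):
        """Features for one XP span (keys and representation are illustrative)."""
        feats = {
            "is_tadpole_span": (xp["start"], xp["end"]) in tadpole_spans,
            "length": xp["end"] - xp["start"] + 1,
            "parent=" + xp["parent_label"]: True,
        }
        for child in xp["child_labels"]:
            feats["child=" + child] = True
        return feats

    def train_coherence_tree(xp_spans, coherent_labels, tadpole_spans):
        """Fit a shallow binary decision tree predicting span coherence."""
        vec = DictVectorizer()
        X = vec.fit_transform([span_features(xp, tadpole_spans) for xp in xp_spans])
        tree = DecisionTreeClassifier(max_depth=2)   # keep the top forks readable
        tree.fit(X, coherent_labels)
        return vec, tree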


[Figure content, rendered as text:
(a) VP tree: VP (64% coherent) splits on NT-parent = CP (2,291 spans; 29.7% coherent) vs. NT-parent ≠ CP (34,876 spans; 66.6% coherent).
(b) IP tree: IP (65.7% coherent) splits on NT-parent = CP (3,362 spans; 29.9% coherent) vs. NT-parent ≠ CP (11,376 spans; 75.6% coherent).]

Figure 5.3: Decision trees for VP and IP spans. The decision tree did not find error-reducing

distinctions among the NP spans.

These features were chosen as a reasonable characterization of the syntactically-local in-

formation (roughly parallel to the information provided on arc-links in the SParseval

measure in chapter 3). Although some IP spans (unsurprisingly) cover the entire sentence,

spans over the entire sentence are not included in this analysis (whole-sentence spans are

always coherent, by definition). Figure 5.3 shows the top forks of the VP and IP decision

trees (the single decision offering the greatest error reduction). We may observe an inter-

esting commonality: in both IPs and VPs, the majority of spans with a parent label of

CP (“complementizer phrase”) are not coherent in the reference alignment. Anecdotally,

the CP-over-IP construction seems to occur incoherently in the bitext when the CP-over-IP

marks a construction that is divided in English, e.g. the example in figure 5.4. This kind

of construction, which may be expressed in English as a pre-and-post-modified NP (“[np

[np the largest X] in the world]”) or a left-modified NP (“[np the world’s largest X]”) is

likely to be incoherent, despite having a uniform analysis as CP-over-IP in Chinese (“[cp

[ip 世界 最大 X]]”). It is also worth observing that the IP node in this example has a

unary expansion to a VP predicate (with two parts), and so accounts for some of the same


Thailand be world most big comp rice export+nation .

泰国 是 世界 最 大 的 稻米 出口国 。

Thailand is the largest rice exporter in the world .

(spans ip1 and cp1 are marked over the Chinese side in the original figure)

Figure 5.4: An example incoherent CP-over-IP. Note that ip1’s reprojection to English is

actually larger than ip1 itself (since it includes “rice exporter” within its English projection

span) and larger than cp1 in which it is embedded. Had the English translation chosen the

phrasing “the world’s largest rice exporter”, ip1 would be coherent.

incoherent CP-over-VP spans in figure 5.3(a) as well.

It is also of note (though not visible in figure 5.3) that the tadpole feature was available

to the decision tree but was never selected, even when the tree was allowed to ramify further,

suggesting that this orthographic information is not useful in determining the coherence of

these syntactic spans.

From table 5.6 and figure 5.3, we may see that the base rate of coherence for NPs, VPs

and IPs (at least, those not immediate children of CPs) is about 70% for each, with IPs

being particularly promising — at 76% coherence — but relatively rare.

5.4.4 A qualitative survey of incoherent IP spans

Inspired by the example in figure 5.4, we extracted 50 spans labeled IP by the parser that

were incoherent according to the ldc reference alignment. The categories suggested here

were selected to attempt to characterize the reasons that these parser-indicated spans are

incoherent. The most common category of differences (about 30%) were alignments where

a clause-modifying adverb was used in English somewhere other than the left or right edge

of the clause (and where in Chinese, the clause-modifier lives outside the IP). One common

scenario among those examined here was a clause-external Chinese adverb that is aligned


Table 5.7: Some reasons for IP incoherence

Reason                                                                  n
Sentential adverb between subject and main verb in English            14
IPs in conjunction: English-language ellipsis; Chinese repeated word    8
Two-part predicates in Chinese pre- and post-modify noun in English     6
Punctuation differences (periods inside quotes)                         3
Other translation divergences                                          10
Parsing attachment errors introducing incoherence                       9

with an English adverb after the first (finite) verb in an English clause, as in figure 5.5.

Nearly 20% of the incoherent spans were incoherent because of parsing attachment errors,

usually because a Chinese adverb was attached low within an adjacent small IP when it

should have been considered a sentential modifier. Improving parser performance on the

correct attachment of clausal adverbs would be valuable here.

Another key challenge is found when Chinese and English disagree on whether all com-

ponents of a conjunct need to be repeated. Although Chinese omits pronouns in many

circumstances, it was the English conjunction of subject-less VPs that introduced incoher-

ence. About 16% of the incoherent IP spans are attributable to ellipsis in the English side

(alternatively, a choice to repeat a term in Chinese which is left out in English), e.g. as

in figure 5.6. The remaining categories of incoherent IP include Chinese two-part CP/IP

predicates that pre- and post-modify a noun in English (about 12%, as in figure 5.4) and a

small variety of others.

5.5 Selecting whole candidates with a reranker

In previous sections, we see evidence that the coherence of certain categories corresponds to

alignment quality, and that it is — at least in principle — possible to select an alignment

with better AER from the pool of candidates: the oracle scores in table 5.4 demonstrate


[Figure content, rendered as a bracketed tree:
[IP [NP 该指数 'this index']
    [VP [ADVP [AD 也有 'sometimes']]
        [VP [LB (glossed 'by')]
            [IP [NP 投资者 'investors']
                [VP 用作投资指南 'use as investment-guide']]]]]
“This index has sometimes been used by investors as an investment guide.”]

Figure 5.5: An example of a clause-modifying adverb (也有) appearing inside a verb chain.

Note that ldc alignment links link the lowest VP to English has, been, and used, so that

the projection of the lower IP contains the projection of the upper ADVP and LB spans.

that a better AER is possible.

As in chapter 3, we establish a reranking paradigm, where alignment candidates are

converted into feature vectors by a feature extraction step, and (in training) candidates’

optimization target (in this case, AER) is converted to a rank for training an svmrank

learner over the training candidates. The learner ranks the candidates in the pool, and we

report AER over the learner’s choice of top-ranked candidates. Because of the relatively

small amount of labeled data, we report results here over ten-fold cross-validation.

This arrangement allows variation in two key experimental variables:

Candidate pool : We may include candidates from all of the available aligners, or only

certain subsets. We define two pools of interest:

• experts is the “committee of experts” made up of all of the direct outputs of

the automatic aligners (berkeley, giza.e-f, giza.f-e, giza.intersect, and

giza.union)


[Figure content, rendered as a bracketed tree:
[TOP [IP [NP [NR 俄罗斯 'Russia']]
         [IP [NP 1997年估计 '1997-estimated']
             [VP [VV 增长 'growth'] [QP [CD 百分之零点五 '0.5%']]]]
         [PU ,]
         [IP [NP 1998年预计 '1998-estimated']
             [VP [VV 增长 'growth'] [QP [CD 百分之一点五 '1.5%']]]]
         [PU 。]]]
English translation: “Russia's estimated growth is 0.5% for 1997, and 1.5% for 1998.”]

Figure 5.6: An example of English ellipsis where Chinese repeats a word (增长, “growth”). The English translation has only one “growth”. To link both 增长 nodes to the English “growth” requires at least one of the lower IP nodes to be incoherent.


• giza.union.NBEST is the pool generated by performing the union operation on

the members of the giza.e-f.NBEST and giza.f-e.NBEST lists.

Feature selection : We may use any of a variety of features to rerank members of the

candidate pool. We define the following features:

• voting features include a binary feature for each expert system (e.g., berkeley

or giza.union); if a candidate is generated by that system, this feature will be

true; otherwise false.

• span-X features represent four features: the proportion of spans of type X that

are coherent (span-X-yes), subcoherent (span-X-sub), null-coherent (span-X-null),

or incoherent (span-X-no). For example, we may use span-NP features, which

provide features describing the coherence (or non-coherence) of the noun-phrases

in the sentence.
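A minimal sketch of turning one candidate into such a feature vector follows (illustrative only: coherence_class refers to the sketch in section 5.2, and the data layout is an assumption rather than the actual feature-extraction code).

    def candidate_features(links, spans_by_type, produced_by, all_systems):
        """Feature dict for one alignment candidate in the reranking pool.

        links          -- this candidate's set of (e_idx, f_idx) links
        spans_by_type  -- e.g. {"NP": [(0, 2), (5, 7)], "IP": [...]} on the Chinese side
        produced_by    -- names of the aligners that generated this candidate
        """
        feats = {"voter=" + s: float(s in produced_by) for s in all_systems}
        rename = {"coherent": "yes", "incoherent": "no",
                  "subcoherent": "sub", "null": "null"}
        for span_type, spans in spans_by_type.items():
            if not spans:
                continue
            counts = {"yes": 0, "no": 0, "sub": 0, "null": 0}
            for span in spans:
                counts[rename[coherence_class(links, span)]] += 1
            for cls, c in counts.items():
                feats["span-%s-%s" % (span_type, cls)] = c / len(spans)
        return feats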

5.5.1 Selecting candidates from expert committee

We conjecture that a reranking approach would help to select better alignments from a pool

of alignment candidates generated by a diverse set of aligners. In this experiment, the

alignment candidates berkeley, giza.e-f, giza.f-e, giza.intersect, and giza.union

are included in the pool to be reranked. For reranker features, we consider the span-comma,

span-tadpole, and span-IP features. Because of the IP coherence patterns gleaned from

figure 5.3(b), we further include span-nonCP-IP, which includes features of those IP spans

that are not the direct children of CP constituents. Finally, we include an experiment using

span-NT features, which are the set of features including span-X for all non-terminal

symbols used in the Chinese treebank (span-IP, span-DNP, span-QP, etc.). For all learners,

we include the voter feature to allow the reranker to include a learned estimate of the

quality of each committee member. As a baseline, we include a voter-only feature set,

which learns, as one might expect, to always select the committee member with the best

overall AER.

Table 5.8 shows the results of reranking the members of the committee of quality experts.


Table 5.8: Reranking the candidates produced by a committee of aligners.

Identity                     AER   Precision   Recall
isolated:
  berkeley                 32.87       84.21    55.81
  giza.e-f                 36.46       70.14    58.08
  giza.f-e                 40.31       76.78    48.82
  giza.union               35.42       63.34    65.87
  giza.intersect           42.37       96.78    41.04
voter only (∼berkeley)     32.87       84.21    55.81
voter & span-tadpole       32.95       83.49    56.02
voter & span-comma         33.13       82.00    56.46
voter & span-IP            33.10       82.69    56.18
voter & span-nonCP-IP      33.02       82.81    56.23
voter & span-NT            34.09       76.65    57.80
(per-sentence) oracle      30.12       80.01    62.02

Relative to the span-IP features, the span-nonCP-IP features have a larger improvement in

recall with a smaller loss to precision, which suggests that using spans which are generally

expected to be coherent may be helpful in this kind of reranking. However, we also observe

that none of the rerankers (in the lower half of the table) actually reduce AER from the

baseline (voter-only) system: instead, they boost recall, at varying costs to precision.

We also observe that the more features are involved, the larger the effect on recall, with

span-NT having the largest impact. For those cases with the same number of features (e.g.

span-IP and span-tadpole), the features with more spans in the data generally have larger

impact. In retrospect, this is unsurprising: the next best systems, beyond berkeley, are

giza.union and giza.e-f, which each have lower precision and higher recall: thus, when

the reranker chooses an alternative, it is usually choosing one of those two, improving recall

(and hurting precision). We even see this in the oracle: though its AER is superior to the

berkeley system, its precision is lower.


Table 5.9: Reranking the candidates produced by giza.union.NBEST.

Identity                                   AER   Precision   Recall
nbranks only (∼giza.union)               35.42       63.34    65.87
nbranks & span-tadpole                   35.41       63.35    65.87
nbranks & span-comma                     35.41       63.36    65.88
nbranks & span-IP                        35.42       63.34    65.87
nbranks & span-NT                        35.38       63.39    65.89
giza.union.NBEST (per-sentence) oracle   32.44       66.58    68.56

5.5.2 Selecting candidates from N -best lists

To avoid the problem suggested by the previous experiments (second-best candidates with very different precision and recall), we construct a separate experiment with only the alignments generated by giza.union.NBEST, which (as described in

section 5.3) include two new rank features (dubbed nbranks): the rank of the giza.e-f

member of the union and the rank of the giza.f-e member of the union.

Table 5.9 shows none of the precision-recall imbalance present in the experiments in

table 5.8. However, the coherence features do not seem to make much difference, only

nudging both precision and recall (non-significantly) higher.

5.5.3 Reranker analysis

In sections 5.5.1 and 5.5.2, we explored using a reranker to select candidates from a pool of

generated candidates. System combination at the level of whole candidate selection, as in

these experiments, works best when the systems under combination have similar operating

performance (similar quality) while also being diverse (making different kinds of errors).

In this perspective, the analyses here suggest that the committee of experts (section 5.5.1)

performed well in generating diversity, but the single best member of the committee (the

berkeley system) so outperformed its fellows that alternates only rarely improved the over-

all AER. Conversely, the experiments in section 5.5.2 selected from the giza.union N -best


Table 5.10: AER, precision and recall for the bg-precise alignment

System                                      AER   Precision   Recall
berkeley                                  32.87       84.21    55.81
giza.intersect                            42.37       96.78    41.04
bg-precise (berkeley ∪ giza.intersect)    32.38       83.91    56.62

lists, where the criterion of similar quality was met, but the candidates were insufficiently

diverse. We conjecture that the lack of improvement in AER from reranking is due to these

problems. However, it may be that the coherence features are not sufficiently powerful to

distinguish the candidates without incorporating lexical or other cues, since the berkeley

aligner’s coherence percentages for the different phrase types are not so different from the ldc percentages.

5.6 Creating hybrid candidates by merging alignments

Whole-candidate selection from the previous section suggests that the available candidates

are insufficiently diverse (when chosen from the N -best lists) and too dissimilar in perfor-

mance (when chosen from the committee of expert systems). As an alternative strategy, we

may perform partial-candidate selection, by constructing hybrid candidates, guided by the

syntactic strategies suggested here.

The analysis in section 5.3.2 shows that the berkeley and giza.intersect systems

are very high precision, but both have relatively low recall. By contrast, giza.union has

the best recall, but its precision suffers. We cast the problem of sub-sentence alignment

combination as a problem of improving the recall of a high-precision alignment. As a first

baseline, we may combine (union) berkeley and giza.intersect, the two high-precision

alignments from table 5.4, into a new precision alignment bg-precise, shown in table 5.10.

The bg-precise alignment has a lower precision than either of its component high-precision

alignments, but yields the best AER thus far, because of improvements to recall. This simple

combination, in fact, yields an AER better (although not significantly better) than the per-

sentence oracle AER from the giza.union.NBEST selection, providing further evidence that


the N -best lists are insufficiently diverse for reranking as they are.

5.6.1 Using “trusted spans” to merge high-precision and high-recall alignments

Although bg-precise improves recall to some degree, we would like to improve recall fur-

ther. The giza.union alignments have substantially better recall than any of the precision

alignments, so we adopt the strategy of merging only certain alignment links from the

giza.union alignment into the bg-precise alignments.

We introduce the notion of a “trusted span” on the source text, and define the guided

union over a high-precision alignment and a high-recall alignment and a set of trusted spans:

all links from the recall alignment that originate within one of the trusted spans, unioned

with all links from the precision alignment. This combination heuristic changes the problem

of combining alignments to the problem of identifying “trusted spans” from the source text.

5.6.2 Defining the syntactic “trusted span”

We may thus use the syntactic coherence analysis from section 5.4 to describe “trusted

spans” to be used in the guided union operation, and evaluate the resulting guided union

alignment according to the same AER metrics we have used throughout this chapter.

We extract syntactic trusted spans in a bottom-up recursion from the syntactic tree,

defining trusted XP spans with the following heuristic: an XP span s is trusted for the

process of the guided union between a precision-oriented alignment P and a recall-oriented

alignment R when

• s is coherent in P ,

• s is coherent in R,

• all XP spans contained within s are also trusted.

These spans define a guided union P ∪_XP R between a precision-oriented alignment P and a recall-oriented alignment R.
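A minimal sketch of the trusted-span recursion and the guided union follows (illustrative only: the tree and alignment representations are assumptions, and is_coherent refers to the sketch in section 5.2).

    def trusted_spans(node, p_links, r_links, label="NP"):
        """Bottom-up collection of trusted spans of the given label: spans coherent
        in both alignments whose labelled descendants are all trusted as well.
        Tree nodes are dicts with "label", "start", "end", "children" keys
        (an illustrative representation).  Returns (spans, subtree_ok)."""
        spans, subtree_ok = [], True
        for child in node.get("children", []):
            child_spans, ok = trusted_spans(child, p_links, r_links, label)
            spans.extend(child_spans)
            subtree_ok = subtree_ok and ok
        if node.get("label") == label:
            span = (node["start"], node["end"])
            ok_here = (subtree_ok
                       and is_coherent(p_links, span)
                       and is_coherent(r_links, span))
            if ok_here:
                spans.append(span)
            subtree_ok = ok_here
        return spans, subtree_ok

    def guided_union(p_links, r_links, spans):
        """All precision links, plus recall links that originate inside a trusted span."""
        extra = {(e, f) for (e, f) in r_links
                 if any(lo <= f <= hi for (lo, hi) in spans)}
        return p_links | extra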


[Figure content: both panels show the sentence pair
  gloss:    China high+new tech. develop+zone prepare p 80+decade early
  Chinese:  中国 高新 技术 开发区 酝酿 于 八十年代 初 。
  English:  The new high level technology development zones of China were brewed in the early 1980 ’s .
with np spans and the maximal span np-max marked on the Chinese side.
(a) Precision alignment
(b) Resulting alignment (dashed lines are new)]

Figure 5.7: Example of an np-guided union. The precision alignment (a) and the recall

alignment (b) both agree that each np span is coherent (and all np sub-spans are coherent).

np-max may thus be used as a trusted span, allowing us to copy the heavy dashed links

from the recall alignment into the union.


Table 5.11: AER, precision and recall over the entire test corpus, using various XP -

strategies to determine trusted spans

System                                     AER   Precision   Recall
Recall system R = giza.union             35.42       63.34    65.87
Precision system P1 = giza.intersect     42.37       96.78    41.04
P1 ∪_XP R, where XP =
  IP                                     41.23       92.71    43.02
  VP                                     41.16       93.07    43.01
  IP or VP                               41.05       92.91    43.17
  NP                                     39.56       93.82    44.57
  NP or VP                               39.23       93.00    45.13
Precision system P2 = bg-precise         32.38       83.91    56.62
P2 ∪_XP R, where XP =
  IP                                     32.29       82.61    57.36
  VP                                     32.17       82.77    57.47
  IP or VP                               32.15       82.73    57.50
  NP                                     31.61       83.16    58.07
  NP or VP                               31.51       82.90    58.35

Thus we may define, for example, the NP-trusted guided union (P ∪_NP R): the guided union whose trusted spans are the maximal NPs that are coherent in both P and R and all of whose descendant NPs are also coherent.

Figure 5.7 illustrates an NP-guided union P ∪_NP R, in which we can see, at least anecdotally, that it is reasonable to expect this syntactic mechanism for selecting trustworthy links to be helpful in extending a high-precision alignment by improving recall without hurting precision much.

Table 5.11 shows the results of using this guided union heuristic to generate new align-

ment candidates, using two different alignments (giza.intersect and bg-precise) for the

P role and giza.union as the R role. We see the same trends for each choice of P align-

ment: using P ∪_NP R has the smallest reduction in precision. It also has the second-largest

improvement in recall, with the best performance going to the union-guide that uses both


VP and NP spans to form the trusted spans. By contrast, while the P ∪_VP R and P ∪_IP R guided unions both produce a reduction in AER, this reduction is small (and using both

VP and IP together seems to make little improvement, probably because VP and IP spans

are often nested).

However, guided union approaches are not sufficiently powerful to overcome the ex-

tremely low link-density of the giza.intersect alignment — precision and recall trade

off nearly four percentage points but the corresponding improvement (due to moving the F

measure towards balance) is not sufficient to bring AER below the giza.union AER. When the precision alignment starts out more balanced (as in bg-precise), the guided union effects

can drive AER to new minima: using np and vp spans to guide the trusted-span selection

produces the best overall AER of 31.51%.

5.7 Discussion

In this chapter, we have presented a new formalism for quantifying the degree to which a

bitext alignment retains the coherence of certain spans on the source language. We evaluate

the coherence behavior of some orthographically- and syntactically-derived classes of Chi-

nese spans on a manually-aligned corpus of Chinese-English parallel text, and we identify

certain classes (motivated by orthography and syntax) of Chinese span that have consis-

tent coherence behavior. We argue that this coherence behavior may be useful in training

improved statistical machine translation systems by improving statistical machine-translation alignments.

To improve alignments, this chapter explored the potential for alignment system combi-

nation, following at first the approach of choosing candidates from a committee of experts or

from the N -best lists generated by the GIZA++ toolkit (using a reranking toolkit). These

initial experiments found that the needs of system combination (systems of rough parity,

with usefully different kinds of errors) were not met, and we turned to sub-segment system

combination. In this approach, we define a syntax-guided alignment hybridization between

a high-precision and a high-recall alignment, and show that the resulting alignments, hy-

bridized with guidance from syntactic structure, have a better performance in AER than


the best alignments produced by the component expert systems.

These results, taken together, suggest that source-side grammatical structure and co-

herence can be a useful cue to quality alignment links in producing good alignments for the

training of statistical machine translation engines.


Chapter 6

CONCLUSION

The work in this dissertation is motivated by the hypothesis that linguistic dependency

and span parse structure is an informative and powerful representation of human language,

so much so that accounting for parse structure will be useful even to those applications where

only a word sequence is produced, e.g. a speech transcript or a text translation. In support

of this hypothesis, this work has presented three ways that parse structure (as provided

by statistical syntactic parsers) may be engaged with these large sequence-oriented natural

language processing systems. This chapter summarizes the work presented here (section 6.1)

and suggests directions for further study of these research areas individually (section 6.2)

and for parsing as a general-purpose parse decoration tool for these and related applications

(section 6.3).

6.1 Summary of key contributions

Chapter 3 demonstrated that it is possible to improve the performance of a speech recog-

nizer and a parser by rescoring the two systems jointly: the speech-recognizer’s output is

improved (in terms of WER) by exploiting information from parse structure, and the parse

structure resulting from parsing the speech recognizer’s output may be improved by con-

sidering alternative transcript hypotheses while evaluating resulting parses. This research

also found that the utility of the parse structure was strongly dependent on the quality of

the speech segmentation: parse structure was much more valuable in the context of high-

quality segment boundaries than in the context of using default speech recognizer segment

boundaries. In addition, we present a qualitative analysis of the use of parse structure

in selecting transcription hypotheses, finding improvement, for example, in the prediction

of pronouns and the main sentential verb, which would be critical for use in subsequent linguistic applications.


Chapter 4 applies parse structure to a rather different domain: the evaluation of sta-

tistical machine translation. Like speech recognition, SMT evaluation is dominated by

sequence-focused models, but this work introduces an application of syntax to SMT eval-

uation: a parse-decomposition measure for comparing translation hypotheses to reference

translations. This measure correlates better with human judgement of translation quality

than BLEU4 and TER, two popular sequence-based metrics. We further explore combining

the new technique with other cutting-edge SMT evaluation techniques like synonym and

paraphrase tables and show that the combined techniques (syntax and synonymy) perform

better than either alone, although the gains are not strictly additive.

Chapters 3 and 4 both explore the utility of considering (probability-weighted) alterna-

tive parse hypotheses when using parse structure in their tasks. In the parsing-speech tasks

of chapter 3, using this extra information in joint reranking with recognizer N -best tran-

scription hypotheses shows trends in the direction of improvement (but not to significance,

possibly because the number of parameters exceeded the reranker’s ability to exploit them),

but chapter 4’s machine translation evaluation showed that including additional parse hy-

potheses clearly improved EDPM’s ability to predict the quality of machine translations.

While the parsing-speech work (chapter 3) uses both span information and dependency

information from the parses in comparing parse information for reranking, the SMT eval-

uation work in chapter 4 focuses on dependency (even to the point of putting it in the

name of the Expected Dependency Pair Match metric). By contrast, the work in chap-

ter 5 focuses on a use of constituent structure for an internal component of SMT: word-

alignment. Chapter 5’s research conducts an analysis which demonstrates the tendencies

of a particular class of spans (e.g., those motivated by syntactic constituency) to hold to-

gether in a given alignment. It explores the use of this constituent measure to select an

alignment hypothesis from a pool of alignment candidates. Although the reranking ap-

proach has limited success (because the available candidates are too dissimilar in overall

quality), the coherence measure illuminates some characteristics of quality alignments that

further work on word alignment might pursue. Chapter 5 also discovered a technique for

using these characteristically-coherent spans as a guidance framework for alignment com-

bination, through a guided union of a precision-oriented alignment and a recall-oriented


alignment. Syntactic coherence (using this guided-union approach) was demonstrated to be

useful in improving the AER of the alignments by this technique. The effects of syntactic

constituent coherence are probably even stronger than indicated by these results, since a

qualitative analysis in that chapter identified that a sizable minority of incoherent IP spans

were incoherent due to parse-decoration error.

6.2 Future directions for these applications

Chapters 3, 4, and 5 offer three different approaches to using parse structure to improve

natural language processing. The results presented here suggest future work in applying

parsers to each of these areas of research.

6.2.1 Adaptations for speech processing with parsers

In the domain of parsing speech (chapter 3), it would be valuable to explore the impact

of additional parse-reranking features, especially those more directly focused on speech.

The features extracted in this work were a re-purposing of the feature extraction used for

reranking parses over text; it might be valuable to include features that are more directly

targeting the challenges of speech transcription. For example, explicit edit or disfluency

modeling, as in Johnson and Charniak [2004], or prosodic features, as in Kahn et al. [2005],

might be useful in further improving the reranking available here. Alternatively, including

parse structure from parsers using other structural paradigms (e.g. the English Resource

Grammar [Flickinger, 2002]) would be an alternative valuable knowledge source (further

discussed in section 6.3). Along similar lines, expanding the joint modeling of speech tran-

scription and parsing to include sentence segmentation (as in Harper et al. [2005]) might

be valuable, especially because the evidence presented here points so strongly towards the

need for improved segmentation.

6.2.2 Extension of EDPM similarities to other tasks

In extending EDPM, it would be interesting to consider whether these techniques could be

shared with other tasks that require a sentence similarity measure. EDPM substantially


outperformed BLEU4, an n-gram precision measure, on correlation with human evaluation

of machine-translation quality. In the summarization domain, ROUGE [Lin, 2004] uses n-

grams to serve as an automatic evaluation of summary quality; EDPM’s generalization of

this approach to use expected parse-structure is worth exploring in summarization as well.

Even within machine translation, EDPM’s notion of sentence similarity may be useful

in other ways, for example, in computing distances between training and test data in graph-

based learning approaches for MT (e.g. Alexandrescu and Kirchhoff [2009]).

6.2.3 Extending syntactic coherence

The coherence measures reported in chapter 5 suggest that one may be able to parse source

text alone to identify regions that are translated in relative isolation from one another.

However, coherence of those spans by itself does not indicate that the alignment quality

is good: a key factor is the relative link density (the proportion of links to words), since

high-density alignments seem to under-predict coherence and low-density alignments to

over-predict it. We suggest exploring a revised reranking, including link density as a feature

alongside (possibly weighting) coherence.

Furthermore, the guided-union work showed a welcome success in improving the recall

without greatly damaging precision by using source language (Chinese) parsing. As an

extension, parse structure from the target language (English) could also be used to iden-

tify regions where alignment unions are worth including. Parsing the target side would

require a different parser, trained on the target language, which would identify target spans

(rather than source-side spans) to trust in guiding alignment union. Since the analysis in

section 5.4.4 indicated that some of the incoherent regions could be explained by English

constructions, this approach might be particularly fruitful.

Further work to integrate the notion of span coherence into machine translation align-

ments would be valuable: identifying that a span is likely to be coherent in translation

should offer a criterion for augmenting the search space pruning strategy for good transla-

tions. However, it would be wise to do further analysis of what regions are coherent before

undertaking the substantial effort of incorporating coherence into a translation or alignment


decoder. Such an analysis might incorporate a lexicalized study of coherence extending the

syntactic-span study done in section 5.4.2.

Beyond improving AER, both the alignment-reranking and the guided-union techniques

may show further improvement in alignment quality (or demonstrate a need for adjust-

ment) when dealing with alternative measures of alignment quality. One computationally-

expensive technique would be to use direct translation evaluation: to evaluate the alignment-

selection by training the entire translation model from the generated alignment and evaluate

with a translation quality measure on a held-out set of texts.

6.3 Future challenges for parsing as a decoration on the word sequence

Each of the applications described here was developed with a PCFG-based parser which

produces a simple span-labeling output. The parsers used here were all trained on in-domain

data, with state-of-the-art PCFG engines. As a direction of future work, it is worthwhile

to explore which of these constraints is necessary and which may be improved by trying

alternatives.

6.3.1 Sensitivity to parser

For any of the three applications presented here, one could explore varying the parser. Alternative PCFG-based systems may offer different variation (their N-best lists, for example, may be richer than those of the high-precision systems used here). However, one could go further and explore non-PCFG parsers. Any parser that can generate a dependency tree could be used for EDPM, and any parser that can generate labeled spans could be used in the coherence experiments. The reranking experiments over speech require that the generated parse trees be compatible with the feature extraction, but if one is willing to adjust the feature extraction as well, any parser could be used there too. One approach might be to generate dependency structures directly for EDPM, e.g. by using dependency-parsing strategies like those described in Kubler et al. [2009].
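As a rough illustration of scoring directly from a generic dependency parser's output, the sketch below computes a simple F-measure over head-dependent-label triples. The input format is an assumption, and this is not the exact EDPM definition used in this dissertation (EDPM additionally takes an expectation over the parser's M-best parse hypotheses).

    # Hypothetical sketch: dependency-pair match from a generic dependency parse.
    # A parse is assumed to be (words, arcs), where arcs are (head_index,
    # dependent_index, label) triples and head_index == -1 marks the root.

    from collections import Counter

    def dep_pairs(words, arcs):
        """Bag of (head word, dependent word, label) triples."""
        return Counter(
            (words[h] if h >= 0 else "<root>", words[d], lab) for h, d, lab in arcs
        )

    def dep_pair_f1(hyp, ref):
        """Harmonic mean of precision and recall over matched dependency pairs."""
        hyp_bag, ref_bag = dep_pairs(*hyp), dep_pairs(*ref)
        matched = sum((hyp_bag & ref_bag).values())
        if matched == 0:
            return 0.0
        precision = matched / sum(hyp_bag.values())
        recall = matched / sum(ref_bag.values())
        return 2 * precision * recall / (precision + recall)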

Alternatively, the English Resource Grammar [Flickinger, 2002] produces a rich parse

structure that may be projected to span or dependency structure; recent work (e.g. Miyao

and Tsujii [2008]) has suggested that it may even generate probabilistically weighted tree

structures. Integrating this knowledge-driven parser into these experimental frameworks

(as a replacement or supplemental parse-structure knowledge source) would be a valuable

exploration of the relative merits of these parsers.

6.3.2 Sensitivity of these measures to parser domain

We expect that the training domain is relevant to a parser’s utility for these applications

in ASR and SMT. In the limit, if the parser is trained on the wrong language, most of the

information it offers to these measurement and reranking techniques will be lost. However,

it is not clear how closely dialect, genre, and register must be matched: is it workable to

use a newswire-trained parser in EDPM when comparing translations of a more informal

genre (e.g. weblogs or conversational speech)? For some applications, genre mismatch may have little impact on the parser's usefulness, and for others it may have a substantial one: it would be a useful contribution to explore whether the benefits are retained when the parser's training domain and the application domain are mismatched.

6.3.3 Sensitivity of these measures to parser speed and quality

Parse structure decoration is shown here to be a valuable supplement to large word-sequence-based NLP projects, and this work points to a variety of opportunities for exploring new ways in which such decoration may benefit those projects.

In evaluation, an obstacle to the wider adoption of dependency-based scoring functions such as EDPM (for MT) and SParseval (for ASR) is concern about scoring speed. Systems

that use error-driven training or tuning require fast scoring in order to do multiple runs of

parameter estimation for different system configurations. Using a parser is much slower than

scoring based on word and word n-gram matches alone. This objection invites exploration of how robust dependency-based scoring algorithms remain when a faster (though presumably lower-quality) parser is used in place of the Charniak and Johnson [2005] system; perhaps a direct-to-dependency parser (e.g. one of those described in Kubler et al. [2009]), rather than the PCFG-inspired system used here, would capture similar information at high enough quality to offer the same performance in SMT evaluation.

Speed, of course, would benefit any application of parsing: the use of syntactic coherence as a word-alignment feature in machine translation would also be more appealing if its benefits persisted with a much faster parser.

Parser error can also be a serious problem, as the qualitative study of the Chinese coherence analyses indicated. A different way of approaching sensitivity to parser quality would be to create an array of parsers with known variation in quality (perhaps by using reduced training sets) and to explore the relative merit of each in the tasks presented here.
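One way to set up such an array is sketched below (a hypothetical harness, not part of this work): parsers are trained on nested fractions of a treebank and each is evaluated on the downstream task of interest. Here train_parser and evaluate_task are caller-supplied stand-ins for whatever trainer and task-level metric one is studying.

    # Hypothetical harness for probing sensitivity to parser quality by training
    # on nested subsets of a treebank. train_parser and evaluate_task are
    # caller-supplied callables standing in for a real trainer and task metric.

    def parser_quality_ladder(treebank, train_parser, evaluate_task,
                              fractions=(0.1, 0.25, 0.5, 1.0)):
        results = []
        for frac in fractions:
            subset = treebank[: int(frac * len(treebank))]
            parser = train_parser(subset)
            results.append((frac, evaluate_task(parser)))
        return results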

In general, the experiments presented in this work suggest that parsers provide a useful

knowledge source for natural language processing tasks in several areas. Improving the

parser would, one expects, make that knowledge source more valuable, although it may be that the environments (e.g., the candidates to be reranked) are not sufficiently diverse for the additional knowledge to pay off. In either case, this work stands as a call to continue exploring both parsers and the natural language processing tasks to which they may be applied.

BIBLIOGRAPHY

Y. Al-Onaizan and L. Mangu. Arabic ASR and MT integration for GALE. In Proc. ICASSP,

volume 4, pages 1285–1288, Apr. 2007.

A. Alexandrescu and K. Kirchhoff. Graph-based learning for statistical machine translation.

In Proc. HLT/NAACL, pages 119–127, 2009.

E. Arisoy, M. Saraclar, B. Roark, and I. Shafran. Syntactic and sub-lexical features for

Turkish discriminative language models. In Proc. ICASSP, pages 5538–5541, Mar. 2010.

N. F. Ayan and B. J. Dorr. Going beyond AER: An extensive analysis of word alignments

and their impact on MT. In Proc. ACL, pages 9–16, July 2006.

N. F. Ayan, B. J. Dorr, and C. Monz. NeurAlign: Combining word alignments using neural

networks. In Proc. HLT/EMNLP, pages 65–72, Oct. 2005a.

N. F. Ayan, B. J. Dorr, and C. Monz. Alignment link projection using transformation-based

learning. In Proc. HLT/EMNLP, pages 185–192, Oct. 2005b.

S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved

correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic

Evaluation Measures for MT and/or Summarization, pages 65–72, 2005.

E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. In-

gria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and

T. Strzalkowski. A procedure for quantitatively comparing syntactic coverage of English

grammars. In Proc. 4th DARPA Speech & Natural Lang. Workshop, pages 306–311, 1991.

J. Bresnan. Lexical-functional syntax. Number 16 in Blackwell textbooks in linguistics.

Blackwell, Malden, Mass., 2001.

P. F. Brown, J. Cocke, S. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L.

Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational

Linguistics, 16(2):79–85, 1990.

D. Burkett, J. Blitzer, and D. Klein. Joint parsing and alignment with weakly synchronized

grammars. In Proc. HLT, pages 127–135, June 2010.

A. Cahill, M. Burke, R. O’Donovan, J. van Genabith, and A. Way. Long-distance depen-

dency resolution in automatically acquired wide-coverage PCFG-based LFG approxima-

tions. In Proc. ACL, pages 319–326, 2004.

C. Callison-Burch. Re-evaluating the role of BLEU in machine translation research. In

Proc. EACL, pages 249–256, 2006.

P.-C. Chang, M. Galley, and C. D. Manning. Optimizing Chinese word segmentation for

machine translation performance. In Proceedings of the Third Workshop on Statistical

Machine Translation, pages 224–232, June 2008.

E. Charniak. A maximum-entropy-inspired parser. In Proc. NAACL, pages 132–139, 2000.

E. Charniak. Immediate-head parsing for language models. In Proc. ACL, pages 116–123,

2001.

E. Charniak and M. Johnson. Edit detection and parsing for transcribed speech. In Proc.

NAACL, pages 118–126, 2001.

E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative

reranking. In Proc. ACL, pages 173–180, June 2005. A revised version was downloaded

November 2009 from ftp://ftp.cs.brown.edu/pub/nlparser/.

E. Charniak, K. Knight, and K. Yamada. Syntax-based language models for statistical

machine translation. In MT Summit IX. Intl. Assoc. for Machine Translation., 2003.

C. Chelba and F. Jelinek. Structured language modeling. Computer Speech and Language,

14(4):283–332, October 2000.

C. Cherry. Cohesive phrase-based decoding for statistical machine translation. In Proc.

ACL, pages 72–80, June 2008.

D. Chiang. A hierarchical phrase-based model for statistical machine translation. In Proc.

ACL, pages 263–270, June 2005.

M. Collins. Discriminative reranking for natural language parsing. In Proc. ICML, pages

175–182, 2000.

M. Collins. Head-driven statistical models for natural language parsing. Computational

Linguistics, 29(4):589–638, 2003.

M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computa-

tional Linguistics, 31(1):25–69, 2005.

M. Collins, P. Koehn, and I. Kucerova. Clause restructuring for statistical machine trans-

lation. In Proc. ACL, pages 531–540, June 2005a.

M. Collins, B. Roark, and M. Saraclar. Discriminative syntactic language modeling for

speech recognition. In Proc. ACL, pages 507–514, June 2005b.

M. R. Costa-jussa and J. A. R. Fonollosa. Statistical machine reordering. In Proc. EMNLP,

pages 70–76, July 2006.

C. Culy and S. Z. Riehemann. The limits of n-gram translation evaluation metrics. In

Proceedings of MT Summit IX, 2003.

DARPA. Global Autonomous Language Exploitation (GALE). Mission, http://www.

darpa.mil/ipto/programs/gale/gale.asp, 2008.

J. DeNero and D. Klein. Tailoring word alignments to syntactic machine translation. In

Proc. ACL, pages 17–24, June 2007.

D. Filimonov and M. Harper. A joint language model with fine-grain syntactic tags. In

Proc. EMNLP, pages 1114–1123, Aug. 2009.

D. Flickinger. On building a more efficient grammar by exploiting types. In S. Oepen,

D. Flickinger, J. Tsujii, and H. Uszkoreit, editors, Collaborative Language Engineering,

chapter 1. CSLI Publications, 2002.

V. Fossum, K. Knight, and S. Abney. Using syntax to improve word alignment precision for

syntax-based machine translation. In Proceedings of the Third Workshop on Statistical

Machine Translation, pages 44–52, June 2008.

A. Fraser and D. Marcu. Semi-supervised training for statistical word alignment. In Proc.

ACL, pages 769–776, July 2006.

A. Fraser and D. Marcu. Measuring word alignment quality for statistical machine transla-

tion. Computational Linguistics, 33(3):293–303, Sept. 2007.

M. Galley, M. Hopkins, K. Knight, and D. Marcu. What’s in a translation rule? In D. Marcu, S. Dumais, and S. Roukos, editors, Proc. HLT/NAACL, pages 273–280, May 2004.

M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang, and I. Thayer. Scalable in-

ference and training of context-rich syntactic translation models. In Proc. COLING/ACL,

pages 961–968, July 2006.

D. Gildea. Loosely tree-based alignment for machine translation. In Proc. ACL, pages

80–87, July 2003.

J. J. Godfrey, E. C. Holliman, and J. McDaniel. SWITCHBOARD: Telephone speech corpus

for research and development. In Proc. ICASSP, volume I, pages 517–520, 1992.

J. T. Goodman. A bit of progress in language modeling. Computer Speech and Language,

15:403–434(32), Oct. 2001.

A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. Better word alignments with supervised

ITG models. In Proc. ACL, pages 923–931, Aug. 2009.

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA

data mining software: an update. SIGKDD Explorations Newsletter, 11:10–18, Nov.

2009.

M. Harper and Z. Huang. Chinese Statistical Parsing, chapter in press. DARPA, 2009.

M. Harper, B. Dorr, J. Hale, B. Roark, I. Shafran, M. Lease, Y. Liu, M. Snover, L. Yung,

A. Krasnyanskaya, and R. Stewart. Parsing and spoken structural event detection. Tech-

nical report, Johns Hopkins Summer Workshop Final Report, 2005.

D. Hillard. Automatic Sentence Structure Annotation for Spoken Language Processing. PhD

thesis, University of Washington, 2008.

D. Hillard, M.-Y. Hwang, M. Harper, and M. Ostendorf. Parsing-based objective functions

for speech recognition in translation applications. In Proc. ICASSP, 2008.

L. Huang. Forest reranking: Discriminative parsing with non-local features. In Proc. HLT,

pages 586–594, June 2008.

Z. Huang and M. Harper. Self-training PCFG grammars with latent annotations across

languages. In Proc. EMNLP, pages 832–841, Aug. 2009.

ISIP. Mississippi State transcriptions of SWITCHBOARD, 1997. URL http://www.isip.

msstate.edu/projects/switchboard/.

R. Iyer, M. Ostendorf, and J. R. Rohlicek. Language modeling with sentence-level mixtures.

In Proc. HLT, pages 82–87, 1994.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference

on Knowledge Discovery and Data Mining (KDD), 2006.

M. Johnson and E. Charniak. A tag-based noisy-channel model of speech repairs. In Proc.

ACL, pages 33–39, 2004.

J. G. Kahn. Moving beyond the lexical layer in parsing conversational speech. Master’s

thesis, University of Washington, 2005.

J. G. Kahn, M. Ostendorf, and C. Chelba. Parsing conversational speech using enhanced

segmentation. In Proc. HLT/NAACL, pages 125–128, 2004.

J. G. Kahn, M. Lease, E. Charniak, M. Johnson, and M. Ostendorf. Effective use of prosody

in parsing conversational speech. In Proc. HLT/EMNLP, pages 233–240, 2005.

J. G. Kahn, B. Roark, and M. Ostendorf. Automatic syntactic MT evaluation with expected

dependency pair match. In MetricsMATR: NIST Metrics for Machine Translation Chal-

lenge. NIST, 2008.

J. G. Kahn, M. Snover, and M. Ostendorf. Expected dependency pair match: predicting

translation quality with expected syntactic structure. Machine Translation, 23(2–3):169–

179, 2009.

A. Kannan, M. Ostendorf, and J. R. Rohlicek. Weight estimation for n-best rescoring. In

Proc. of the DARPA workshop on speech and natural language, pages 455–456, Feb. 1992.

M. King. Evaluating natural language processing systems. Communications of the ACM,

39(1):73–79, 1996.

P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proc.

HLT/NAACL, pages 48–54, May–June 2003.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan,

W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses:

Open source toolkit for statistical machine translation. In Proc. ACL, pages 177–180,

June 2007.

S. Kubler, R. McDonald, and J. Nivre. Dependency parsing. Synthesis Lectures on Human

Language Technologies, 2(1):1–127, 2009.

S. Lacoste-Julien, B. Taskar, D. Klein, and M. I. Jordan. Word alignment via quadratic

assignment. In Proc. HLT/NAACL, pages 112–119, June 2006.

L. Lamel, W. Minker, and P. Paroubek. Towards best practice in the development and

evaluation of speech recognition components of a spoken language dialog system. Natural

Language Engineering, 6(3&4):305–322, 2000.

LDC. Multiple translation Chinese corpus, part 2, 2003. Catalog number LDC2003T17.

LDC. Linguistic data annotation specification: Assessment of fluency and adequacy in

translations. http://projects.ldc.upenn.edu/TIDES/Translation/TransAssess04.

pdf, Jan. 2005.

LDC. Multiple translation Chinese corpus, part 4, 2006. Catalog number LDC2006T04.

LDC. GALE phase 2 + retest evaluation references, 2008. Catalog number LDC2008E11.

Z. Li, C. Callison-Burch, C. Dyer, S. Khudanpur, L. Schwartz, W. Thornton, J. Weese, and

O. Zaidan. Joshua: An open source toolkit for parsing-based machine translation. In

Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 135–139,

Mar. 2009.

C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summa-

rization Branches Out: Proc. ACL-04 Workshop, pages 74–81, July 2004.

D. Lin and C. Cherry. Word alignment with cohesion constraint. In Proc. NAACL, pages

49–51, 2003.

D. Liu and D. Gildea. Syntactic features for evaluation of machine translation. In Proc.

ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summa-

rization, pages 25–32, June 2005.

Y. Liu, Q. Liu, and S. Lin. Tree-to-string alignment template for statistical machine trans-

lation. In Proc. COLING/ACL, pages 609–616, July 2006a.

Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. Enriching speech

recognition with sentence boundaries and disfluencies. IEEE Transactions on Speech,

Audio, and Language Processing, 14(5):1526–1540, 2006b.

J. T. Lønning, S. Oepen, D. Beermann, L. Hellan, J. Carroll, H. Dyvik, D. Flickinger, J. B.

Johannessen, P. Meurer, T. Nordgard, V. Rosen, and E. Velldal. LOGON. A Norwegian

MT effort. In Proc. Recent Advances in Scandinavian Machine Translation, 2004.

D. M. Magerman. Statistical decision-tree models for parsing. In Proc. ACL, pages 276–283,

1995.

L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: word er-

ror minimization and other applications of confusion networks. Computer Speech and

Language, pages 373–400, 2000.

D. Marcu, W. Wang, A. Echihabi, and K. Knight. SPMT: Statistical machine translation

with syntactified target language phrases. In Proc. EMNLP, pages 44–52, July 2006.

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus

of English: the Penn treebank. Computational Linguistics, 19(1):313–330, Mar. 1993.

M. Meteer, A. Taylor, R. MacIntyre, and R. Iyer. Dysfluency annotation stylebook for the

switchboard corpus. Technical report, Linguistic Data Consortium (LDC), 1995.

Y. Miyao and J.-i. Tsujii. Feature forest models for probabilistic HPSG parsing. Computa-

tional Linguistics, 34(1):35–80, 2008.

R. C. Moore, W.-t. Yih, and A. Bode. Improved discriminative bilingual word alignment.

In Proc. ACL, pages 513–520, July 2006.

W. Naptali, M. Tsuchiya, and S. Nakagawa. Topic-dependent language model with voting

on noun history. ACM Transactions on Asian Language Information Processing (TALIP),

9(2):1–31, 2010.

NIST. NIST speech recognition scoring toolkit (SCTK). Technical report, NIST, 2005.

URL http://www.nist.gov/speech/tools/.

F. J. Och. Minimum error rate training in statistical machine translation. In Proc. ACL,

pages 160–167, July 2003.

F. J. Och and H. Ney. A systematic comparison of various statistical alignment models.

Computational Linguistics, 29(1):19–51, 2003.

K. Owczarzak, J. van Genabith, and A. Way. Evaluating machine translation with LFG

dependencies. Machine Translation, 21(2):95–119, June 2007a.

K. Owczarzak, J. van Genabith, and A. Way. Labelled dependencies in machine translation

evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation,

pages 104–111, June 2007b.

S. Pado, D. Cer, M. Galley, D. Jurafsky, and C. Manning. Measuring machine transla-

tion quality as semantic equivalence: A metric based on entailment features. Machine

Translation, 23:181–193, 2009.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation

of machine translation. In Proc. ACL, pages 311–318, 2002.

S. Petrov and D. Klein. Improved inference for unlexicalized parsing. In Proc. HLT, pages

404–411, Apr. 2007.

C. J. Pollard and I. A. Sag. Head-driven phrase structure grammar. Studies in contemporary

linguistics. Stanford: CSLI, 1994.

M. Popovic and H. Ney. POS-based word reorderings for statistical machine translation. In

Proc. LREC, pages 1278–1283, May 2006.

C. Quirk, A. Menezes, and C. Cherry. Dependency treelet translation: syntactically in-

formed phrasal SMT. In Proc. ACL, pages 271–279, 2005.

B. Roark. Probabilistic top-down parsing and language modeling. Computational Linguis-

tics, 27(2):249–276, June 2001.

B. Roark, M. Harper, E. Charniak, B. Dorr, M. Johnson, J. G. Kahn, Y. Liu, M. Ostendorf,

J. Hale, A. Krasnyanskaya, M. Lease, I. Shafran, M. Snover, R. Stewart, and L. Yung.

SParseval: Evaluation metrics for parsing speech. In Proc. LREC, 2006.

B. Roark, M. Saraclar, and M. Collins. Discriminative n-gram language modeling. Computer

Speech and Language, 21(2):373–392, Apr. 2007.

L. Shen, A. Sarkar, and F. J. Och. Discriminative reranking for machine translation. In

Proc. HLT/NAACL, pages 177–184, May 2004.

N. Singh-Miller and M. Collins. Trigger-based language modeling using a loss-sensitive

perceptron algorithm. In Proc. ICASSP, volume 4, pages 25–28, 2007.

M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. A study of translation edit

rate with targeted human annotation. In Proc. AMTA, 2006.

M. Snover, N. Madnani, B. Dorr, and R. Schwartz. Fluency, adequacy, or HTER? Exploring

different human judgments with a tunable MT metric. In Proceedings of the Workshop

on Statistical Machine Translation at EACL, Mar. 2009.

A. Stolcke. Modeling linguistic segment and turn-boundaries for n-best rescoring of spon-

taneous speech. In Proc. Eurospeech, volume 5, pages 2779–2782, 1997.

A. Stolcke. SRILM – an extensible language modeling toolkit. In Proc. ICSLP, pages

901–904, 2002.

A. Stolcke and E. Shriberg. Automatic linguistic segmentation of conversational speech. In

Proc. ICSLP, pages 1005–1008, 1996.

A. Stolcke, B. Chen, H. Franco, V. R. R. Gadde, M. Graciarena, M.-Y. Hwang, K. Kirch-

hoff, A. Mandal, N. Morgan, X. Lei, T. Ng, M. Ostendorf, K. Sonmez, A. Venkataraman,

D. Vergyri, W. Wang, J. Zheng, and Q. Zhu. Recent innovations in speech-to-text tran-

scription at SRI-ICSI-UW. Audio, Speech, and Language Processing, IEEE Transactions

on, 14(5):1729–1744, Sept. 2006.

S. Strassel. Simple Metadata Annotation Specification V5.0. Linguistic Data Consortium,

2003. URL http://www.nist.gov/speech/tests/rt/rt2003/fall/docs/SimpleMDE_

V5.0.pdf.

S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation.

In Proc. COLING, pages 836–841, Copenhagen, Denmark, 1996.

M. A. Walker, D. J. Litman, C. A. Kamm, and A. Abella. Evaluating interactive dialogue

systems: extending component evaluation to integrated system evaluation. In Interactive

Spoken Dialog Systems on Bringing Speech and NLP Together in Real Applications, pages

1–8, 1997.

W. Wang and M. P. Harper. The SuperARV language model: Investigating the effectiveness

of tightly integrating multiple knowledge sources. In Proc. EMNLP, pages 238–247, July

2002.

W. Wang, A. Stolcke, and M. P. Harper. The use of a linguistically motivated language

model in conversational speech recognition. In Proc. ICASSP, volume 1, pages 261–264,

2004.

B. Wong and C. Kit. ATEC: automatic evaluation of machine translation via word choice

and word order. Machine Translation, 23:141–155, 2009.

F. Xia and M. McCord. Improving a statistical MT system with automatically learned

rewrite patterns. In Proc. COLING, pages 508–514, 2004.

D. Xiong, Q. Liu, and S. Lin. A dependency treelet string correspondence model for statis-

tical machine translation. In Proceedings of the Second Workshop on Statistical Machine

Translation, pages 40–47, June 2007.

N. Xue, F.-D. Chiou, and M. Palmer. Building a large-scale annotated Chinese corpus. In

Proc. COLING, 2002.

K. Yamada and K. Knight. A syntax-based statistical translation model. In Proc. ACL,

pages 523–530, July 2001.

A. Yeh. More accurate tests for the statistical significance of result differences. In Proc.

COLING, volume 2, pages 947–953, 2000.

Y. Zhang, R. Zens, and H. Ney. Chunk-level reordering of source language sentences

with automatically learned rules for statistical machine translation. In Proc. NAACL-

HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 1–8,

April 2007.

A. Zollmann, A. Venugopal, M. Paulik, and S. Vogel. The syntax augmented MT (SAMT)

system at the shared task for the 2007 ACL workshop on statistical machine translation.

In Proceedings of the Second Workshop on Statistical Machine Translation, pages 216–219,

June 2007.

VITA

Jeremy Gillmor Kahn was born in Atlanta, Georgia and has proceeded widdershins

around the continental United States: Providence, Rhode Island, where he received his AB

in Linguistics from Brown University; Ithaca, New York, where he discovered a career in

speech synthesis; Redmond and Seattle, Washington, where that career extended to include

speech recognition. He entered the University of Washington in Linguistics in 2003, receiving

an MA and (now) a Ph.D.

His counter-clockwise trajectory continues; Jeremy is employed by Wordnik, a Bay Area

computational lexicography company. He has a job where they pay him to think about

words and numbers and how they fit together.

Jeremy lives in San Francisco, California with his wife Dorothy, a dramatherapist. The

two of them spend a lot of time talking about what it means to say what you mean and

what it says to mean what you say.
