
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Evaluating Globally Normalized Transition Based Neural Networks for Multilingual Natural Language Understanding

ANDREA AZZARONE

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Master in Computer Science
Date: June 2, 2017
Supervisor: Lluís Padró
Examiner: Olov Engwall
School of Computer Science and Communication


Abstract

We analyze globally normalized transition-based neural network models for dependency parsing on English, German, Spanish, and Catalan. We compare the results with FreeLing, an open source language analysis tool developed at the UPC natural language processing research group. Furthermore, we study how the mini-batch size, the number of units in the hidden layers, and the beam width affect the performance of the network. Finally, we propose a multilingual parser with parameter sharing and experiment with German and English, obtaining a significant accuracy improvement over the monolingual parsers. These multilingual parsers can be used for low-resource languages, or for applications with low memory requirements where having one model per language is intractable.


Acknowledgement

I would like to show my greatest appreciation to Prof. Lluís Padró, whose comments and suggestions were invaluable throughout the course of my study. Special thanks also go to Prof. Olov Engwall for his extraordinary tolerance and support. Thanks to Canonical Ltd for allowing me to work and study at the same time, and to all my colleagues and managers for their support and for all the teachings; a big hug here goes to Marco. I am deeply grateful to all the friends who shared this amazing journey with me. Thanks to Vicho and Berni (and Kala) for being the best roommates ever. Thanks to my brothers Giovanni and Domenico for being there when needed. Last but not least, I would like to express my gratitude to my parents for their moral support and warm encouragement.


Contents

1 Introduction
  1.1 Background
  1.2 Intended readers
  1.3 This thesis
    1.3.1 Delimitations and Limitations
    1.3.2 Related works
    1.3.3 Ethical Concerns and Sustainability
    1.3.4 Outline of the thesis

2 Dependency Parsing
  2.1 Dependency Graph
  2.2 Dependency Parsers

3 Transition-Based Dependency Parsing
  3.1 The Transition System
    3.1.1 Arc-standard Transition System
    3.1.2 Arc-Eager Transition System
  3.2 The Oracle

4 Neural Network Models for Transition-Based Dependency Parsing
  4.1 Word Embeddings
    4.1.1 Part Of Speech (PoS) and Label Embeddings
  4.2 Greedy Neural Network Model
  4.3 Structured Neural Network Model
  4.4 Recurrent Models

5 SyntaxNet
  5.1 The model
    5.1.1 Input Layer
    5.1.2 Embedding Layer
    5.1.3 Hidden Layers
    5.1.4 Global vs. Local Normalization
  5.2 Training

6 Evaluation of SyntaxNet on Multilingual Tasks
  6.1 Corpora
    6.1.1 EAGLES PoS Tags
  6.2 Model Configuration and Training
    6.2.1 Features
    6.2.2 Training
  6.3 Results

7 Evaluating SyntaxNet Parameters
  7.1 Mini-Batch Size
  7.2 Hidden Layers Sizes
  7.3 Beam Width

8 Building a Bilingual Parser
  8.1 Related Works
  8.2 Model
    8.2.1 Bilingual Word Embeddings
  8.3 Corpora
  8.4 Experiments
  8.5 Results
    8.5.1 Effect of Corpus Repetition
    8.5.2 Effect of Pre-trained Word Embeddings

9 Conclusion
  9.1 Future Work

Bibliography


Abbreviations

ACC Label Accuracy

CRF Conditional Random Field

LAS Labeled Attachment Score

LSTM Long Short-Term Memory

MTL Multi-Task Learning

NLP Natural Language Processing

NLU Natural Language Understanding

PoS Part Of Speech

UAS Unlabeled Attachment Score


Chapter 1

Introduction

1.1 Background

Natural Language Processing (NLP) refers to the use and ability of systems to process sentences in a natural language such as English, rather than in a specialized artificial computer language such as C++. Interest in NLP began in 1950, when Alan Turing published his paper entitled "Computing Machinery and Intelligence", from which the so-called Turing Test emerged. Turing essentially asserted that a computer could be considered intelligent if it could carry on a conversation with a human being without the human realizing they were talking to a machine.

Natural Language Understanding (NLU) is a subtopic of NLP in artificial intelligence that deals with machine reading comprehension. NLU is considered a hard problem. One of the main reasons NLU is so challenging is that human languages show remarkable levels of ambiguity. The grammar of natural languages is ambiguous, and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).

Dependency-based approaches can be used to resolve such ambiguities. In recent years there has been an increasing interest in neural network models for dependency parsing. These models present all the advantages and disadvantages that neural network models usually present: training these types of parsers does not require a lot of expertise, but obtaining state-of-the-art results requires a considerable amount of annotated data.

In March 2016 Google published a globally normalized transition-based neural network model (SyntaxNet) that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results (Andor et al., 2016).

1.2 Intended readers

In this thesis, dependency parsing and neural network models for dependency parsing are the main topics. A reader interested in either of these topics may find the results presented useful for further study. A casual reader will be presented with an introduction to dependency parsing, neural network models for dependency parsing and SyntaxNet, the model that we aim to study and extend.

1.3 This thesis

We aim to understand how well globally normalized transition-based neural networks for dependency parsing perform on four different languages (English, German, Spanish and Catalan), and how the choice of parameters affects their performance.

We also aim to understand whether one can exploit the similarities among two or more languages to train more robust models, by jointly training a dependency parser for two languages at the same time. We also try to understand whether the same methodology can be used for low-resource languages, for which only small annotated corpora are available.

We consider this work worthwhile because, to our knowledge, there is not yet an extensive study of SyntaxNet's parameters. This study can be used by the research community, or by anyone else interested in training a SyntaxNet model, to tweak the parameters and avoid a time-expensive parameter search to improve the accuracies. We also believe that the work on the bilingual parser can be used to obtain competitive accuracies for low-resource languages, or in all the cases where we need to deal with more than one language and cannot afford to waste memory and storage resources. It is indeed clear that having one model per language is more resource-demanding than having a single multilingual model. Resources used not to be a problem when inference was done on powerful systems, but nowadays there is more and more interest in performing inference directly on embedded devices.

The main contributions of this thesis are: an extended study of how SyntaxNet's parameters affect the performance of the model, and a novel multilingual parser based on SyntaxNet and trained on multiple languages.

1.3.1 Delimitations and Limitations

Considering the limited amount of computational resources available, we limit the study of SyntaxNet to four languages: English, German, Spanish and Catalan. We also limit the study of the multilingual parser to two languages, German and English, performing a reduced but still valid training, as explained in more detail in Chapter 8.

1.3.2 Related works

To our knowledge this is the first extended study of SyntaxNet's parameters. Although Andor et al. (2016) extensively introduce SyntaxNet and its background and motivation, the presented results are limited to a small set of parameters.

On the other hand, this is not the first attempt to build a multilingual parser. Zeman and Resnik (2008) use delexicalized parsing, in which a parser is built without any lexical features and trained on a treebank for a resource-rich source language. The delexicalized parser is then applied directly to parse sentences in the target resource-poor languages. This exploits the idea that identical part-of-speech (PoS) tags are highly informative of dependency relations, and that there exist shared dependency structures across languages. Petrov et al. (2016) and Ma and Xia (2014) exploit parallel data as the bridge to transfer constraints from the source resource-rich language to the target resource-poor languages. Täckström et al. (2012) use parallel data to induce cross-lingual word clusters, adding them as features for their delexicalized parser. Durrett et al. (2012) constructed a set of language-independent features and used a bilingual dictionary to transfer these features from the source to the target language. Duong et al. (2015) proposed a model for learning a shared "universal" parser that operates over an interlingual continuous representation of language, along with language-specific mapping components. They also incorporate a bilingual dictionary as a set of soft constraints on the model, so that it learns similar representations for each word and its translation(s).

Our approach is similar to that of Duong et al. (2015), but we use pretrained multilingual word embeddings instead of a bilingual dictionary, and we use SyntaxNet as the baseline model.

1.3.3 Ethical Concerns and Sustainability

Even if artificial intelligence has raised more and more ethical questions in the last few years, we cannot see any ethical concern in this thesis.

1.3.4 Outline of the thesis

From Chapter 2 to Chapter 5 we progressively introduce the reader to Dependency Parsing, Transition-Based Dependency Parsing, Neural Network Models for Dependency Parsing and SyntaxNet.

The last three chapters contain the main contributions of this thesis. In Chapter 6 we describe how we trained several SyntaxNet models for multilingual dependency parsing. In Chapter 7 we study how the choice of SyntaxNet's parameters affects the performance of the models. In Chapter 8 we describe our experiments to jointly build and train bilingual parsers, based on the idea that we can exploit similarities between languages.


Chapter 2

Dependency Parsing

Dependency-based syntactic theories are based on the idea that syntactic structure can be analyzed in terms of binary, asymmetric dependency relations holding between the words of a sentence (Nivre, 2008). Dependency parsing has proved its usefulness in many applications such as Question Answering (Jijkoun et al., 2004; Cui et al., 2005a; Bouma et al., 2006; Cui et al., 2005b), Machine Translation (Chris Quirk, 2006; Ding and Palmer, 2005; Shen et al., 2008), Information Extraction (Ciravegna and Lavelli, 1999; Sætre et al., 2008; Alphonse et al., 2004) and Natural Language Generation (Koller and Striegnitz, 2002), among others.

2.1 Dependency Graph

The syntactic structure of a sentence is modeled by a dependency graph, which represents each word and its syntactic dependents through labeled directed arcs, as exemplified in Figure 2.1 for an English sentence taken from the Penn Treebank (Marcus et al., 1993). An artificial word ROOT is inserted at the beginning of each sentence, serving as the unique root of the graph. This is a standard technique that simplifies both theoretical definitions and computational implementations. More formally:

Definition 2.1.1. Given a set L = {l_1, ..., l_|L|} of dependency labels, a dependency graph for a sentence x = (w_0, w_1, ..., w_n) is a labeled directed graph G = (V, A), where

1. V = {0, 1, ..., n} is a set of nodes,

2. A ⊆ V × L × V is a set of labeled directed arcs.

The set V of nodes (or vertices) is the set of non-negative integers corresponding to the linear positions of the words in the sentence (including ROOT). The set A of arcs (or directed edges) is a set of ordered triples (i, l, j), where i and j are nodes and l is a dependency label. We say that i is the head, l is the dependency type of j, and j is a dependent of i (Nivre, 2008).

Definition 2.1.2. A dependency graph G = (V, A) is well-formed if and only if:

1. The node 0 is a root, that is, there is no node i and label l such that (i, l, 0) ∈ A.

2. Every node has at most one head and one label, that is, if (i, l, j) ∈ A then there is no node i′ and label l′ such that (i′, l′, j) ∈ A and i ≠ i′ or l ≠ l′.


3. The graph G is acyclic, that is, there is no (non-empty) subset of arcs {(i_0, l_1, i_1), (i_1, l_2, i_2), ..., (i_{k−1}, l_k, i_k)} ⊆ A such that i_0 = i_k.

Figure 2.1: Dependency graph for an English sentence from the Penn Treebank.

Any dependency graph satisfying these conditions is a dependency forest; if it is also connected, it is a dependency tree, that is, a directed tree rooted at the node 0. It is worth noting that any dependency forest can be turned into a dependency tree by adding arcs from the node 0 to all other roots.
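As an illustration of Definition 2.1.2 (not part of the thesis tooling), the following minimal Python sketch checks the three well-formedness conditions for a graph given as a set of (head, label, dependent) triples; the function name and the example arcs are ours and purely hypothetical.

```python
def is_well_formed(n_nodes, arcs):
    """Check Definition 2.1.2 for a graph over nodes 0..n_nodes-1.

    arcs: set of (head, label, dependent) triples; node 0 is ROOT.
    """
    # 1. Node 0 is a root: no arc may point to it.
    if any(dep == 0 for _, _, dep in arcs):
        return False
    # 2. Single head: each dependent appears in at most one arc.
    dependents = [dep for _, _, dep in arcs]
    if len(dependents) != len(set(dependents)):
        return False
    # 3. Acyclicity: follow head links upwards from every node.
    head_of = {dep: head for head, _, dep in arcs}
    for node in range(n_nodes):
        seen = set()
        while node in head_of:
            if node in seen:
                return False      # returned to a visited node -> cycle
            seen.add(node)
            node = head_of[node]
    return True

# Example: "ROOT He has good control ." with illustrative arcs.
arcs = {(2, "nsubj", 1), (2, "dobj", 4), (4, "amod", 3),
        (2, "punct", 5), (0, "root", 2)}
print(is_well_formed(6, arcs))  # True: a well-formed dependency tree
```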

2.2 Dependency Parsers

In computational linguistics, a dependency parser is a device that produces a dependency graph G = (V, A) from an input sentence x = w_0, ..., w_n and possibly other inputs, e.g. PoS tags.

Dependency-based syntactic representations have in recent years been used primarily in data-driven models, which learn to produce dependency structures for sentences solely from an annotated corpus. One potential advantage of such models is that they are easily ported to any domain or language in which annotated resources exist.

In the course of our work we made use of two dependency parsers, SyntaxNet and FreeLing. SyntaxNet is an open-source neural network framework implemented in TensorFlow that provides a foundation for NLU systems (Andor et al., 2016). We discuss it in more detail in Chapter 5.

FreeLing is an open source language analysis tool suite created and currently led by Lluís Padró as a means to make available to the community the results of the research carried out at the UPC natural language processing research group (Padró and Stanilovsky, 2012). FreeLing provides both a C++ library and a command-line front-end that can be used to analyze text and obtain output in the desired format (XML, JSON, CoNLL). It offers several language analysis functionalities (morphological analysis, named entity detection, PoS tagging, parsing, Word Sense Disambiguation, Semantic Role Labeling, etc.) for a variety of languages (English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Croatian and Slovene, among others). For dependency parsing FreeLing offers both rule-based and statistical models. We used the statistical one, based on (Carreras, 2007). The statistical dependency parser uses the Treeler¹ machine learning library.

1http://devel.cpl.upc.edu/treeler/


Chapter 3

Transition-Based Dependency Parsing

Transition-based dependency parsers scan an input sentence from left to right and perform a sequence of transition actions to predict its parse tree (Nivre, 2008). Transition-based dependency parsers are usually defined in terms of two components: a transition system and an oracle.

3.1 The Transition System

Definition 3.1.1. A transition system for dependency parsing is a quadruple S = (C, T, c_s, C_t), where

1. C is a set of configurations (or parser states), each of which contains a buffer β of (remaining) nodes and a set A of dependency arcs,

2. T is a set of transitions, each of which is a function t : C → C,

3. c_s is an initialization function, mapping a sentence x = (w_0, w_1, ..., w_n) to a configuration with β = [1, ..., n],

4. C_t ⊆ C is a set of terminal configurations.

A configuration is required to contain at least a buffer β, initially containing the nodes [1, ..., n] corresponding to the real words of a sentence x = (w_0, w_1, ..., w_n), and a set A of dependency arcs, defined on the nodes in V = {0, 1, ..., n}, given some set of dependency labels L. Specific transition systems extend this basic notion of configuration with different data structures, such as stacks and lists. We use the notation β_c and A_c to refer to the values of β and A, respectively, in a configuration c; we also use |β| to refer to the length of β (i.e., the number of nodes in the buffer) and [] to denote an empty buffer.

Definition 3.1.2. Let S = (C, T, c_s, C_t) be a transition system. A transition sequence for a sentence x = (w_0, w_1, ..., w_n) in S is a sequence C_{0:m} = (c_0, c_1, ..., c_m) of configurations, such that

1. c_0 = c_s(x),

2. c_m ∈ C_t,

3. for every i (1 ≤ i ≤ m), c_i = t(c_{i−1}) for some t ∈ T.


Table 3.1: An example of transition-based dependency parsing using the arc-standard rules.

Transition            Stack                      Buffer                    A
                      [ROOT]                     [He has good control .]
SHIFT                 [ROOT He]                  [has good control .]
SHIFT                 [ROOT He has]              [good control .]
LEFT-ARC (nsubj)      [ROOT has]                 [good control .]          A ∪ nsubj(has, He)
SHIFT                 [ROOT has good]            [control .]
SHIFT                 [ROOT has good control]    [.]
LEFT-ARC (amod)       [ROOT has control]         [.]                       A ∪ amod(control, good)
RIGHT-ARC (dobj)      [ROOT has]                 [.]                       A ∪ dobj(has, control)
SHIFT                 [ROOT has .]               []
RIGHT-ARC (punct)     [ROOT has]                 []                        A ∪ punct(has, .)
RIGHT-ARC (root)      [ROOT]                     []                        A ∪ root(ROOT, has)

The parse assigned to x by C_{0:m} is the dependency graph G_{c_m} = ({0, 1, ..., n}, A_{c_m}), where A_{c_m} is the set of dependency arcs in c_m. Starting from the initial configuration for the sentence to be parsed, transitions manipulate β and A (and the other available data structures) until a terminal configuration is reached. Because the node set V is given by the input sentence itself, the set A_{c_m} of dependency arcs in the terminal configuration determines the output dependency graph G_{c_m} = (V, A_{c_m}). We can assume a one-to-one mapping between decision sequences t_{0:j−1} and configurations c_j: that is, we essentially assume that a state encodes the entire history of decisions and that each state can be reached by a unique decision sequence from c_0.

Two of the most commonly used transition systems for dependency parsing are the arc-standard and the arc-eager systems.

3.1.1 Arc-standard Transition System

In the arc-standard system, a configuration c = (s, b, A) consists of a stack s, a buffer b, and a set of dependency arcs A. For an input sentence w_0, ..., w_{n−1}, parsing starts with s = [ROOT], a buffer consisting of the whole input sentence b = [w_0, ..., w_{n−1}], and A = ∅. A configuration c is terminal if the buffer is empty and the stack contains the single node ROOT, and the parse tree is given by A_c. At each step, a transition action is taken to consume the input and construct the output:

• SHIFT: moves the front word b_0 from the buffer onto the stack.

• LEFT-ARC(l): adds an arc s_0 → s_1 with label l and removes s_1 from the stack.

• RIGHT-ARC(l): adds an arc s_1 → s_0 with label l and removes s_0 from the stack.

Table 3.1 illustrates an example of a transition sequence from the initial configuration to a terminal one for the sentence "He has good control".
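For concreteness, a minimal Python sketch of the arc-standard system is given below; it simply replays the gold action sequence of Table 3.1 and is not the SyntaxNet implementation (the helper name and the hard-coded action list are ours).

```python
def parse(words, actions):
    """Replay an arc-standard action sequence; returns the arc set A."""
    stack = [0]                          # node 0 is ROOT
    buffer = list(range(1, len(words)))  # remaining input nodes
    arcs = set()
    for act, label in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":          # s0 -> s1, remove s1
            s0, s1 = stack[-1], stack[-2]
            arcs.add((s0, label, s1))
            del stack[-2]
        elif act == "RIGHT-ARC":         # s1 -> s0, remove s0
            s0, s1 = stack[-1], stack[-2]
            arcs.add((s1, label, s0))
            stack.pop()
    return arcs

words = ["ROOT", "He", "has", "good", "control", "."]
gold = [("SHIFT", None), ("SHIFT", None), ("LEFT-ARC", "nsubj"),
        ("SHIFT", None), ("SHIFT", None), ("LEFT-ARC", "amod"),
        ("RIGHT-ARC", "dobj"), ("SHIFT", None),
        ("RIGHT-ARC", "punct"), ("RIGHT-ARC", "root")]
for head, label, dep in sorted(parse(words, gold)):
    print(f"{label}({words[head]}, {words[dep]})")
```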

3.1.2 Arc-Eager Transition System

In the arc-eager system, a configuration c = (s, b, A) consists of a stack s, a buffer b, and a set of dependency arcs A. For an input sentence w_0, ..., w_{n−1}, parsing starts with s = [ROOT], a buffer consisting of the whole input sentence b = [w_0, ..., w_{n−1}], and A = ∅. A configuration c is terminal if the buffer is empty, and the parse tree is given by A_c. At each step, a transition action is taken to consume the input and construct the output:

• SHIFT: moves the front word b_0 from the buffer onto the stack.

• REDUCE: pops the stack.

• LEFT-ARC(l): adds an arc b_0 → s_0 with label l and removes s_0 from the stack.

• RIGHT-ARC(l): adds an arc s_0 → b_0 with label l and pushes b_0 onto the stack.

The arc-eager system always adds an arc at the earliest possible opportunity, and it therefore builds parts of the tree top-down instead of bottom-up. It does not ensure well-formedness.

3.2 The Oracle

Definition 3.2.1. An oracle for a transition system S = (C, T, c_s, C_t) is a function o : C → T.

Deterministic parsing can be achieved with Algorithm 1. It is easy to see that, provided that there is at least one transition sequence in S for every sentence, such a parser constructs exactly one transition sequence C_{0:m} for a sentence x and returns the parse defined by the terminal configuration c_m, that is, G_{c_m} = ({0, 1, ..., n − 1}, A_{c_m}). Such oracles are called deterministic oracles.

Algorithm 1: Oracle parsing algorithm
Data: x = (w_0, w_1, ..., w_{n−1})
Result: G_c
c ← c_s(x)
while c ∉ C_t do
    c ← [o(c)](c)
end

In data-driven dependency parsing, oracles normally take the form of classifiers (Nivre, 2008). In the last few years oracles have usually been implemented using neural networks.

Over-generalizing, we can classify them as:

• local - e.g. at each step a set of successor states is generated according to the current state and a classifier (such as an SVM) deterministically selects the highest-scored one as the next state, thus behaving as a local classifier.

• structured - the oracle optimizes whole sequences of actions, which correspond to tree structures.


Chapter 4

Neural Network Models for Transition-Based Dependency Parsing

Until a few years ago, the transition-based dependency parsers described in Chapter 3 were implemented using oracles based on sparse, hand-crafted features. Such parsers present several weaknesses:

• From a statistical perspective, they suffer from the use of millions of mainly poorly estimated feature weights.

• They usually rely on a manually designed set of feature templates, which requires a lot of expertise and is usually incomplete.

• The use of many feature templates means that most of the runtime is consumed not by the core parsing algorithm but by the feature extraction step.

Neural network models for transition-based dependency parsing address all of these problems. Chen and Manning (2014) use dense features in place of the sparse indicator features, inspired by the success of distributed word representations in many NLP tasks. Furthermore, a neural network classifier is trained to make the parsing decisions. The neural network learns dense vector representations of words, PoS tags, and dependency labels. Chen and Manning (2014) introduced PoS tag and arc label embeddings, instead of discrete representations, for the first time.

In the next few pages we give a short introduction to word embeddings: what they are and why in the past few years there has been so much interest in them. Furthermore, we briefly describe the greedy and the structured architectures.

4.1 Word Embeddings

First, a note on terminology. Word embedding seems to be the dominant term at the moment, even if in computational linguistics the term distributional semantic model is usually preferred (since the underlying semantic theory is called distributional semantics). There are other alternative terms in use: distributed representation, semantic vector space or simply word space. For consistency, we will stick to current practice and use the term word embedding.


Figure 4.1: Example of encoded semantic relationships in word embeddings.

Literally, word embedding is the set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.

So why the need to introduce word embeddings? Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors of individual raw pixel intensities for image data, or power spectral density coefficients for audio data. For tasks like object or speech recognition we know that all the information required to successfully perform the task is encoded in the data (because humans can perform these tasks from the raw data). However, natural language processing systems traditionally treat words as discrete atomic symbols, i.e. as very large sparse vectors. These encodings are arbitrary and provide no useful information regarding the relationships that may exist between the individual symbols. For example, this means that the model can leverage very little of what it has learned about 'cats' when it is processing data about 'dogs' (such as that they are both animals, four-legged, etc.). Representing words as unique, discrete symbols furthermore leads to data sparsity, and usually means that we may need more data in order to successfully train statistical models. Using vector representations can overcome some of these obstacles.

Even if word embeddings have been used in distributional semantics since the 1990s, their eventual popularization can be attributed to Mikolov et al. (2013) with the introduction of word2vec, a toolkit that allows the training and use of pre-trained embeddings, and which showed that, somewhat surprisingly, the similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector("King") − vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen, as shown in Figure 4.1.
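As a hedged illustration, the analogy can be reproduced with the gensim library and any pre-trained word2vec model; the file path below is a placeholder, not a resource used in this thesis.

```python
from gensim.models import KeyedVectors

# Load pre-trained word2vec vectors (placeholder path, assumed to exist).
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# vector("king") - vector("man") + vector("woman") should be close to "queen".
result = vectors.most_similar(positive=["king", "woman"],
                              negative=["man"], topn=1)
print(result)  # typically [('queen', <cosine similarity>)]
```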

4.1.1 PoS and Label Embeddings

Although the PoS tags and arc labels are relatively small discrete sets, they still exhibit many semantic similarities, like words do. For example, NN (singular noun) should be closer to NNP (proper noun) than to JJ (adjective), and amod (adjective modifier) should be closer to poss (possession modifier) than to dobj (direct object). As first shown by Chen and Manning (2014), the learned dense representations of PoS tags and arc labels correctly carry these semantic similarities. Figure 4.2 presents t-SNE visualizations of these embeddings.

Figure 4.2: t-SNE visualization of PoS and label embeddings. Similar PoS tags and labels are close in the embedding space. Source: Chen and Manning (2014).

4.2 Greedy Neural Network Model

Figure 4.3 shows the neural network architecture introduced by Chen and Manning (2014), which still serves as a baseline for more recent works, including SyntaxNet. As stated above, each word, PoS tag and arc label is represented as a d-dimensional vector. For each type of information (word, PoS or arc label), a set of elements that might be useful for the prediction is chosen based on the stack/buffer positions. A good advantage of this kind of parser is that we can add a rich set of elements cheaply, instead of hand-crafting many more indicator features.

A standard neural network with one hidden layer is built, where the corresponding embeddings of the chosen elements (words, PoS tags and labels) are added to the input layer. The input layer is mapped to the hidden layer through a cube activation function. A softmax layer is finally added on top of the hidden layer to model multi-class probabilities.

Training examples are first generated from the training sentences and their gold parse trees using a deterministic "shortest stack" oracle which always prefers LEFT-ARC over SHIFT. The final training objective is to minimize the cross-entropy loss, plus an l2-regularization term to reduce overfitting. A slight variation is that, in practice, the softmax probabilities are computed only over the feasible transitions.

At parsing time the model performs greedy decoding. At each step, all the corresponding word, PoS tag and label embeddings are extracted from the current configuration, the hidden layer activation is computed, and the transition with the highest score is picked.
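A minimal numpy sketch of this greedy scoring step follows; it mirrors the spirit of Chen and Manning (2014) (concatenated embeddings, cube activation, softmax over feasible transitions), but the dimensions, weights and feature indices are random placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

n_feats, emb_dim, hidden, n_actions = 48, 50, 200, 3   # toy sizes
E = rng.normal(size=(1000, emb_dim))                    # embedding table
W1 = rng.normal(size=(hidden, n_feats * emb_dim)) * 0.01
b1 = np.zeros(hidden)
W2 = rng.normal(size=(n_actions, hidden)) * 0.01

def score_transitions(feature_ids, feasible):
    """Score the transitions (e.g. SHIFT/LEFT-ARC/RIGHT-ARC) for one configuration."""
    x = E[feature_ids].reshape(-1)          # concatenated embeddings
    h = (W1 @ x + b1) ** 3                  # cube activation
    logits = W2 @ h
    logits[~feasible] = -np.inf             # keep only feasible transitions
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

feature_ids = rng.integers(0, 1000, size=n_feats)   # extracted from the configuration
feasible = np.array([True, False, True])             # e.g. stack too small for LEFT-ARC
probs = score_transitions(feature_ids, feasible)
print("chosen transition:", int(np.argmax(probs)))   # greedy decoding picks the argmax
```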


Figure 4.3: Neural network architecture used in Chen and Manning (2014).

4.3 Structured Neural Network Model

The downside of the greedy (or local) parsing described above is error propagation: an incorrect action has a negative impact on the subsequent ones, leading to an incorrect output parse tree. To address this issue, global structural learning can be used.

• In the aspect of decoding, beam search can be applied to improve upon greedy one-best search.

• In the aspect of training, global structural learning can be used to replace local learning on each decision.

Beam search is a restricted version of either a breadth-first search or a best-first search. It is restricted in the sense that non-promising nodes can be pruned at any step in the search (Zhang, 1999). The pruning of non-promising nodes is determined by problem-specific heuristics (Zhang, 1999). The set of most promising, or best alternative, search nodes is called the "beam" (Xu and Fern, 2007). Essentially, beam search is a forward-pruning, heuristic search.

Global structural learning maximizes the likelihood of action sequences instead of individual actions. Given an input x, the goal is to find the highest-scored action sequence globally.

Structured prediction or structured learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than scalar discrete or real values. For example, the problem of translating a natural language sentence into a syntactic representation such as a parse tree can be seen as a structured prediction problem in which the structured output domain is the set of all possible parse trees.

One of the easiest ways to understand algorithms for general structured prediction is the structured perceptron (Collins, 2002). The task is to learn a mapping from inputs x ∈ X to outputs y ∈ Y. We assume:


• Training epochs T.

• Training examples (x_i, y_i) for i = 1 ... n.

• A function GEN which generates candidate predictions for an input x.

• A representation Φ mapping each (x, y) ∈ X × Y to a feature vector Φ(x, y) ∈ R^d.

• A parameter vector w ∈ R^d.

The components GEN, Φ and w define a mapping from an input x to an output F(x) through:

$$F(x) = \operatorname*{arg\,max}_{y \in \mathrm{GEN}(x)} \Phi(x, y) \cdot w$$

where Φ(x, y) · w is the inner product. The learning task is to set the parameter values w using the training examples as evidence. Algorithm 2 shows a way to set the weights w.

Algorithm 2: The structured perceptron learning algorithm
Data: training examples (x_i, y_i)
Result: w
w ← 0
for t ← 1 to T do
    for i ← 1 to N do
        z_i ← arg max_{z ∈ GEN(x_i)} Φ(x_i, z) · w
        if z_i ≠ y_i then
            w ← w + Φ(x_i, y_i) − Φ(x_i, z_i)
        end
    end
end
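The following Python sketch instantiates Algorithm 2 on a toy tagging problem in which GEN enumerates all candidate outputs and Φ is a simple indicator feature map; the problem, vocabulary and features are placeholders chosen only to make the update rule runnable.

```python
import numpy as np

# Toy problem: map a two-token "sentence" to a pair of tags.
TAGS = ["N", "V"]
VOCAB = ["dogs", "bark", "cats", "sleep"]

def gen(x):
    """GEN: all candidate tag sequences for input x."""
    return [(a, b) for a in TAGS for b in TAGS]

def phi(x, y):
    """Phi: indicator features for (word, tag) pairs, as a dense vector."""
    feats = np.zeros(len(VOCAB) * len(TAGS))
    for word, tag in zip(x, y):
        feats[VOCAB.index(word) * len(TAGS) + TAGS.index(tag)] += 1.0
    return feats

train = [(("dogs", "bark"), ("N", "V")), (("cats", "sleep"), ("N", "V"))]

w = np.zeros(len(VOCAB) * len(TAGS))
for epoch in range(5):                                   # T epochs
    for x, y in train:
        z = max(gen(x), key=lambda cand: phi(x, cand) @ w)
        if z != y:                                       # perceptron update
            w += phi(x, y) - phi(x, z)

x = ("dogs", "sleep")
print(max(gen(x), key=lambda cand: phi(x, cand) @ w))    # expected ('N', 'V')
```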

The structured perceptron has been successfully applied to NLP tasks. The most commonly used training algorithm is shown in Algorithm 3. On each training example, beam search (DECODE) is used to decode the sentence until the gold-standard parse tree falls out of the beam. Beam search requires a score function for each parser state. This is usually computed by running a feed-forward network with the state as input (that is, the set of features that describe each parser state). The last layer (the softmax) of the feed-forward network is usually the distinguishing feature: e.g. Zhou et al. (2015) use a contrastive probabilistic learning objective, and Andor et al. (2016) introduce a Conditional Random Field (CRF) objective. The UPDATE function is normally a stochastic gradient step taken on the particular objective.

Weiss et al. (2015) use neither a stochastic gradient step for the "update" function nor a special softmax for decoding; instead, a feed-forward network is first trained as by Chen and Manning (2014), and then a perceptron layer is trained using the standard structured perceptron algorithm, where the representation function Φ is computed by the trained feed-forward network, as shown in Figure 4.4.


Algorithm 3: Training algorithm for structured neural parsing
Data: training examples (X, Y)
Result: θ
initialize θ
for t ← 1 to T do
    for i ← 1 to N do
        beam = ∅
        goldState = null
        terminate = false
        beamGold = true
        while beamGold and not terminate do
            beam = DECODE(beam, x_i, y_i)
            goldState = GOLD_MOVE(goldState, x_i, y_i)
            if not IS_GOLD(beam) then
                beamGold = false
            end
            if ITEMS_COMPLETE(beam) then
                terminate = true
            end
        end
        θ = θ + UPDATE(goldState, beam)
    end
end

Figure 4.4: Schematic overview of Weiss et al. (2015) neural network model.

4.4 Recurrent Models

In the last few years there has been an increasing interest in recurrent neural network models for NLP tasks, moving towards end-to-end models. Variants of Long Short-Term Memory (LSTM) networks have produced impressive results on some of the classical NLP tasks, including PoS tagging and dependency parsing.


Chapter 5

SyntaxNet

In Chapter 4 we introduced the concept of structured neural network models for transition-based parsing, along with the abstract training algorithm. The beam search heuristic requires scoring each parser state. The scoring function can be defined in a number of ways. Following Zhou et al. (2015) and Weiss et al. (2015), Andor et al. (2016) define it via a feed-forward neural network as:

$$p(s, d; \theta) = \Phi(s; \theta^{(l)}) \cdot \theta^{(d)}$$

where:

• θ^{(l)} are the parameters of the neural network, excluding the parameters of the final layer.

• θ^{(d)} are the final-layer parameters for decision d.

• Φ(s; θ^{(l)}) is the representation for state s computed by the neural network under parameters θ^{(l)}.

Note that the score is linear in the parameters θ^{(d)}. The different models in the literature usually share the first layers of the feed-forward neural network but differ in the choice of the last layer, normally a softmax-style one. In the next few subsections we briefly describe the architecture of SyntaxNet, shown in Figure 5.1.

5.1 The model

5.1.1 Input Layer

Given a parse configuration c (consisting of a stack s and a buffer b), a rich set of discrete features F is extracted and fed into the neural network. These features are grouped by their input source: e.g. words, PoS tags, and arc labels. The features extracted for each group are represented as a sparse F × V matrix X, where V is the size of the vocabulary of the feature group and F is the number of features. The value of element X_{fv} is 1 if the f-th feature has value v. For all feature groups, additional special values are added:

• ROOT indicating the POS or word of the root token.

• NULL indicating no valid feature value could be computed.

• UNK indicating an out-of-vocabulary item.


Figure 5.1: Feed-Forward SyntaxNet architecture (source Google).

5.1.2 Embedding Layer

The first learned layer h_0 in the network transforms the sparse features X into an embedded representation. For each feature group X_g, we learn a V_g × D_g embedding matrix E_g that applies the conversion:

$$h_0 = [X_g E_g \mid g \in \{\mathrm{word}, \mathrm{tag}, \ldots\}]$$

The computation is applied separately for each group g and the results are concatenated. Thus, the embedding layer has $E = \sum_g F_g D_g$ outputs, which are reshaped into a vector h_0. The embedding dimensionality D_g can be chosen freely for each group.
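A minimal numpy sketch of this layer follows; the group names, vocabulary sizes, feature counts and dimensions are arbitrary placeholders, not the configuration actually used by SyntaxNet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-group vocabulary sizes V_g, feature counts F_g and dimensions D_g (toy values).
groups = {"word": (40000, 20, 64), "tag": (50, 20, 32), "label": (40, 12, 32)}
E = {g: rng.normal(size=(V, D)) * 0.1 for g, (V, F, D) in groups.items()}

def embedding_layer(feature_ids):
    """feature_ids[g] holds the F_g extracted feature values for group g."""
    parts = []
    for g, (V, F, D) in groups.items():
        ids = feature_ids[g]                 # shape (F_g,)
        parts.append(E[g][ids].reshape(-1))  # X_g E_g, flattened to F_g * D_g values
    return np.concatenate(parts)             # h_0, with length sum_g F_g * D_g

feature_ids = {g: rng.integers(0, V, size=F) for g, (V, F, D) in groups.items()}
h0 = embedding_layer(feature_ids)
print(h0.shape)   # (20*64 + 20*32 + 12*32,) = (2304,)
```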

5.1.3 Hidden Layers

Two hidden layers composed of M rectified linear (ReLU) units are used. This differs from the feed-forward network of Chen and Manning (2014), which used a novel cube activation function. Each unit in the hidden layers is fully connected to the previous layer:

$$h_i = \max\{0,\, W_i h_{i-1} + b_i\}$$

where W_1 is an M_1 × E weight matrix for the first hidden layer and the W_i are M_i × M_{i−1} matrices for all subsequent layers. The terms b_i are biases.

5.1.4 Global vs. Local Normalization

For greedy neural network parsing, the conditional probability distribution over the decision d_j given the context d_{1:j−1} is defined as

$$p(d_j \mid d_{1:j-1}; \theta) = \frac{\exp\, p(d_{1:j-1}, d_j; \theta)}{Z_L(d_{1:j-1}; \theta)} \qquad (5.1)$$


where

$$Z_L(d_{1:j-1}; \theta) = \sum_{d' \in \mathcal{A}(d_{1:j-1})} \exp\, p(d_{1:j-1}, d'; \theta)$$

and $\mathcal{A}(d_{1:j-1})$ is the set of allowed decisions from the state reached after the decisions $d_{1:j-1}$. Each $Z_L(d_{1:j-1}; \theta)$ is a local normalization term. The probability of a sequence of decisions $d_{1:n}$ is

$$p_L(d_{1:n}) = \prod_{j=1}^{n} p(d_j \mid d_{1:j-1}; \theta) = \frac{\exp \sum_{j=1}^{n} p(d_{1:j-1}, d_j; \theta)}{\prod_{j=1}^{n} Z_L(d_{1:j-1}; \theta)} \qquad (5.2)$$

Beam search can be used to attempt to find the maximum of Equation 5.2 with respect to $d_{1:n}$. The additive scores used in beam search are the log-softmax of each decision, $\ln p(d_j \mid d_{1:j-1}; \theta)$, not the raw scores $p(d_{1:j-1}, d_j; \theta)$.

In contrast, a CRF defines a distribution $p_G(d_{1:n})$ as follows:

$$p_G(d_{1:n}) = \frac{\exp \sum_{j=1}^{n} p(d_{1:j-1}, d_j; \theta)}{Z_G(\theta)} \qquad (5.3)$$

where

$$Z_G(\theta) = \sum_{d'_{1:n} \in \mathcal{D}_n} \exp \sum_{j=1}^{n} p(d'_{1:j-1}, d'_j; \theta)$$

and $\mathcal{D}_n$ is the set of all valid sequences of decisions of length n. $Z_G(\theta)$ is a global normalization term. The inference problem is to find:

$$\operatorname*{arg\,max}_{d_{1:n} \in \mathcal{D}_n} p_G(d_{1:n}) = \operatorname*{arg\,max}_{d_{1:n} \in \mathcal{D}_n} \sum_{j=1}^{n} p(d_{1:j-1}, d_j; \theta)$$

Beam search is used to approximately find the argmax.
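To make Equations 5.2 and 5.3 concrete, the following numpy sketch computes both distributions on a toy decision space, using random numbers as stand-ins for the network scores p(d_{1:j−1}, d_j; θ); it is only an illustration, not part of SyntaxNet.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_steps, n_decisions = 3, 2

# Toy per-step scores standing in for the network scores p(d_1:j-1, d_j; theta).
scores = {}
for j in range(n_steps):
    for prefix in product(range(n_decisions), repeat=j):
        scores[(j, prefix)] = rng.normal(size=n_decisions)

def sequence_score(seq):
    return sum(scores[(j, seq[:j])][seq[j]] for j in range(n_steps))

all_seqs = list(product(range(n_decisions), repeat=n_steps))

def p_local(seq):
    """Equation 5.2: a softmax at every step, multiplied along the sequence."""
    logp = 0.0
    for j in range(n_steps):
        step = scores[(j, seq[:j])]
        logp += step[seq[j]] - np.log(np.exp(step).sum())
    return float(np.exp(logp))

Z_G = sum(np.exp(sequence_score(s)) for s in all_seqs)

def p_global(seq):
    """Equation 5.3: a single softmax over whole decision sequences."""
    return float(np.exp(sequence_score(seq)) / Z_G)

for s in all_seqs:
    print(s, round(p_local(s), 3), round(p_global(s), 3))
# Both distributions sum to 1, but they generally rank sequences differently.
```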

5.2 Training

Training data consists of inputs x paired with gold decision sequences $d^*_{1:n}$. Averaged stochastic gradient descent with momentum is used on the negative log-likelihood of the data under the model to update the parameters. Under a locally normalized model, the negative log-likelihood is

$$L_{local}(d^*_{1:n}; \theta) = -\ln p_L(d^*_{1:n}; \theta) = -\sum_{j=1}^{n} p(d^*_{1:j-1}, d^*_j; \theta) + \sum_{j=1}^{n} \ln Z_L(d^*_{1:j-1}; \theta) \qquad (5.4)$$

while under a globally normalized model it is:

$$L_{global}(d^*_{1:n}; \theta) = -\ln p_G(d^*_{1:n}; \theta) = -\sum_{j=1}^{n} p(d^*_{1:j-1}, d^*_j; \theta) + \ln Z_G(\theta) \qquad (5.5)$$


Considering that $Z_G(\theta)$ is in many cases intractable, beam search and early updates are used. As in Algorithm 3, if the gold path falls out of the beam at step j, a stochastic gradient step is taken with the following objective:

$$L_{global\text{-}beam}(d^*_{1:j}; \theta) = -\sum_{i=1}^{j} p(d^*_{1:i-1}, d^*_i; \theta) + \ln \sum_{d'_{1:j} \in \mathcal{B}_j} \exp \sum_{i=1}^{j} p(d'_{1:i-1}, d'_i; \theta)$$

Here the set $\mathcal{B}_j$ contains all paths in the beam at step j. If the gold path remains in the beam at the end of decoding, a gradient step is performed using $\mathcal{B}_n$, the beam at the end of decoding.
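A minimal sketch of this early-update objective follows, assuming the per-step scores of the gold prefix and of the beam paths have already been computed by the network; the numbers are made up for illustration.

```python
import numpy as np

def global_beam_loss(gold_step_scores, beam_step_scores):
    """L_global-beam at the step j where the gold path left the beam.

    gold_step_scores: scores p(d*_1:i-1, d*_i; theta) for i = 1..j
    beam_step_scores: one per-step score list for each path d'_1:j in the beam B_j
    """
    gold = np.sum(gold_step_scores)
    beam_totals = np.array([np.sum(path) for path in beam_step_scores])
    # log-sum-exp over the beam, computed stably
    m = beam_totals.max()
    log_z = m + np.log(np.exp(beam_totals - m).sum())
    return -gold + log_z

gold = [1.2, 0.4, -0.3]                                  # gold prefix fell out at j = 3
beam = [[1.5, 0.9, 0.2], [1.1, 0.8, 0.5], [0.7, 1.3, 0.1]]
print(global_beam_loss(gold, beam))
```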

While directly optimizing the global model defined by Equation 5.5 works well, it has been observed that training the model in two steps achieves the same precision much faster: the network is first pretrained using the local objective given in Equation 5.4, and then additional training steps are performed using the global objective given in Equation 5.5.


Chapter 6

Evaluation of SyntaxNet on Multilingual Tasks

We evaluated SyntaxNet on four different languages (Catalan, German, English and Spanish) and compared our models with FreeLing. Even if SyntaxNet can be used both as a PoS tagger and as a dependency parser, in our work we focused on the latter. Please note that here we trained and evaluated SyntaxNet for each language separately. In Chapter 8 we describe our experiments to train the model jointly for two languages.

6.1 Corpora

To train and evaluate the models we used a modified version of the CoNLL datasets provided by the Polytechnic University of Catalonia (UPC). The corpora provided by UPC included only training and tuning datasets. In order to properly evaluate the trained models we randomly extracted a set of sentences from the training sets to be used during testing. The exact statistics are shown in Table 6.1.

Table 6.1: Number of documents per dataset.

           # training    # tuning    # test
English    37945         1334        1334
German     31591         1828        1828
Spanish    12651         1654        1654
Catalan    11414         1709        1709

6.1.1 EAGLES PoS Tags

The corpora provided by UPC do not use the Universal PoS tags. Instead, they use PoS tags based on the proposals by EAGLES. EAGLES intends to be able to encode all existing morphological features for most European languages.

EAGLES PoS tags consist of variable-length labels where each character corresponds to a morphological feature. The first character in the tag is always the category (PoS). The category determines the length of the tag and the interpretation of each character in the label.

For instance, for the category noun we could have the definition:

Position    Attribute    Values
0           category     N:noun
1           type         C:common; P:proper
2           gen          F:f; M:m; C:c
3           num          S:s; P:p; N:n

This allows PoS tags such as NCFP (standing for noun/common/feminine/plural). Features that are not applicable or underspecified are set to 0 (zero); e.g. NC00 stands for noun/common/underspecified-gender/underspecified-number. The interpretation of a character at a certain position of a tag depends on the PoS (indicated by the first character) and on the target language. E.g. for languages where nouns can have additional features (case, etc.) the tag definition would include one additional position.
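As an illustration only, the following Python sketch decodes noun tags according to the example definition above; it covers just the four positions shown and reports the raw attribute letters, not the full EAGLES specification.

```python
# Positions of the example noun definition only (illustrative, not the
# complete EAGLES tag set): 0 = category, 1 = type, 2 = gender, 3 = number.
NOUN_POSITIONS = ["category", "type", "gender", "number"]

def decode_noun_tag(tag):
    """Decode an EAGLES noun tag such as NCFP or NC00."""
    features = {}
    for name, char in zip(NOUN_POSITIONS, tag):
        features[name] = "underspecified" if char == "0" else char
    return features

print(decode_noun_tag("NCFP"))  # noun / common / feminine / plural
print(decode_noun_tag("NC00"))  # gender and number underspecified
```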

Even if SyntaxNet is flexible enough to accept different tag sets, the choice of tag set can noticeably affect the performance of the system. For more information we refer the reader to the FreeLing documentation¹.

6.2 Model Configuration and Training

6.2.1 Features

Our model configuration is basically the same as the one proposed by Andor et al. (2016), shown in Figure 5.1. We used the arc-standard transition system and experimented with two different sets of features. In the first configuration we extracted words (X^w), part-of-speech tags (X^t), and dependency arcs and labels (X^l) in the surrounding context of the state, as shown in Table 6.2. In the second configuration we also added morphological features (X^m) when available. We used the following embedding sizes: 64 for words and 32 for all the other features.

SyntaxNet uses a flexible markup language to define feature functions:

• word (w) - Feature function to get the word from the focus token.

• tag (t) - Feature function to get the PoS tag from the focus token.

• label (l) - Feature function for computing the label from the focus token. Note that this does not use the precomputed values, since we get the labels from the stack.

• morphology-set (m) - Feature function to get the list of morphological features from the focus token.

1https://talp-upc.gitbooks.io/freeling-user-manual/content/tagsets.html


Table 6.2: Feature templates in the first and second configuration. The X^m row is used only in the second configuration.

X^w: i.w i(1).w i(2).w i(3).w st.w st(1).w st(2).w st(3).w st.rc(1).w st.rc(1).ls(1).w st.lc(1).w st.lc(1).rs(1).w st(1).rc(1).w st(1).rc(1).ls(1).w st(1).lc(1).w st(1).lc(1).rs(1).w st.rc(2).w st.lc(2).w st(1).rc(2).w st(1).lc(2).w

X^t: i.t i(1).t i(2).t i(3).t st.t st(1).t st(2).t st(3).t st.rc(1).t st.rc(1).ls(1).t st.lc(1).t st.lc(1).rs(1).t st(1).rc(1).t st(1).rc(1).ls(1).t st(1).lc(1).t st(1).lc(1).rs(1).t st.rc(2).t st.lc(2).t st(1).rc(2).t st(1).lc(2).t

X^l: st.rc(1).l st.rc(1).ls(1).l st.lc(1).l st.lc(1).rs(1).l st(1).rc(1).l st(1).rc(1).ls(1).l st(1).lc(1).l st(1).lc(1).rs(1).l st.rc(2).l st.lc(2).l st(1).rc(2).l st(1).lc(2).l

X^m: i.m i(1).m i(2).m i(3).m st.m st(1).m st(2).m st(3).m st.rc(1).m st.rc(1).ls(1).m st.lc(1).m st.lc(1).rs(1).m st(1).rc(1).m st(1).rc(1).ls(1).m st(1).lc(1).m st(1).lc(1).rs(1).m st.rc(2).m st.lc(2).m st(1).rc(2).m st(1).lc(2).m

And feature locators (a short sketch of how templates compose them follows the list):

• input (i) - Parser feature locator for accessing the remaining input tokens in the parser state. It takes the offset relative to the current input token as argument. Negative values represent tokens to the left, positive values tokens to the right, and 0 (the default argument value) represents the current input token.

• stack (st) - Parser feature locator for accessing the stack in the parser state. The argument represents the position on the stack, 0 being the top of the stack. Note that "stack" here means "words we have already tagged."

• left child (lc) - Parser feature locator for locating left children of the focus token. The argument specifies the number of times the leftmost-child operation is applied.

• right child (rc) - Parser feature locator for locating right children of the focus token. The argument specifies the number of times the rightmost-child operation is applied.

• left sibling (ls) - Parser feature locator for locating left siblings of the focus token. The argument specifies the sibling position relative to the focus token.

• right sibling (rs) - Parser feature locator for locating right siblings of the focus token. The argument specifies the sibling position relative to the focus token.
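To make this notation concrete, the following Python sketch splits a template string such as st(1).rc(1).ls(1).w into its locator chain and final feature function; it only mirrors the syntax described above and is not SyntaxNet code.

```python
import re

def parse_template(template):
    """Split e.g. 'st(1).rc(1).ls(1).w' into locators and the feature function."""
    parts = template.split(".")
    locators = []
    for part in parts[:-1]:
        m = re.fullmatch(r"([a-z]+)(?:\((-?\d+)\))?", part)
        name, arg = m.group(1), int(m.group(2) or 0)
        locators.append((name, arg))
    return locators, parts[-1]

print(parse_template("st(1).rc(1).ls(1).w"))
# ([('st', 1), ('rc', 1), ('ls', 1)], 'w')
# read as: 2nd stack token -> its rightmost child -> that child's left sibling -> word
print(parse_template("i(2).t"))
# ([('i', 2)], 't'): the input token two positions to the right -> PoS tag
```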

6.2.2 Training

We trained the model in two steps in order to speed up the training process (Andor et al., 2016): we first pretrained the network using the local objective given in Equation 5.4, and then performed additional training steps using the global objective given in Equation 5.5. We pretrained all layers (including the embeddings) except the softmax in this way. Averaged stochastic gradient descent with momentum was used.


Unless otherwise specified, we used an initial learning rate of 0.07/0.02 (greedy/structured), a momentum of 0.85/0.9 and a decay rate corresponding to about one tenth of the training corpus size. All the models were trained with early stopping for a maximum of 12/20 epochs. We did not feed pre-trained word embeddings; instead, we initialized them from a normal distribution and learned them during training.

6.3 Results

Table 6.3 shows our final parsing test-set results and a comparison with FreeLing. The FreeLing models were trained on the exact same corpora used for our models. We report the Unlabeled Attachment Score (UAS), the Labeled Attachment Score (LAS) and the Label Accuracy (ACC), calculated using the CoNLL-2007 Shared Task scorer:

• LAS - percentage of tokens for which a system has predicted the correct HEAD and LABEL.

• UAS - percentage of tokens for which a system has predicted the correct HEAD.

• ACC - percentage of tokens for which a system has predicted the correct LABEL.

The best results are highlighted column by column in bold. Our results significantly out-perform FreeLing for all the four languages under studying except the LAS for English.

As expected the model with best accuracies is the one with the larger beam width (β=64)with few exceptions. But if we consider that a small increase in beam width can cause a hugeincrease in training and decoding time, models with smaller beam width still perform goodwith an accuracy drop on average between 0.5% and 1.0%.

The last rows of Table 6.3 show the accuracy scores for the two configurations using mor-phological features too. We observe a drop in all the scores. This is likely due to the fact thatEAGLES tags already include rich morphological features and the duplication of informationjust cause unneeded noise.

In Chapter 5 we introduced Et and El as the dense representations of all PoS tags and arc labels. Chen and Manning (2014) showed that these embeddings carry some semantic information. We investigated whether the same holds for our trained models. Figure 6.1 presents a t-SNE visualization of the Et embedding, with the PoS tags colored by category. It clearly shows that this embedding effectively captures the similarities between PoS tags.


Table 6.3: Final dependency parsing test-set results. β stands for beam width; M indicates whether morphology features are used. All the models use two hidden layers of 1024 units each, a batch size of 64, a learning rate of 0.07/0.02 (greedy/structured), a momentum of 0.85/0.9 and a decay rate that corresponds to about one tenth of the training corpus size. All the models were trained with early stopping for a maximum of 12/20 epochs, without pre-trained embeddings.

    Method                Catalan               Spanish               English               German
                          UAS    LAS    ACC     UAS    LAS    ACC     UAS    LAS    ACC     UAS    LAS    ACC

    FreeLing              90.15  87.29  93.44   90.31  87.34  93.03   90.48  89.61  92.27   84.95  82.81  91.60

    SyntaxNet β=8         91.60  88.47  93.66   91.54  88.51  93.56   91.26  88.46  92.76   88.98  86.89  94.06
    SyntaxNet β=16        92.01  88.90  93.93   92.42  89.40  93.89   91.81  89.09  93.15   89.12  87.05  94.13
    SyntaxNet β=32        92.09  89.09  94.00   92.23  89.30  93.88   91.95  89.24  93.26   89.49  87.45  94.30
    SyntaxNet β=64        92.11  89.14  94.08   92.40  89.50  93.99   92.07  89.37  93.31   89.73  87.82  94.57

    SyntaxNet β=16 M=1    91.96  88.78  93.78   92.32  89.04  93.41   -      -      -       -      -      -
    SyntaxNet β=32 M=1    91.99  88.95  93.99   92.41  89.45  93.91   -      -      -       -      -      -

Figure 6.1: t-SNE visualization of the PoS tag embeddings trained with our models. The tags are colored by category; tags that are similar in meaning are also close in the embedding space.


Chapter 7

Evaluating SyntaxNet Parameters

We performed several experiments to study how each parameter affects the behavior of the model; we present the results here. This study does not aim to provide precise numbers and statistics, but rather to show how the choice of a parameter can affect model performance.

7.1 Mini-Batch Size

The batch size defines the number of samples that are propagated through the network. The two main things to consider when optimizing the mini-batch size are the time efficiency of training and the noisiness of the gradient estimate. Regarding time efficiency, larger mini-batches are very attractive computationally: in addition to easy parallelism across processors and machines, they can give much better throughput, for example by making use of efficient data reuse during back-propagation and by wasting less time in data-communication round-trips. Regarding the noisiness of the gradient estimate, when we compute the gradient of a mini-batch we are really approximating the gradient of the entire training set. Obviously, the gradient of a mini-batch of 10 samples is going to be a lot noisier than that of a mini-batch of 1000 samples, which means that we will not necessarily be moving down the error function in the direction of steepest descent.
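The noise argument can be made concrete with a small numpy experiment, unrelated to SyntaxNet, that compares mini-batch gradient estimates of different sizes against the full-batch gradient on a toy least-squares problem.

    # Toy illustration (numpy only, unrelated to SyntaxNet) of how mini-batch size
    # affects the noisiness of the gradient estimate on a least-squares problem.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 20))
    w_true = rng.normal(size=20)
    y = X @ w_true + rng.normal(scale=0.5, size=10000)
    w = np.zeros(20)                      # point at which gradients are compared

    def grad(idx):
        # gradient of the mean squared error restricted to the subset `idx`
        Xi, yi = X[idx], y[idx]
        return 2 * Xi.T @ (Xi @ w - yi) / len(idx)

    full = grad(np.arange(len(X)))        # "true" full-batch gradient
    for batch_size in (10, 100, 1000):
        devs = [np.linalg.norm(grad(rng.choice(len(X), batch_size, replace=False)) - full)
                for _ in range(50)]
        print(batch_size, "->", round(float(np.mean(devs)), 2))
    # Smaller mini-batches give noisier (larger-deviation) gradient estimates.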

Parsing performance using neural networks is highly sensitive to the training batch size. Zhou et al. (2015) showed that a larger batch size improves the accuracy of their model: the accuracy on their development data improves from 85% to 91% when the batch size is increased from 10 to 100,000.

We performed several experiments where we kept all the parameters fixed except the batch size. We made sure to use the same random seeds so that differences in performance were not due to parameter initialization. In the choice of batch sizes we were limited by the available computational resources. An example can be seen in Figure 7.1. From these first two examples we can see that our models tend to perform significantly better with smaller mini-batches. Table 7.1 and Figure 7.2 confirm these results.

Figure 7.1: Parsing performance (test UAS) with different training mini-batch sizes (64 to 1024) for two models, ca 64x64 B=8 and es 1024x1024 B=8. We used the same training epochs, 12/20 (greedy/structured). As can be seen, increasing the mini-batch size results in worse accuracies.

We think that smaller mini-batches give better accuracies because the noisiness caused by small mini-batches is not always bad. Suppose, for example, that our error function has many little valleys. If we used the entire training set to compute each gradient, our model would get stuck in the first valley it fell into (since it would register a gradient of 0 at that point). If we use smaller mini-batches, on the other hand, we get more noise in our estimate of the gradient, and this noise might be enough to push us out of some of the shallow valleys in the error function.

Figure 7.2: UAS results for several configurations with different batch sizes (x-axis: batch size; y-axis: UAS). Only configurations for which we have results for all batch sizes are included.

Table 7.1: Final statistics for the batch size parameter. This uses all our experimented configurations, whether they are complete or not.

    Batch size    ∆ UAS
    64 → 128      −0.36 ± 0.03
    128 → 256     −0.61 ± 0.06
    256 → 512     −0.60 ± 0.07
    512 → 1024    −1.86 ± 0.37

It can be argued that, for a fixed number of epochs, training with smaller mini-batches implies that the neural network parameters are updated more often, hence the higher accuracy. To control for this, we increased the maximum number of training epochs linearly with the size of the mini-batch.


Figure 7.3: Tuning-set UAS as a function of training time (in seconds) for the same model (Catalan 64x64 greedy) with mini-batch sizes from 64 to 1024: (a) with the same number of epochs, and (b) with a larger maximum number of training epochs for larger mini-batches (early stopping is still applied).

Figures 7.3a and 7.3b show the UAS on the tuning set for the configuration Catalan 64x64 B=8 in the greedy training step with different batch sizes and/or numbers of training epochs. Table 7.2 contains the results for the structured training step. As we can see, smaller mini-batches still outperform bigger ones: increasing the maximum number of epochs reduces the gap but does not close it.

Table 7.2: Tuning UAS results when increasing the maximum number of epochs for configurations with bigger mini-batches. As we can see, increasing the number of epochs helps, but configurations with smaller batches still outperform the others.

    Hidden Layers   Beam Size   Batch Size   Epochs Greedy   Epochs Structured   Tuning UAS

    64x64           8           64           12              20                  90.85

    64x64           8           128          12              20                  90.24
    64x64           8           128          24              40                  90.28

    64x64           8           256          12              20                  89.41
    64x64           8           256          48              80                  89.89

    64x64           8           512          12              20                  88.21
    64x64           8           512          96              160                 89.94

    64x64           8           1024         12              20                  84.28
    64x64           8           1024         192             320                 88.87

7.2 Hidden Layers Sizes

We analyzed the parsing performance with different numbers of units in the two hidden layers of the feed-forward network of SyntaxNet (please refer to Figure 5.1). Intuitively we expected larger networks to generalize better than smaller ones, while knowing that using too many neurons can result in several problems, such as overfitting. Overfitting occurs when a model is excessively complex, for example when it has too many parameters relative to the number of observations.

Similarly to the mini-batch experiments, we trained several models keeping all the parameters fixed except the hidden layer sizes. Figure 7.4 shows the test-set UAS results for three model configurations. Table 7.3 shows the final statistics over all the experimented configurations. As expected, larger networks tend to perform better than smaller ones, but the accuracy gain becomes insignificant if we increase the sizes too much (512×512 → 1024×1024). The risk of overfitting, measured as the difference between training UAS and test UAS, is also higher, as shown in Table 7.3.

Figure 7.4: Parsing performance (test UAS) with different hidden layer sizes (64x64 up to 1024x1024). The left plot shows three particular configurations (de Ba=256 B=8, ca Ba=256 B=16, es Ba=128 B=16); the right plot visualizes the trend over all trained models. Ba stands for mini-batch size, B for beam size.

Table 7.3: Final statistics for hidden layer sizes.

    Hidden layers change     ∆ UAS
    64x64 → 128x128          0.44 ± 0.13
    128x128 → 256x256        0.08 ± 0.02
    256x256 → 512x512        0.20 ± 0.03
    512x512 → 1024x1024      0.06 ± 0.02

    Hidden layers    Avg Overfitting
    64x64            4.00 ± 0.31
    128x128          4.34 ± 0.31
    256x256          4.58 ± 0.18
    512x512          6.30 ± 0.25

7.3 Beam Width

Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost. However, it only stores a predetermined number β of best states at each level (called the beam width); only those states are expanded next. The greater the beam width, the fewer states are pruned. With an infinite beam width, no states are pruned and beam search is identical to breadth-first search. The beam width bounds the memory required to perform the search. Since a goal state could potentially be pruned, beam search sacrifices completeness.
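A generic beam search over scored transition sequences can be sketched as follows. This is our own simplified illustration; SyntaxNet's beam operates on full parser states with globally normalized scores.

    # Generic beam-search sketch: keep only the beta best partial sequences per step.
    # Illustrative only; the scores and states below are arbitrary toy values.
    import heapq

    def beam_search(initial_state, expand, is_final, beta, max_steps=100):
        """expand(state) -> list of (successor_state, score_increment)."""
        beam = [(0.0, initial_state)]
        for _ in range(max_steps):
            if all(is_final(s) for _, s in beam):
                break
            candidates = []
            for score, state in beam:
                if is_final(state):
                    candidates.append((score, state))
                    continue
                for nxt, inc in expand(state):
                    candidates.append((score + inc, nxt))
            # pruning step: keep only the beta highest-scoring states
            beam = heapq.nlargest(beta, candidates, key=lambda c: c[0])
        return max(beam, key=lambda c: c[0])

    # Tiny example: build a 3-symbol sequence, preferring symbol "b" at every step.
    expand = lambda seq: [(seq + (sym,), {"a": 0.1, "b": 0.5}[sym]) for sym in "ab"]
    print(beam_search((), expand, is_final=lambda seq: len(seq) == 3, beta=2))
    # -> (1.5, ('b', 'b', 'b'))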

As above, we trained several configurations keeping all the parameters fixed except the beam width. Figure 7.5 shows the UAS results for three model configurations. Table 7.4 and Figure 7.6 present the final results for all our configurations. As expected, a larger beam width induces a higher accuracy.

Figure 7.5: Parsing performance (test UAS) with different beam widths (β = 8, 16, 32, 64) for three models (de 1024x1024 Ba=64, ca 64x64 Ba=128, es 1024x1024 Ba=64). Ba stands for mini-batch size, β for beam width. Increasing the beam width results in better accuracies.

Figure 7.6: UAS results for several configurations with different beam widths (x-axis: beam width; y-axis: UAS). Most of the time, increasing the beam width results in better accuracies.

Table 7.4: Final statistics for the beam width parameter over all our experiments.

    Beam width    ∆ UAS
    8 → 16        0.28 ± 0.04
    16 → 32       0.16 ± 0.02
    32 → 64       0.11 ± 0.01


As we can see, there is a noticeable gain in accuracy when increasing the beam width. However, considering that even a small increment in the beam width can cause a huge amplification in training and decoding times, smaller beam widths are preferable for everyday use.


Chapter 8

Building a Bilingual Parser

Multi-Task Learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models compared to training the models separately. MTL works because the regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. Although MTL has been used successfully in several fields, most of the available work on natural language parsing systems focuses on systems that operate on a single language to generate the parsed output.

We applied MTL to SyntaxNet. The underlying hypothesis is that there are shared structures across languages that we can exploit (e.g. Catalan and Spanish, German and English). We jointly trained SyntaxNet on English and German.

8.1 Related Work

This is not the first attempt to build a multilingual parser. Zeman and Resnik (2008) used delexicalized parsing, in which a parser is built without any lexical features and trained on a treebank for a resource-rich source language; the delexicalized parser is then applied directly to parse sentences in the resource-poor target languages. This exploits the idea that identical part-of-speech (POS) tags are highly informative of dependency relations, and that there exist shared dependency structures across languages. Petrov et al. (2016) and Ma and Xia (2014) exploit parallel data as the bridge to transfer constraints from the resource-rich source language to the resource-poor target languages. Täckström et al. (2012) use parallel data to induce cross-lingual word clusters, adding them as features for their delexicalized parser. Durrett et al. (2012) constructed a set of language-independent features and used a bilingual dictionary to transfer these features from the source to the target language. Duong et al. (2015) proposed a model for learning a shared "universal" parser that operates over an interlingual continuous representation of language, along with language-specific mapping components. They also incorporate a bilingual dictionary as a set of soft constraints on the model, such that it learns similar representations for each word and its translation(s).


8.2 Model

Our main contribution is the application of the method proposed by Duong et al. (2015) on top of SyntaxNet. Unlike Duong et al. (2015), we do not use a bilingual dictionary as a set of soft constraints on the model, but rather pretrained multilingual word embeddings. To our knowledge this is the first attempt to do so.

The architecture is illustrated in Figure 8.1. We call the component introduced by Duong et al. (2015) a language multiplexer. The idea is to allow the embedding matrices to differ in order to accommodate language-specific features: different languages can have different lexicons, parts of speech can exhibit different roles, and dependency edges can serve different functions. In a nutshell, the language multiplexer selects the language-specific embedding matrix when forwarding or back-propagating values. The multiplexer is completely optional: for example, if training two or more languages with the same part-of-speech tag set, we can use the same embedding matrix Et for all the languages. During training, sentences are sampled from both languages to form the mini-batches.

Figure 8.1: Feed-Forward multi-lingual SyntaxNet architecture with language multiplexer.

Please note that even though we focus the discussion here on a bilingual parser, the same idea (and the same code) can be used for multilingual parsers.
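A minimal sketch of the multiplexer idea is given below, assuming one embedding matrix per language stored in a dictionary and plain numpy lookups; the names and shapes are hypothetical and the real implementation lives inside the TensorFlow graph.

    # Minimal sketch of a "language multiplexer": route each lookup to the
    # language-specific embedding matrix. All names and shapes are hypothetical.
    import numpy as np

    class EmbeddingMultiplexer:
        def __init__(self, vocab_sizes, dim, shared=False, seed=0):
            rng = np.random.default_rng(seed)
            if shared:
                # e.g. a single matrix when all languages use the same tag set
                matrix = rng.normal(scale=0.1, size=(max(vocab_sizes.values()), dim))
                self.tables = {lang: matrix for lang in vocab_sizes}
            else:
                self.tables = {lang: rng.normal(scale=0.1, size=(n, dim))
                               for lang, n in vocab_sizes.items()}

        def lookup(self, language, token_ids):
            # forward pass: select the matrix of the sentence's language
            return self.tables[language][token_ids]

    mux = EmbeddingMultiplexer({"en": 40000, "de": 95000}, dim=64)
    en_batch = mux.lookup("en", np.array([3, 17, 42]))   # uses the English matrix
    de_batch = mux.lookup("de", np.array([5, 9]))        # uses the German matrix
    print(en_batch.shape, de_batch.shape)                # (3, 64) (2, 64)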

8.2.1 Bilingual Word Embeddings

As already illustrated in Chapter 4, distributed word representations have been key to the recent success of many neural network models in addressing various NLP tasks. In Chapter 4 we focused on monolingual word embeddings; however, the existence of a wide variety of multilingual NLP tasks has motivated recent work on training bilingual representations, where similar-meaning words in two languages are embedded close together in the same high-dimensional space. Figure 8.2 shows an example of a bilingual word embedding between German and English. Several different approaches to training bilingual word embeddings can be used:

• Bilingual Mapping - Word representations are first trained on each language independently and a mapping is then learned to transform representations from one language into the other.

• Monolingual Adaptation - The idea is to bootstrap the learning of target representations from well-trained embeddings of a source language, usually a resource-rich one like English, with a bilingual constraint that makes sure embeddings of semantically similar words across languages are close together.

• Bilingual Training - Unlike the previous schemes, which fix pretrained representations on either one or both sides, this approach attempts to jointly learn representations from scratch.

Figure 8.2: A shared embedding space between two languages (Luong et al., 2015)

For the purpose of this thesis we used bilingual word embeddings trained with the same recipe used by Luong et al. (2015). We trained our joint model on the parallel Europarl-v7 corpus between German and English (Koehn, 2005). We pre-processed the corpus by lowercasing and tokenizing the sentences and mapping each digit to 0, i.e. 2017 becomes 0000. Rare words occurring fewer than 5 times were mapped to <unk>. The resulting vocabularies are of size 40K for English and 95K for German. We used the following settings: stochastic gradient descent with a default learning rate of 0.025, negative sampling with 30 samples, skip-gram with a context window of size 5, and a sub-sampling rate of 1e−4. The model was trained for 10 epochs and the learning rate was decayed to 0 by the end of training. We set the hyper-parameters to 1 for α and 4 for β in our experiments. We used unsupervised alignment information learned by the Berkeley aligner (Liang et al., 2006). For more information please refer to Luong et al. (2015).
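The pre-processing recipe just described can be sketched as follows; this is our own reconstruction of the steps (lowercasing, digit mapping, rare-word replacement), not the exact scripts used for the Europarl data.

    # Sketch of the corpus pre-processing described above: lowercase, map digits to 0,
    # and replace words occurring fewer than 5 times with <unk>. Our reconstruction.
    import re
    from collections import Counter

    def normalize(token):
        return re.sub(r"\d", "0", token.lower())   # "2017" -> "0000"

    def preprocess(sentences, min_count=5):
        sentences = [[normalize(tok) for tok in sent] for sent in sentences]
        counts = Counter(tok for sent in sentences for tok in sent)
        return [[tok if counts[tok] >= min_count else "<unk>" for tok in sent]
                for sent in sentences]

    corpus = [["In", "2017", ",", "prices", "rose"]] * 5 + [["Rarely", "seen"]]
    print(preprocess(corpus)[0])    # ['in', '0000', ',', 'prices', 'rose']
    print(preprocess(corpus)[-1])   # ['<unk>', '<unk>']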


8.3 Corpora

To train and evaluate the performance of our bilingual models we used the German and English corpora for the CoNLL-2017 shared task.¹ Because the test sets were not yet available, we extracted them from the training sets. The corpus statistics are shown in Table 8.1.

Table 8.1: Corpus statistics for bilingual parsing. The statistics on the number of words (lowercased or not) refer only to the training set.

                # training   # tuning   # test
    English     11743        2002       800
    German      13318        799        800

                # words   # lc words
    English     18091     15262
    German      45726     44107

8.4 Experiments

For training and decoding we used the set of features described in Table 8.2:

• Lowercased words (F^lw) - We used lowercased words to improve the hit rate when pretrained word embeddings are used. In the monolingual models we used words without lowercasing them.

• Universal Tags (F^t) - These are the universal part-of-speech tags, shared between all the languages.

• Morphology Features (F^m) - It is well known that morphological information is very important for parsing morphologically rich languages. These features are shared between the two languages and should compensate for the use of universal part-of-speech tags instead of language-specific tags.

• Labels (F^l) - Universal dependency relations.

• Language of the sentence under focus.

In order to reduce the training time, we performed just the greedy step, with two hidden layers of 1024 × 1024 units, a batch size of 64, an initial learning rate of 0.07, a momentum of 0.85 and a decay rate that corresponds to about one tenth of the training corpus size. All the models were trained with early stopping for a maximum of 12 epochs. We experimented both with pretrained multilingual word embeddings and with normal (random) initialization.

We aim to understand whether the bilingual models outperform the monolingual ones and whether this approach can be used for low-resource languages with small or no annotated treebanks. To do so we trained our models with different ratios of the German corpus, e.g. 10%, 20%, etc., with and without repetition of German sentences to guarantee a similar number of English and German sentences.
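The repetition scheme can be sketched as below: the sub-sampled German corpus is repeated until it roughly matches the English one before the two are mixed into a single pool from which mini-batches are drawn. This is our own illustration; function and variable names are hypothetical.

    # Sketch of mixing two corpora with repetition so that mini-batches contain a
    # similar number of sentences per language. Illustrative only.
    import random

    def mixed_pool(en_sents, de_sents, de_ratio=0.2, repeat=True, seed=0):
        rng = random.Random(seed)
        de_subset = de_sents[: int(len(de_sents) * de_ratio)]   # e.g. 20% of German
        if repeat and de_subset:
            # repeat the smaller corpus until it roughly matches the English one
            reps = max(1, round(len(en_sents) / len(de_subset)))
            de_subset = de_subset * reps
        pool = [("en", s) for s in en_sents] + [("de", s) for s in de_subset]
        rng.shuffle(pool)
        return pool

    en = [f"en_sent_{i}" for i in range(11743)]
    de = [f"de_sent_{i}" for i in range(13318)]
    pool = mixed_pool(en, de, de_ratio=0.2)
    print(sum(lang == "de" for lang, _ in pool), "German vs",
          sum(lang == "en" for lang, _ in pool), "English sentences in the pool")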

¹ http://universaldependencies.org/conll17/


Table 8.2: Feature templates for our bilingual parser.

    Feature   Templates

    F^lw      i.lw i(1).lw i(2).lw i(3).lw st.lw st(1).lw st(2).lw st(3).lw st.rc(1).lw
              st.rc(1).ls(1).lw st.lc(1).lw st.lc(1).rs(1).lw st(1).rc(1).lw
              st(1).rc(1).ls(1).lw st(1).lc(1).lw st(1).lc(1).rs(1).lw st.rc(2).lw st.lc(2).lw
              st(1).rc(2).lw st(1).lc(2).lw

    F^t       i.t i(1).t i(2).t i(3).t st.t st(1).t st(2).t st(3).t st.rc(1).t st.rc(1).ls(1).t
              st.lc(1).t st.lc(1).rs(1).t st(1).rc(1).t st(1).rc(1).ls(1).t st(1).lc(1).t
              st(1).lc(1).rs(1).t st.rc(2).t st.lc(2).t st(1).rc(2).t st(1).lc(2).t

    F^m       i.m i(1).m i(2).m i(3).m st.m st(1).m st(2).m st(3).m st.rc(1).m
              st.rc(1).ls(1).m st.lc(1).m st.lc(1).rs(1).m st(1).rc(1).m st(1).rc(1).ls(1).m
              st(1).lc(1).m st(1).lc(1).rs(1).m st.rc(2).m st.lc(2).m st(1).rc(2).m
              st(1).lc(2).m

    F^l       st.rc(1).l st.rc(1).ls(1).l st.lc(1).l st.lc(1).rs(1).l st(1).rc(1).l
              st(1).rc(1).ls(1).l st(1).lc(1).l st(1).lc(1).rs(1).l st.rc(2).l st.lc(2).l
              st(1).rc(2).l st(1).lc(2).l

8.5 Results

Table 8.3, Figure 8.3, and Figure 8.4 show the final results for both our monolingual and bilingual models with different German corpus ratios. As we can see, at least when enforcing a similar number of German and English sentences, our bilingual model significantly outperforms our monolingual models.

Even when no German corpus at all is used to train it, the bilingual model gets close to 50% UAS on German.

Table 8.3: English and German UAS for our monolingual and bilingual models with different German corpus ratios. Words that are similar in meaning, regardless of the language, are close in the embedding space.

    German corpus ratio      0%             10%            20%            30%            40%            50%
    Method                   en     de      en     de      en     de      en     de      en     de      en     de

    mono                     -      -       -      66.85   -      78.26   -      80.83   -      82.64   -      83.33
    bi                       87.6   45      87.5   68.74   87.1   76.11   86.86  81.55   87.37  82.82   87.48  83.9
    bi w/ rep                -      -       87.64  72.38   87.32  79.11   87.96  82.86   87.01  83.58   87.05  84.8

    mono w/ emb              -      -       -      66.85   -      78.13   -      80.69   -      82.70   -      83.12
    bi w/ emb                88.11  47.68   88.16  69.53   87.50  77.51   87.08  81.90   88.06  83.50   88.17  84.08
    bi w/ emb and rep        -      -       86.83  72.30   87.82  78.76   87.41  82.38   87.18  84.12   86.93  85.1

    German corpus ratio      60%            70%            80%            90%            100%
    Method                   en     de      en     de      en     de      en     de      en     de

    mono                     -      84.38   -      84.71   -      84.21   -      84.22   -      84.8
    bi                       86.92  84.63   87.35  85.14   87.6   86.4    87.22  86.21   87.08  85.49
    bi w/ rep                86.83  85.88   85.91  83.81   -      -       -      -       -      -

    mono w/ emb              -      84.84   -      84.71   -      83.84   -      84.22   -      84.80
    bi w/ emb                87.9   85.34   88.07  85.48   87.48  85.96   87.76  86.56   87.71  85.68
    bi w/ emb and rep        87.91  85.8    87.69  86.11   -      -       -      -       -      -


Figure 8.3: On the left, German UAS results for our bilingual models (with and without repeated training sentences) and our monolingual models, for different ratios of the German corpus used in training. On the right, the English UAS.

Figure 8.4: Same as Figure 8.3 but using pretrained word embeddings.

8.5.1 Effect of Corpus Repetition

We wondered whether repeating sentences, so that each mini-batch always contains a similar number of English and German sentences, helps to increase the accuracy. As we can see in Table 8.4, at least for small German corpus ratios there is a substantial German UAS gain when using repetition. The gain tends to decrease when the number of German sentences is already sufficient. Enforcing a similar number of English and German sentences also has a significant effect on the English UAS, as shown in Figures 8.3 and 8.4.

8.5.2 Effect of Pre-trained Word Embeddings

We wondered whether using pretrained word embeddings helps to increase the accuracy. As we can see in Table 8.5, without enforcing a similar number of sentences for the two corpora, using pretrained embeddings usually brings an increase in both German and English accuracies (for the English UAS please refer to Table 8.3). When enforcing a similar number of sentences for the two corpora, this is not always true.

Table 8.4: Effect of enforcing a similar number of English and German sentences for different ratios of the German corpus, with and without pretrained word embeddings. The reported accuracy is the German UAS.

    German corpus ratio          10%     20%     30%     40%     50%     60%     70%
    bilingual                    68.74   76.11   81.55   82.82   83.9    84.63   85.14
    bilingual w/ rep             72.38   79.11   82.86   83.58   84.8    85.88   83.81

    bilingual w/ emb             69.53   77.51   81.90   83.50   84.08   85.34   85.48
    bilingual w/ emb and rep     72.30   78.76   82.38   84.12   85.1    85.80   86.11

Table 8.5: Effect of using pretrained word embeddings, with and without corpus repetition. The reported accuracy is the German UAS.

    German corpus ratio      0%      10%     20%     30%     40%     50%     60%     70%     80%     90%     100%
    bilingual                45      68.74   76.11   81.55   82.82   83.9    84.63   85.14   86.4    86.21   85.49
    bilingual w/ emb         47.68   69.53   77.51   81.90   83.50   84.08   85.34   85.48   85.96   86.56   85.68

    German corpus ratio          10%     20%     30%     40%     50%     60%     70%
    bilingual w/ rep             72.38   79.11   82.86   83.58   84.08   85.88   83.81
    bilingual w/ rep and emb     72.30   78.76   82.38   84.12   85.10   85.80   86.11


Chapter 9

Conclusion

We introduced the reader to SyntaxNet, going through the concepts of dependency parsing, transition-based dependency parsing and neural network models for transition-based dependency parsing. We then trained SyntaxNet for four different languages and compared the accuracies with FreeLing, showing a significant gain in accuracy. We then experimented with several configurations, tweaking the mini-batch size, the number of units in the hidden layers and the beam width. Somewhat unexpectedly, we showed that small mini-batches tend to perform better, even though similar studies on similar networks showed the exact opposite. This has big implications for everyone training a SyntaxNet model: training with larger mini-batches reduces the training time, but it will noticeably hurt the accuracy of the system.

Last but not least, we built a bilingual parser, jointly training a model for English and German. The idea is that we can exploit similarities between languages. We performed similar experiments with different ratios of the German corpus, in order to understand whether our bilingual parser can be used to improve dependency parsing models for languages with small or non-existent treebanks. We experimented with bilingual word embeddings and with forcing a similar number of English and German sentences inside a mini-batch, showing that both techniques significantly help to increase the accuracy of the trained models. The results on the multilingual parser can be extremely useful in several applications. Whenever we have to deal with low-resource languages, a multilingual model can be effectively used to get competitive results. Considering also the increasing interest in performing inference directly on embedded devices rather than on powerful servers, this work can be used to build models that deal with more than one language while requiring a limited amount of memory and storage resources.

9.1 Future Work

For future work we aim to understand how bilingual models help with languages that have even greater similarities, e.g. Catalan and Spanish, and with languages with low similarities. It is also interesting to understand how the model behaves when training with three or more languages.

We also acknowledge the limitations of the pretrained word embeddings we used. Some recent work has explored the idea of using Gaussian (or mixture-of-Gaussians) word embeddings. These have some nice properties, e.g. they can deal with words that have more than one meaning.


Bibliography

Erick Alphonse, Sophie Aubin, Philippe Bessières, Gilles Bisson, Thierry Hamon, Sandrine Lagarrigue, Adeline Nazarenko, Alain-Pierre Manine, Claire Nédellec, Mohamed Ould Abdel Vetah, Thierry Poibeau, and Davy Weissenbacher. 2004. Event-based information extraction for the biomedical domain: The Caderige project. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, JNLPBA '04, pages 43–49, Stroudsburg, PA, USA. Association for Computational Linguistics.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.

Gosse Bouma, Jori Mur, Gertjan van Noord, Lonneke van der Plas, and Jörg Tiedemann. 2006. Question Answering for Dutch Using Dependency Relations, pages 370–379. Springer Berlin Heidelberg, Berlin, Heidelberg. ISBN 978-3-540-45700-8.

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 957–961, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Simon Corston-Oliver and Chris Quirk. 2006. The impact of parse quality on syntactically-informed statistical machine translation. In Proceedings of EMNLP 2006. ACL/SIGPARSE.

Fabio Ciravegna and Alberto Lavelli. 1999. Full text parsing using cascades of rules: An information extraction perspective. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, EACL '99, pages 102–109, Stroudsburg, PA, USA. Association for Computational Linguistics.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 1–8. Association for Computational Linguistics.

Hang Cui, Keya Li, Renxu Sun, Tat-Seng Chua, and Min-Yen Kan. 2005a. National University of Singapore at the TREC 13 question answering main task. In Proceedings of the 13th TREC.


Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005b. Question answering passage retrieval using dependency relations. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pages 400–407, New York, NY, USA. ACM. ISBN 1-59593-034-5.

Yuan Ding and Martha Palmer. 2005. Machine translation using probabilistic synchronous dependency insertion grammars. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 541–548, Stroudsburg, PA, USA. Association for Computational Linguistics.

Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. A neural network model for low-resource universal dependency parsing. In EMNLP, pages 339–348. Citeseer.

Greg Durrett, Adam Pauls, and Dan Klein. 2012. Syntactic transfer using a bilingual lexicon. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1–11. Association for Computational Linguistics.

Valentin Jijkoun, Maarten De Rijke, and Jori Mur. 2004. Information extraction for question answering: Improving recall through syntactic patterns. In Coling 2004, pages 1284–1290.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

Alexander Koller and Kristina Striegnitz. 2002. Generation as dependency parsing. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 17–24, Stroudsburg, PA, USA. Association for Computational Linguistics.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159.

Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In ACL (1), pages 1337–1348.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist., 19(2):313–330. ISSN 0891-2017.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Comput. Linguist., 34(4):513–553. ISSN 0891-2017.


Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May 2012. ELRA.

S. Petrov, R. McDonald, and K. Hall. 2016. Multi-source transfer of delexicalized dependency parsers. US Patent 9,305,544.

R. Sætre, K. Sagae, and J. Tsujii. 2008. Syntactic features for protein-protein interaction extraction. In C. J. O. Baker and J. Su, editors, Short Paper Proceedings of the 2nd International Symposium on Languages in Biology and Medicine (LBM 2007), ISSN 1613-0073, pages 6.1–6.14, Singapore, January 2008. CEUR Workshop Proceedings (CEUR-WS.org).

Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A new string-to-dependency machine translation algorithm with a target dependency language model. In Proceedings of ACL, pages 577–585.

Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 477–487. Association for Computational Linguistics.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. arXiv preprint arXiv:1506.06158.

Yuehua Xu and Alan Fern. 2007. On learning linear ranking functions for beam search. In Proceedings of the 24th International Conference on Machine Learning, pages 1047–1054. ACM.

Daniel Zeman and Philip Resnik. 2008. Cross-language parser adaptation between related languages. In IJCNLP, pages 35–42.

Weixiong Zhang. 1999. State-Space Search: Algorithms, Complexity, Extensions, and Applications. Springer Science & Business Media.

Hao Zhou, Yue Zhang, Shujian Huang, and Jiajun Chen. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In ACL (1), pages 1213–1222.
