LEARNING WORD TRANSLATIONS


Transcript of LEARNING WORD TRANSLATIONS

Page 1: LEARNING WORD TRANSLATIONS

LEARNING WORD TRANSLATIONS

Does syntactic context fare better than positional context?

NCLT/CNGL Internal Workshop
Ankit Kumar Srivastava
24 July 2008

NON-PARALLEL CORPORA

Page 2: LEARNING WORD TRANSLATIONS

July 24, 2008 Lexicon Extraction ~ Ankit 2

Learning a Translation Lexicon from non-Parallel Corpora

Motivation

Methodology

Implementation

Experiments

Conclusion

Master’s Project at the University of Washington, Seattle, USA (June 2008)

Page 3: LEARNING WORD TRANSLATIONS


{ lexicon }

Word-to-word mapping between two languages

An invaluable resource in multilingual applications such as CLIR, CL resources, CALL, etc.

MOTIVATION

Wahl:
  election   0.85
  ballot     0.10
  option     0.02
  selection  0.02
  choice     0.01

Sheridan & Ballerini 1996; McCarley 1999

Yarowsky & Ngai 2001; Cucerzan & Yarowsky 2002

Nerbonne et al. 1997

Page 4: LEARNING WORD TRANSLATIONS


{ corpora }

Parallel, comparable, and non-comparable text
More monolingual text than bitext
5 dimensions of non-parallelness
Most statistical clues no longer applicable

MOTIVATION

Dimensions of non-parallelness (from the slide figure): time period, topic domain, author, language

Page 5: LEARNING WORD TRANSLATIONS


{ task }

Given any two pieces of text in any two languages…

…Can we extract word translations?

MOTIVATION

Page 6: LEARNING WORD TRANSLATIONS


{ insight }

If two words are mutual translations, then their more frequent collocates (context window) are likely to be mutual translations as well.

Counting co-occurrences within a window of size N is less precise than counting co-occurrences within local syntactic contexts [Harris 1985].

Two types of context window: positional (window size 4) and syntactic (head, dependent)

METHODOLOGY

Page 7: LEARNING WORD TRANSLATIONS


{ context }

METHODOLOGY

Example: Vinken will join the board as a nonexecutive director Nov 29 .

POSITIONAL: the context is the words within a window of size 4 around the target word (highlighted in the slide figure)

SYNTACTIC: the context is the target word's head and dependents (highlighted in the slide figure)
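The two context types can be sketched in Python. This is an illustrative sketch, not the author's code: the window size follows the slide (4), and the toy dependency arcs below are assumptions, not parser output.

```python
# Sketch: positional vs. syntactic context for one token.
# Window size and the toy (head, dependent) arcs are assumptions.

def positional_context(tokens, i, size=4):
    """Words within `size` positions of tokens[i], excluding the token itself."""
    lo, hi = max(0, i - size), min(len(tokens), i + size + 1)
    return [t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i]

def syntactic_context(arcs, word):
    """Heads and dependents of `word` from a list of (head, dependent) arcs."""
    return [h for h, d in arcs if d == word] + [d for h, d in arcs if h == word]

sent = "Vinken will join the board as a nonexecutive director Nov 29 .".split()
print(positional_context(sent, sent.index("join")))

# toy arcs: "join" heads its subject, object, and auxiliary
arcs = [("join", "Vinken"), ("join", "board"), ("join", "will")]
print(syntactic_context(arcs, "join"))
```

The positional window sweeps in function words like "the" and "as", while the syntactic context keeps only grammatically related words, which is the intuition behind the comparison in this talk.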

Page 8: LEARNING WORD TRANSLATIONS


{ algorithm }

1. For each unknown word in the source and target languages, define the context in which that word occurs.
2. Using an initial seed lexicon, translate as many source context words as possible into the target language.
3. Use a similarity metric to compute the translation of each unknown source word: it is the target word with the most similar context.

ME

TH

OD

OL

OG

Y

Koehn & Knight 2002

Rapp 1995, 1999; Fung & Yee 1998; Otero & Campos 2005
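The three-step procedure above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the tiny seed lexicon, and the toy vectors are all made up, and cosine similarity stands in for whichever metric is used.

```python
# Sketch of the translation-prediction loop: project a source context
# vector into target dimensions via the seed lexicon, then pick the
# target word whose context vector is most similar.

def translate_vector(vec, seed):
    """Map a source context vector into target dimensions using the seed lexicon."""
    return {seed[w]: v for w, v in vec.items() if w in seed}

def cosine(a, b):
    num = sum(a[k] * b[k] for k in set(a) & set(b))
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def best_translation(src_vec, tgt_vecs, seed, sim):
    projected = translate_vector(src_vec, seed)
    # rank candidate target words by context similarity, highest first
    ranked = sorted(tgt_vecs, key=lambda t: sim(projected, tgt_vecs[t]), reverse=True)
    return ranked[0] if ranked else None

seed = {"wahl": "election"}                       # toy German -> English seed
src_vecs = {"stimme": {"wahl": 0.9}}              # toy source context vectors
tgt_vecs = {"vote": {"election": 0.8},
            "table": {"election": 0.0}}           # toy target context vectors
print(best_translation(src_vecs["stimme"], tgt_vecs, seed, cosine))  # -> vote
```

Because both vector spaces share the seed dimensions after projection, any vector-similarity metric can be plugged in as `sim`.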

Page 9: LEARNING WORD TRANSLATIONS


{ system }

IMPLEMENTATION

System pipeline (from the slide figure): CORPUS → (1) CORPUS CLEANING → (2) PCFG PARSING

Page 10: LEARNING WORD TRANSLATIONS


{ pre-process }

EXPERIMENTS

           ENGLISH                      GERMAN
DATA       Wall Street Journal (WSJ)    Deutsche Presse-Agentur (DPA)
YEARS      1990, 1991, and 1992         1995 and 1996
COVERAGE   446 days of news text        530 days of news text

(1) Raw text corpora: as in the table above.

(2) Phrase structures: Stanford Parser (lexicalized PCFG) for English and German [Klein & Manning 2003], http://nlp.stanford.edu/software/lex-parser.shtml

Page 11: LEARNING WORD TRANSLATIONS


{ system }

IMPLEMENTATION

System pipeline (from the slide figure): CORPUS → (1) CORPUS CLEANING → (2) PCFG PARSING → (3) PS-TO-DS CONVERSION → (4) DATASETS

Page 12: LEARNING WORD TRANSLATIONS


{ pre-process }

EXPERIMENTS

         ENGLISH                GERMAN
TEXT     1,521,998 sentences    808,146 sentences
TOKENS   36,251,168 words       14,311,788 words
TYPES    276,402 words          388,291 words

(3) Dependency structures: a head percolation table [Magerman 1995; Collins 1997] was used to extract head-dependent relations from each parse tree.

(4) Data sets:
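The head-percolation idea can be sketched as follows: each nonterminal picks a head child by a per-category priority list, and the head word propagates up from the leaves. The rules below are toy stand-ins, not Magerman's actual table.

```python
# Illustrative head percolation over a tiny constituency tree.
# HEAD_RULES is a toy priority table, not the real one.

HEAD_RULES = {"S": ["VP", "NP"], "VP": ["VB", "VBD", "NP"], "NP": ["NN", "NNP"]}

def head_word(tree):
    """tree = (label, children) for nonterminals, (tag, word) for leaves."""
    label, kids = tree
    if isinstance(kids, str):            # leaf node: (POS tag, word)
        return kids
    for cat in HEAD_RULES.get(label, []):  # scan priority list for a head child
        for child in kids:
            if child[0] == cat:
                return head_word(child)
    return head_word(kids[0])            # fallback: leftmost child

tree = ("S", [("NP", [("NNP", "Vinken")]),
              ("VP", [("VBD", "joined"), ("NP", [("NN", "board")])])])
print(head_word(tree))  # -> joined
```

Once each constituent has a head, every non-head child contributes a head-dependent word pair, which is exactly the syntactic context used elsewhere in this talk.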

Page 13: LEARNING WORD TRANSLATIONS


{ system }

IMPLEMENTATION

System pipeline (from the slide figure): CORPUS → (1) CORPUS CLEANING → (2) PCFG PARSING → (3) PS-TO-DS CONVERSION → (4) DATASETS (RAW TEXT, PARSED TEXT) → (5) CONTEXT GENERATOR (with SEED LEXICON) → POS VECTORS, SYN VECTORS

Page 14: LEARNING WORD TRANSLATIONS


{ vector }

Seed lexicon obtained from a dictionary, identically spelled words, and spelling transformation rules.
Context vectors have dimension values (co-occurrence of word with seed) normalized on seed frequency.

EXPERIMENTS

            ENGLISH        GERMAN
DIMENSION   2,376 words
SEED        2,350 words    2,376 words
UNKNOWN     74,434 words   106,366 words

(5) Context vectors:
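The normalization described above (co-occurrence counts of an unknown word with each seed word, divided by the seed word's frequency) might look like this. The counts are made up for illustration.

```python
# Sketch: build one context vector for an unknown word by normalizing
# its co-occurrence counts with seed words on each seed's frequency.

def context_vector(cooc_counts, seed_freq):
    """cooc_counts: seed word -> co-occurrence count with the unknown word.
    seed_freq: seed word -> corpus frequency of the seed word."""
    return {s: c / seed_freq[s] for s, c in cooc_counts.items() if seed_freq.get(s)}

seed_freq = {"election": 200, "ballot": 50}      # toy seed frequencies
cooc = {"election": 20, "ballot": 5}             # toy co-occurrence counts
print(context_vector(cooc, seed_freq))  # -> {'election': 0.1, 'ballot': 0.1}
```

Normalizing by seed frequency keeps very common seed words from dominating every vector.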

Page 15: LEARNING WORD TRANSLATIONS


{ system }

IMPLEMENTATION

Full system pipeline (from the slide figure): CORPUS → (1) CORPUS CLEANING → (2) PCFG PARSING → (3) PS-TO-DS CONVERSION → (4) DATASETS (RAW TEXT, PARSED TEXT) → (5) CONTEXT GENERATOR (with SEED LEXICON) → POS VECTORS, SYN VECTORS → (6) VECTOR SIMILARITY (L1, L2) → RANK → TRANS. LIST

Page 16: LEARNING WORD TRANSLATIONS


{ evaluate }

Vector similarity metrics used are city block [Rapp 1999] and cosine. Translations are sorted in descending order of score.
Evaluation data extracted from online bilingual dictionaries (364 translations).

EXPERIMENTS

                     CITY BLOCK       COSINE
POSITIONAL CONTEXT   63 out of 364    148 out of 364
SYNTACTIC CONTEXT    301 out of 364   216 out of 364

(6) Ranked translations predictor:
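The two metrics compared above can be sketched as follows: city-block (L1) distance [Rapp 1999], where smaller means more similar, and cosine similarity, where larger means more similar. Vectors here are dicts over the shared seed dimensions; the values are made up.

```python
# Sketch: the two vector-similarity metrics used in the evaluation.

def city_block(a, b):
    """L1 distance over the union of dimensions; smaller = more similar."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)

def cosine(a, b):
    """Cosine similarity; larger = more similar."""
    num = sum(a[k] * b[k] for k in set(a) & set(b))
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

a = {"election": 0.1, "ballot": 0.1}
b = {"election": 0.2}
print(city_block(a, b))
print(cosine(a, b))
```

Note the opposite polarity: ranking by city block sorts ascending, ranking by cosine sorts descending.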

Page 17: LEARNING WORD TRANSLATIONS


{ endnote }

Extraction from non-parallel corpora is useful for compiling lexicons for new domains.
Syntactic context helps focus the context window, with more impact on longer sentences.
Non-parallel corpora involve more filtering and search heuristics than parallel corpora.
Future directions include using syntactic context on only one side and extending coverage through stemming.

CONCLUSION

Page 18: LEARNING WORD TRANSLATIONS


{ thanks }