How can Artificial Intelligence use Big Data for Translating Documents? John Ortega ∙ November 27, 2019 ∙ Vilnius, Lithuania

Page 1: How can Artificial Intelligence use Big Data for Translating Documents?

John Ortega ∙ November 27, 2019 ∙ Vilnius, Lithuania

Page 2: Who am I and what do I do?

Page 3: Machine Translation

"There is no need to do more than mention the obvious fact that a multiplicity of languages impedes cultural interchange between the peoples of the earth, and is a serious deterrent to international understanding." -- Warren Weaver

Page 4: Overview

What I will cover today (1940 to 2019: beginning to present)

• State-of-the-Art
• Two Paradigms
• Basic Introduction
• Evolution and Motivation
• Winding Down

Page 5

The challenge and first systems.

• 1940: A Challenge for a Basic Need
• Rule-Based MT Systems
• Example-Based MT Systems
• Statistical and Neural-Based Systems with Big Data
• 2019: Quality to Achieve Human-Like Translations

Page 6

Page 7: Machine Translation

Input and Output

Source Sentence → MT System → Translation

Source Sentence: The dog barks at night

Translation: El perro ladra por la noche

The MT system is generally a black box, but we need to know what it does.

Page 8: Machine Translation

Breaking down the paradigms

MT
• EBMT: Analogy
• RBMT: Direct, Transfer, Interlingual
• SMT: Word, Phrase
• NMT: CNN, RNN

Page 9: Machine Translation

Sector use in production is mixed

• High Use: RBMT
• Medium Use: SMT
• Low Use: NMT
• Hybrid Machine Translation: RBMT with NMT or SMT

[Chart: usage shares by paradigm. Values shown: 20%, 60%, 10%, 60%]

Source: https://www.psmarketresearch.com/market-analysis/machine-translation-market

Page 10: Machine Translation

Contrasting strengths of each machine translation system

[Figure: Example-Based, Rule-Based, Statistical, and Neural systems compared on Punctuation, Long Sentences, High Quality, and Fast Training]

Page 11: Rule-Based Machine Translation

An editing process

• Read Source Language Input and Add Rule
• System Digests Rules and Includes in Catalog
• Apply Rule To Target Language and Verify
• System Applies Hierarchy According to Language Morphology
• System Provides Rule-Based Translation
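
As a rough illustration of the rule catalog described on this slide, the sketch below stores hand-written rules in a dictionary and applies the longest matching rule first. The rules themselves, the names `add_rule` and `translate`, and the longest-match priority are assumptions made for this example; production rule-based systems also handle morphology and reordering, which this toy version does not.

```python
# Toy rule-based translation: a hand-written catalog plus longest-match lookup.
# All rules and function names are hypothetical, chosen for illustration only.


def add_rule(catalog, source_phrase, target_phrase):
    """Digest one rule into the catalog, keyed by its source-language tokens."""
    catalog[tuple(source_phrase.lower().split())] = target_phrase.split()


def translate(sentence, catalog):
    """Apply catalog rules left to right, preferring the longest matching rule."""
    tokens = sentence.lower().split()
    output = []
    position = 0
    while position < len(tokens):
        # Try the longest span first so phrase rules outrank single-word rules.
        for end in range(len(tokens), position, -1):
            span = tuple(tokens[position:end])
            if span in catalog:
                output.extend(catalog[span])
                position = end
                break
        else:
            # No rule applies: copy the source word through unchanged.
            output.append(tokens[position])
            position += 1
    return " ".join(output)


catalog = {}
add_rule(catalog, "the", "el")
add_rule(catalog, "dog", "perro")
add_rule(catalog, "barks", "ladra")
add_rule(catalog, "night", "noche")
add_rule(catalog, "at night", "por la noche")  # phrase rule outranks "night"

print(translate("The dog barks at night", catalog))
```

Because every output word can be traced back to a specific catalog entry, a wrong translation is fixed by editing or adding a rule, which is the editing loop the slide describes.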

Page 12: Rule-Based Machine Translation

Using hand-tailored rules

Page 13: Statistical-Based Machine Translation

Using various statistical methods to choose the best translation

• Traceable
• Customization
• High Quality
• Repeatable

BLEU is the scoring mechanism for translation quality!
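
To make the BLEU score more concrete, here is a simplified sentence-level sketch: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing, so it is not a substitute for a standard BLEU tool; the function names are chosen for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU in [0, 1]: one reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates that are shorter than the reference.
    penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(sum(log_precisions) / max_n)


reference = "el perro ladra por la noche"
print(round(100 * bleu("el perro ladra por la noche", reference), 1))
print(round(100 * bleu("el perro ladra por la tarde", reference), 1))
```

The scores are multiplied by 100 when printed because BLEU is usually reported on a 0 to 100 scale.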

Page 14

Statistical MT Based on Phrases: Moses

[Figure values: 12%, 53%]

Page 15: Statistical Machine Translation

Comparison to Neural Machine Translation

BLEU scores for Average Sentence Lengths

[Chart: BLEU scores (axis 0 to 40) for SMT and NMT at average sentence lengths 40, 50, 60, 70, and 80. Values shown: 30.3, 31, 33, 32.3, 34.7, 33.8, 31.5, 31.3, 33.9, 27.7. Annotation: more than 6 points of difference.]

Page 16: Statistical Machine Translation

Well-known research paper by Koehn and Knowles on six challenges for neural machine translation

Page 17: Neural Machine Translation

The new kid on the block

• Slow Training
• Highly Complex
• High Quality
• Less Error

Page 18: Neural Machine Translation

Phrase-Based MT

Neural MT generally captures semantic details well.

Page 19: Neural Machine Translation

An encode-decode model machine translation system

01 Decode: Choose the best hypothesis
02 Encode: Formulate a hypothesis

Encoder-Decoder Model
• Sentence-by-Sentence
• Embeddings on the Source are Encoded
• Classifications are the Target language
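
The two stages named on this slide can be sketched with a deliberately tiny, untrained example: source embeddings are encoded into one context vector, and decoding is a classification over the target vocabulary. The two-dimensional vectors, the vocabularies, and the averaging encoder are assumptions made for illustration; a real system learns these weights and uses a recurrent or convolutional network rather than an average.

```python
import math

# Hypothetical hand-picked vectors; real models learn hundreds of dimensions.
SOURCE_EMBEDDINGS = {"dog": [1.0, 0.0], "night": [0.0, 1.0]}
TARGET_WEIGHTS = {"perro": [1.0, 0.0], "noche": [0.0, 1.0]}


def encode(source_tokens):
    """Average the source embeddings into one fixed-size context vector."""
    vectors = [SOURCE_EMBEDDINGS[token] for token in source_tokens]
    size = len(vectors[0])
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(size)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def decode_step(context):
    """Classify over the target vocabulary and return the most probable word."""
    words = list(TARGET_WEIGHTS)
    scores = [sum(c * w for c, w in zip(context, TARGET_WEIGHTS[word])) for word in words]
    probabilities = softmax(scores)
    best = max(range(len(words)), key=probabilities.__getitem__)
    return words[best], probabilities[best]


word, probability = decode_step(encode(["dog"]))
print(word, round(probability, 3))
```

Choosing the highest-probability word here corresponds to "choose the best hypothesis"; full decoders repeat this step word by word and usually keep several candidate hypotheses at once.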

Page 20: Neural Machine Translation

OpenNMT – a free neural machine translation system

• Free out-of-the-box system
• Pre-trained word embeddings
• RNN
• Guillaume Klein et al. in 2017 from Harvard and Systran
• Predecessor – Nematus

Page 21: Neural Machine Translation

OpenNMT – a free neural machine translation system

OpenNMT: Neural Machine Translation Toolkit – Klein et al.

Page 22: Neural Machine Translation

Testing neural machine translation performance

A comparison was done with Knowles on various MT paradigms and testing on fuzzy-match repair.

Fuzzy-Match Repair Test

Page 23: Neural Machine Translation

The best non-hybrid machine translation system at this moment

Page 24: Big Data

Training our system by transferring knowledge into it

Data that is typically not easily computed using conventional personal computer mechanisms.

• Books: 20 MB
• Chat Data
• News Articles

[Other sizes shown: 5 MB, 10 MB]

Page 25: Big Data

High-performance processing of data

CPU, GPU, TPU

• Billions of words
• Tokenization and Cleansing in hours
• Tagging in hours or a day
• Modeling can take weeks or months

Page 26: Big Data

Transfer knowledge from one source to another

Knowledge
• Multiple sources and languages
• Transfer from other models
• Performance-based

Computations
• Pre-calculated models (Embeddings)
• Only time to find index (hash lookup)
• Vector Space Modeling

Page 27: Word Embeddings

Capture knowledge to transfer from one source to another

• Pre-trained
• Based on Words
• Represent Billions of Words
• Knowledge Transfer
• Trained with Neural Networks
• No Guarantees
• Semantic Knowledge Transfer

Page 28: Word Embeddings

Building an input classifier from learned knowledge

• Input to a model (linear or not)
• Use vectors from a previous run
• Provide similarity that is otherwise not easy to measure
• Easy to load

By embedding learned knowledge, we can also capture semantic context for translation.
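
As a minimal sketch of the "easy to load" and "similarity" points, the code below reads vectors in the plain-text layout used by word2vec (a header with the vocabulary size and dimension, then one word and its values per line) into a dictionary, so each lookup is a hash lookup. The three sample vectors are invented for illustration and are not taken from any real pre-trained model.

```python
import io
import math


def load_embeddings(handle):
    """Read word2vec-style text: a 'count dimension' header, then 'word v1 v2 ...'."""
    _count, dimension = (int(field) for field in handle.readline().split())
    table = {}
    for line in handle:
        fields = line.split()
        if len(fields) != dimension + 1:
            continue  # skip blank or malformed lines
        table[fields[0]] = [float(value) for value in fields[1:]]
    return table


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented sample data standing in for a downloaded embeddings file.
SAMPLE = io.StringIO("3 3\ndog 0.9 0.1 0.0\npuppy 0.8 0.2 0.1\nnight 0.0 0.1 0.9\n")
vectors = load_embeddings(SAMPLE)

print(round(cosine_similarity(vectors["dog"], vectors["puppy"]), 3))
print(round(cosine_similarity(vectors["dog"], vectors["night"]), 3))
```

With a real file, `SAMPLE` would be replaced by an opened file handle; the related words should then score closer to 1 than the unrelated ones.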

Page 29: Word Embeddings

Steps for creating our own word embeddings

Tokenize / Cleanse
• Stem words or lemmatize
• Remove stop words

Training
• A classification task using a Neural Network
• Bag-of-words or Skip-Gram

Saving
• Freeze model after classification
• Indexes saved to disk for new words
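
The first two steps can be sketched as follows: tokenize and cleanse the text, then turn it into (center word, context word) pairs, which are the training examples for the skip-gram classification task. The stop-word list and window size are assumptions for this example, and stemming or lemmatization is left out to keep it short.

```python
import re

# Tiny hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "at", "a", "an", "of"}


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cleanse(tokens):
    """Drop stop words (stemming or lemmatizing would also happen here)."""
    return [token for token in tokens if token not in STOP_WORDS]


def skip_gram_pairs(tokens, window=1):
    """Pair each center word with every neighbor inside the window."""
    pairs = []
    for index, center in enumerate(tokens):
        start = max(0, index - window)
        end = min(len(tokens), index + window + 1)
        for neighbor in range(start, end):
            if neighbor != index:
                pairs.append((center, tokens[neighbor]))
    return pairs


tokens = cleanse(tokenize("The dog barks at night."))
print(tokens)
print(skip_gram_pairs(tokens))
```

A network trained to predict the context word from the center word ends up with one weight vector per vocabulary word, and those vectors are what get frozen and saved as the embeddings.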

Page 30: Word Embeddings

Types of word embeddings

• Word2Vec: Skip-Gram or CBOW
• GloVe: Log-bilinear probability representation
• FastText: Word2Vec extension

Many word embeddings are freely downloadable for almost any project!

The main idea is that an algorithm builds a co-occurrence matrix.
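
A co-occurrence matrix simply counts how often two words appear near each other. The sketch below stores it sparsely as a counter keyed by word pair; the window size, the two-sentence corpus, and the function name are assumptions for this example.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only to the right so each pair is counted once per occurrence.
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
    return counts


corpus = ["the dog barks at night", "the dog sleeps at night"]
counts = cooccurrence_counts(corpus)

print(counts[("dog", "the")])
print(counts[("barks", "dog")])
```

Words that share many neighbors, such as "barks" and "sleeps" here, get similar rows in the matrix, which is the signal that embedding methods compress into dense vectors.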

Page 31

Word2Vec: https://code.google.com/archive/p/word2vec/

OpenNMT Torch Loading

    -- Lua (Torch7): create a uniformly initialized embedding matrix and save it.
    require 'torch'
    local vocab_size = 50004
    local embedding_size = 500
    local embeddings = torch.Tensor(vocab_size, embedding_size):uniform()
    torch.save('enc_embeddings.t7', embeddings)

Physical Appearance

Page 32: Word Embeddings

Train a model to use word embeddings

Training
• Several convolutions through hidden layers
• Rectified Linear Unit (ReLU) and Softmax activation
• Embedding vectors used for each word or words

Decoding or Classification
• Sentence segments contain words from embeddings
• Classification is done using dense layers to match desired classes
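
To show what the two activations and a dense layer do, here is a small forward pass over one embedding vector. The weights are hand-picked, untrained values chosen for illustration, and the hidden layer is a plain dense layer rather than a convolution, so this is a sketch of the classification step only.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(x * w for x, w in zip(inputs, row)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Rectified Linear Unit: negative values become zero."""
    return [max(0.0, value) for value in values]


def softmax(values):
    """Convert scores into class probabilities that sum to 1."""
    peak = max(values)  # subtract the peak for numerical stability
    exps = [math.exp(value - peak) for value in values]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical embedding and weights (2 inputs -> 3 hidden units -> 2 classes).
embedding = [0.5, -1.0]
HIDDEN_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
HIDDEN_BIASES = [0.0, 0.0, 0.0]
OUTPUT_WEIGHTS = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
OUTPUT_BIASES = [0.0, 0.0]

hidden = relu(dense(embedding, HIDDEN_WEIGHTS, HIDDEN_BIASES))
probabilities = softmax(dense(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES))

print(hidden)
print([round(p, 3) for p in probabilities])
```

Training adjusts the weight matrices so that the highest probability lands on the desired class for each input.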

Page 33: Word Embeddings

Attention-based machine translation using word embeddings

Word embeddings serve as input word vectors to an RNN for source language words.

• Sentence by sentence encoding/decoding
• Context captured and carried forward
  • Grammar
  • Syntax
  • Phrasal rules
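
The attention idea can be sketched with dot-product scoring: the decoder state is compared with every encoder state, the scores become weights through softmax, and the context vector is the weighted sum of encoder states. Dot-product scoring is one common variant and is an assumption here, as are the hand-picked state vectors; the slide does not say which attention form was used.

```python
import math


def softmax(values):
    """Convert scores into weights that sum to 1."""
    peak = max(values)  # subtract the peak for numerical stability
    exps = [math.exp(value - peak) for value in values]
    total = sum(exps)
    return [value / total for value in exps]


def attend(decoder_state, encoder_states):
    """Return attention weights and the weighted-sum context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, state)) for state in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    context = [
        sum(weight * state[i] for weight, state in zip(weights, encoder_states))
        for i in range(size)
    ]
    return weights, context


# Hypothetical states: one encoder vector per source word.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
print([round(weight, 3) for weight in weights])
print([round(value, 3) for value in context])
```

The largest weight falls on the encoder state most aligned with the decoder state, which is how the model decides which source word to focus on for the next target word.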

Page 34: Neural Machine Translation with Word Embeddings

Comparison of Nematus to OpenNMT with English and German

English German
• BLEU score measurement improves
• Speed goes down

In-Domain
• Performs best on in-domain data
• Little difference between Nematus and OpenNMT

Other Languages
• Good for low-resource languages
• Good for Romance languages (ES, FR, PT, IT)
• Not so good for some language pairs (EN-RU)

Page 35: Artificial Intelligence

Can we make our MT system learn like humans?

• Reinforcement Learning: Learn by probing
• Transfer Learning: Bring in knowledge
• Heuristic Learning: Artifact knowledge over time
• Mixed Learning: Combine several types

Page 36: The Best Machine Translation System

Which one should you use for your project?

• Training Time
• Language pairs
• Cost
• Size on disk
• Human quality
• BLEU scores
• Easy to reproduce
• Featureless

Page 37: THANKS!