Post on 15-Jul-2020
Copyright © 2017 Omniscien Technologies. All Rights Reserved.
TRANSLATING THE MACHINE TRANSLATION HYPE
TABLE OF CONTENTS
Introduction
Machine Translation Quality
What Is the Main Purpose of the MT Provider?
Traveling Hype Waves – SMT to NMT to Deep NMT
Bottom Line
About Omniscien Technologies
INTRODUCTION
There is a running joke in the translation industry that machine
translation will be a solved problem in 5 years. This has been updated
every 5 years since the 1950s. The promise was always there, but the
technology of the day simply could not deliver.
However, there has undoubtedly been progress. When we started our
company over 10 years ago, MT was considered a joke and ridiculed.
We looked to purchase MT technology, but were unable to find any
that could rise to the challenges we needed to address. At the time
we had the formidable task of translating all of Wikipedia and other
high-volume, very mixed content domains from English into mostly
second- or third-tier Asian languages.
To find a way to address that problem, we flew in Philipp Koehn to
introduce him to our concepts and ideas. Philipp was at the time a
promising researcher in Statistical Machine Translation (SMT); he had
recently created a large corpus of bilingual content derived from
European Parliament documents and had just released the first
version of the Moses SMT decoder, which has since become the
de facto platform for SMT. Soon after, Philipp joined our team as
Chief Scientist and has been driving research and development efforts
ever since. Philipp joining the team gave us in-depth knowledge of
MT that few had at the time and allowed us to stay ahead of the
competition from a technical perspective. While Philipp was the father
of the Moses decoder, his Master’s thesis in the 1990s was on Neural
Networks, before they became practical for machine translation.
The hype about Neural Machine Translation (NMT) and Artificial
Intelligence (AI) has hit a new peak in the last year, with claims from
Google and others that translations produced by their MT technology
are difficult to distinguish from human translation. This has raised
eyebrows, with some impressed, some in laughter and, as expected, a
lot of skepticism. Riding on the wave of attention and hype surrounding
NMT and AI are a number of companies claiming to deliver products or
solutions that leverage the latest and greatest approaches, some of
them established and some new to the space. One of the more recent
launches attempted to ride the hype wave by claiming to beat Google,
along with a number of other big-sounding claims that lack any real
meaning once you dig a little deeper into what is actually being
claimed. In making such claims, the next wave of hype that MT is so
often criticized for has been reignited, and perhaps even relaunched.
The next wave of hype is going to be deep learning AI, which in the
case of machine translation is known as Deep NMT. Today there are a
handful of MT vendors, such as Omniscien Technologies, delivering
Deep NMT commercially, as compared to those offering NMT. Expect
some confusion around the difference between NMT and Deep NMT
for some time.

But before we get too excited about claims of great quality and all
the other usual marketing and counter statements that go along with
each wave of MT quality hype, let’s take a more pragmatic view of MT:
how to read the hype, understand the issues and ask the right
questions.

MACHINE TRANSLATION QUALITY

Typical statements that have been or will be made:

- “3 out of 4 human translators liked us more than another MT
provider.”
- “We scored better on the BLEU metric than another MT
vendor.”
- “…almost impossible to distinguish between machine and
human translations”

These statements are vague and often misleading. A group of 4
humans evaluating text is not at all meaningful. The margin of error
is substantial, and while the result might be indicative, it is not
scientific. But moving beyond the number of linguists in the sample,
there are a number of other variables that are not transparent in such
simplistic statements. There is usually no background or other
information on the linguists: their qualifications, biases (e.g. who
they work for could change their views), mother tongue, domain
expertise, skill level, or what side of the bed they got up on that
morning. In other words, such vague and unqualified statements should
be seen as nothing but marketing.
There are several ways to make such statements hold water. First,
qualifying who the linguists providing the opinions on translation
quality are is important, as is the number of reviewers. In the
example of 3 out of 4, just 1 linguist changing their preference to
the competing MT provider would make it 50%. A single opinion, or a
small number of opinions, has too much impact on the confidence of
the metric. The other key issue is the size of the content that was
measured. Many test sets are very small, with as little as 100
sentences or less. With 100 sentences, there is a margin of error of
9.8%, which could easily swing scores in the direction of 2 out of 4,
or 50%. Trying to pass off a measurement on as little as 100 sentences
as representative of translation quality for an entire language is not
realistic.
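The 9.8% figure quoted above is the standard 95%-confidence margin of error for a proportion, taken at the worst case p = 0.5. A quick sketch (the 1.96 z-value and worst-case proportion are standard statistical defaults, not figures from this paper):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion measured on n samples."""
    return z * math.sqrt(p * (1 - p) / n)

# 100 sentences: roughly +/-9.8%, matching the figure above
print(round(margin_of_error(100) * 100, 1))   # 9.8
# 1,000 sentences shrink the margin to about +/-3.1%
print(round(margin_of_error(1000) * 100, 1))  # 3.1
```

As the sketch shows, growing the test set by a factor of 10 roughly triples the confidence in the result, which is why tiny test sets make such claims fragile.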
The problem with making such claims and failing to qualify them is
that they can easily be countered by competition or others in the
industry. As an example, after the recent launch of DeepL with some
big but unqualified quality claims, Slator did an informal survey
(https://slator.com/features/reader-polls-mt-quality-lsps-risk-data-
breach-universities/) which indicated only 28% preferred DeepL,
with 28% preferring Google, 23% were unsure and 21% said it
depends. This will happen nearly every time such claims are made, as
it did when Google made its big NMT announcement in 2016, claiming
output nearly indistinguishable from human translation.
A BLEU score is the most irrelevant metric if it is unqualified or used
improperly, which unfortunately happens all too often. First and
foremost, a BLEU score must be calculated from a truly blind test set.
See https://omniscien.com/comparing-machine-translation-engines for
a comprehensive list of criteria that need to be established to
provide a solid and reliable metric. Additionally, it is not uncommon
for NMT to score lower on the BLEU metric than Statistical and
Rules-Based MT, yet when reviewed by human experts, the NMT output is
more fluent and preferred. Automated metrics are very good as quality
indicators, but to have full confidence the results should be verified
by humans. That makes metrics like BLEU very useful for automated
comparison and for engine training and analysis tasks, but less so if
you want to know which translation output would be more acceptable in
the real world. Where automated metrics such as BLEU are very useful,
and do not need to be verified by humans, is when comparing the
improvement of an individual engine over time. An increase in score
when comparing a newer version of an engine to an older version of the
same engine will typically indicate a quality improvement. More on
this a little later…
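To see why BLEU rewards n-gram overlap rather than real-world acceptability, here is a deliberately simplified single-reference, unsmoothed BLEU sketch (production implementations such as sacreBLEU add tokenization rules, smoothing and multi-reference support):

```python
import math
from collections import Counter

def simple_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed, single-reference BLEU: geometric mean of 1..4-gram
    precisions times a brevity penalty. Illustration only."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp)-n+1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref)-n+1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * geo_mean

# An exact match scores 1.0; a fluent paraphrase that shares no
# n-grams with the reference scores 0, even if a human prefers it.
print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

Because the score depends entirely on surface overlap with one reference, a perfectly good alternative wording can score poorly, which is exactly why human verification still matters.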
The source of the test set being measured by any of these metrics is
also very relevant, and seldom qualified. NMT does very poorly when
the test set is out of domain, while SMT handles out-of-domain content
much better. Conversely, NMT generally handles in-domain content
better than SMT under the most common conditions. In either case, if
an engine is trained with data from the domain of the test set, it
will have a bias towards that domain and do better on that specific
test set than another engine would. As an example, if you put a hotel
description into Google, you get a pretty reasonable translation in
most cases. If you then change that to an engineering manual for a new
kind of transport such as the Hyperloop, you will find the output
typically less useful. This is simply a matter of one set of data
being more suitable than another. As such, the test set and the
training data need some level of qualification. It is equally easy to
make the test set too close to the training data, which results in
recall instead of translation. It is easy to make one MT provider look
good and another bad, and then invert the results, simply by changing
the domain of the data being measured. There is no single MT engine
that can handle all domains well, which brings me to my next point…

WHAT IS THE MAIN PURPOSE OF THE MT PROVIDER?

Understanding the purpose of the MT provider dramatically changes
the way in which quality is perceived and how translations are created.
Google, Microsoft Translator and new entrants like DeepL are trying
to translate anything for anyone, at any time. By this mission, their
MT engines must be trained on very large amounts of data to cover as
many domains as possible. While this has some benefits, it also
introduces several issues. When you translate something with a
generic MT engine, you get whatever the engine decides is best based
on its giant generic bucket of data. This usually means statistically
best, considering only the individual sentence being translated to
determine context rather than the context of the full text being
translated. The sentence “I went to the bank” is a simple sentence, but
depending on the context it is used in, the meaning is very
different. Consider the following 3 examples:

• The water was cold and the current strong. I went to the bank.
My feet sunk into the mud.
• I had run out of cash. I went to the bank. It had just closed.
• I was in the lead on the last lap and in the final turn. I went to the
bank. The g-force pulled me down into the seat.
With the additional sentences surrounding “I went to the bank.”, it is
possible to determine the context of “river bank”, “ATM” or “banking
the car into the turn”, which is quite easy for a human. However,
today that is not how MT works. MT engines currently only
understand a single sentence at a time, with each sentence in the
document often split across many machines in order to translate
more quickly. In a generic system that is trying to translate anything
in any domain, it is very easy for the machine to determine the wrong
domain and translate in the wrong context. This issue is very
commonplace for MT providers that build very large, all-encompassing
MT engines.
Specialized MT providers, such as Omniscien Technologies, address
this issue by building customized MT engines that are designed for a
purpose and trained on a smaller set of specialized in-domain
content. This often helps to resolve the issue known as Word Sense
Disambiguation – where a word can have many meanings depending
on the context of use. For example, if we customize an engine for the
Life Sciences domain, then it will have been trained to understand
the concept of a virus in the context of medicine rather than the
context of a computer virus. When training an engine on data that is
directly relevant to the domain, the engine has a deliberate bias
towards that domain and as a result when translating in-domain
content will produce a much higher quality, less ambiguous and in-
domain translation. A frequent mistake is to think that more data is
better, so data from other domains is mixed in. While this approach
may help in some cases, it can easily result in creating the same issue
faced by generic engines where data from multiple domains or even
multiple customers data can conflict and bias an engine causing
undesirable results.
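The Word Sense Disambiguation problem can be sketched with a toy example: a naive, hypothetical classifier that picks the sense of an ambiguous word from surrounding keywords. Real engines learn sense preferences statistically from in-domain training data; the keyword lists below are invented purely for illustration:

```python
# Toy word-sense disambiguation: choose the sense of "virus" by
# counting domain keywords in the surrounding text. Purely
# illustrative; not how a real MT engine resolves senses.
SENSE_KEYWORDS = {
    "medicine": {"patient", "infection", "vaccine", "doctor", "symptoms"},
    "computing": {"software", "antivirus", "malware", "computer", "email"},
}

def guess_sense(context):
    """Return the sense whose keyword set overlaps the context most."""
    words = set(context.lower().split())
    scores = {sense: len(words & kw) for sense, kw in SENSE_KEYWORDS.items()}
    return max(scores, key=scores.get)

print(guess_sense("The patient received a vaccine against the virus"))      # medicine
print(guess_sense("The antivirus software removed the virus by email scan"))  # computing
```

An engine trained only on Life Sciences data is, in effect, permanently biased towards the first column, which is exactly the deliberate bias described above.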
Additionally, some MT providers further enhance the training data
with data manufacturing and synthesizing. At Omniscien
Technologies, we do this for all custom engines as a means to extend
the data that a customer can provide us, which in turn delivers more
in-domain training data for an engine to learn from. It is very rare
that a customer alone has the quantity of data (1 million+ bilingual
sentences) needed for a high-quality engine. This is where data
manufacturing and synthesis come in. Omniscien Technologies has
developed a range of tools in this space that create in-domain
bilingual and monolingual data, leveraging the customer’s data and
domain to speed the process and build out as much as 1-2 billion words
of content to further enhance an engine. While this process takes a
little time, the result is a much higher quality in-domain engine. The
resulting
specialized engine is often fine-tuned for the correct writing style
(e.g. marketing, user manuals, engineering and technical documents),
which further increases quality and reduces editing effort.
Additionally, platforms like Omniscien Technologies’ Language Studio™
provide control features such as glossaries and terminology,
do-not-translates, formatting and complex data handling that are not
provided by generic MT offerings. Our ability to script rules in the
translation workflow means that complex content such as patents and
e-commerce can also be easily handled.
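Do-not-translate handling of this kind is commonly implemented as a pre-/post-processing wrapper around the MT engine: protected terms are swapped for placeholder tokens before translation and restored afterwards. A minimal, hypothetical sketch of that general idea (not Language Studio's actual implementation; the term list and token format are invented):

```python
DO_NOT_TRANSLATE = ["Language Studio", "Omniscien"]

def protect(text):
    """Replace protected terms with numbered placeholders before MT."""
    mapping = {}
    for i, term in enumerate(DO_NOT_TRANSLATE):
        token = f"__DNT{i}__"
        if term in text:
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping

def restore(text, mapping):
    """Put the original terms back after translation."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text

protected, mapping = protect("Language Studio handles complex content")
# ... the protected text would pass through the MT engine here ...
print(restore(protected, mapping))  # Language Studio handles complex content
```

The placeholder survives translation untouched, so the protected term comes back verbatim regardless of what the engine does to the rest of the sentence.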
The net result of a customized, in-domain engine is that it is almost
always of much higher quality than Google or other generic,
translate-anything MT engines. The downside of this approach is that
it does require some effort to customize, which in the case of
Omniscien Technologies is taken on by a team of linguistic
specialists, and there is a short delay while customization occurs
before you can begin translating. Additionally, if you were then to
translate something out of domain (e.g. sending a document that
discusses computer viruses to an engine trained for life sciences),
the resulting translation would be out of context and thus lower
quality. Just like a human with specialist domain skills, a trained
and specialized MT engine will always produce a higher quality
translation than a generic one-size-fits-all solution.
TRAVELING HYPE WAVES – SMT TO NMT TO DEEP NMT

So now that we understand some of the key issues around the hype
and the differences, capabilities and purpose of generic vs. specialized
MT providers and MT engines, what should we expect next? Our Chief
Scientist, Professor Philipp Koehn, is very optimistic about the
ongoing progressive improvement of MT technology and has worked with
our team to constantly improve our offerings. In the last year our
training time for NMT has dropped from 1 month to 1 day, and our
translation speeds on a single GPU have increased from 3,000 words per
minute to 40,000 words per minute. These are progressive improvements
that make technologies such as NMT more commercially viable and
practical for real-world use. Waiting a month for an engine to train
is not very practical. Waiting a single day makes a notable difference
in many areas, including how often an engine can be retrained and
improved. But as we head into Deep NMT, training times are again
extending, from the 1 day of NMT to 5 days. This is still practical,
as even a large SMT engine can take almost as long to train, but is it
worth the additional wait, and what will Deep NMT bring that is not
available in NMT or SMT?
One of the reasons that neural networks are in fashion again now,
after their last peak in the 1980s/1990s, is that modern hardware
allows the stacking of several layers of computing. This is known as
deep learning. Just as longer sequences of instructions in traditional
programming code allow for more elaborate algorithms, several layers
of neural networks allow for more complex processing of input signals.
In the case of deep neural machine translation, several layers of
computation enable more complex processing of the input words to
separate out their distinct senses. The deep neural machine
translation system of Omniscien Technologies uses several layers in
the encoder (to better understand input words) and in the decoder (to
be better aware of long-distance interactions between words).
In a recent project designed to compare NMT and Deep NMT
technologies, we trained 2 language pairs (English > Japanese and
Japanese > English) on a deliberately limited (~1 million segments) set
of training data in the life sciences domain. The test set was verified
as a blind test set (not in training data, but in-domain) and each engine
was trained on the identical data, with the only difference being the
technology used to train the engines. We also included metrics from
Google and Bing, providing a further comparison of a specialized MT
engine vs. generic MT engines.
EN-JA

MT Engine            BLEU   F-Measure   Levenshtein Distance   TER
Omniscien Deep NMT   48.01  77          14.78                  35.98
Omniscien NMT        36.55  70          19.24                  48.06
Google               31.74  66          20.30                  52.17
Bing                 23.00  60          24.65                  61.58

JA-EN

MT Engine            BLEU   F-Measure   Levenshtein Distance   TER
Omniscien Deep NMT   33.92  70          39.05                  49.59
Omniscien NMT        28.82  67          44.77                  56.41
Google               26.80  65          43.38                  54.58
Bing                 17.32  56          53.65                  65.97
The above metrics demonstrate the following:
• Even on a limited amount of data (~1 million segments), a
specialized engine trained on in-domain data can easily
outperform a generic MT engine such as Google or Bing that
is trained on much larger quantities of data (billions of words).
This shows that the quality and domain suitability of the data
is much more important than the volume of data.
• Although the training data was identical, the improvement
when switching from NMT to Deep NMT is notable. In the case
of EN-JA, the BLEU score increased by 11.46 points and JA-EN
the BLEU score increased by 5.1 points.
• Language pair and direction matter. Although both EN-JA and
JA-EN were trained on the identical data, the score improved
notably more for EN-JA than JA-EN.
• Had this test not been limited to ~1 million segments, and
more in-domain data added, the quality difference would
have been even higher than what is shown in the above table.
Further extending this analysis: Google has included patents in
English and German as part of its training data, but also includes a
wide variety of other data, while Omniscien Technologies has trained
engines exclusively on patent data, with approximately 12 million
segments of bilingual sentences. The comparison in the table below
shows clearly that focus on a specific domain delivers a notably
higher quality translation result than mixing wide varieties of data.
EN-DE

MT Engine            BLEU   F-Measure   Levenshtein Distance   TER
Omniscien Deep NMT   43.10  72          58.99                  38.94
Google               39.52  69          64.05                  42.22

DE-EN

MT Engine            BLEU   F-Measure   Levenshtein Distance   TER
Omniscien Deep NMT   58.80  81          40.47                  27.08
Google               52.10  78          51.28                  32.37
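Of the automated metrics in these tables, Levenshtein distance is the simplest to state precisely: the minimum number of insertions, deletions and substitutions needed to turn the MT output into the reference, so lower is better. The classic dynamic-programming computation, sketched at the word level:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    turning sequence a into sequence b (classic DP, O(len(a)*len(b)))."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# Word-level distance between a hypothetical MT output and a reference
mt = "the engine translate the document quickly".split()
ref = "the engine translated the document very quickly".split()
print(levenshtein(mt, ref))  # 2: substitute "translate", insert "very"
```

TER is closely related: it normalizes an edit distance (with block moves added) by the reference length, which is why the two columns track each other in the tables above.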
BOTTOM LINE
There are several key take-aways from this blog post:
• Expect the hype on MT to increase with some focus moving to
Deep NMT. Sometimes it will be a technological advance that
should be easily validated and proven, but in many others, it
will be just hype and marketing.
• Expect some vendors offering NMT to claim they are delivering
deep learning or Deep NMT. We have seen some MT vendors
confusing the market by making such claims.
• Any real claim of quality will eventually be verified either
formally or informally by third parties. After so much hype for
so long, anyone making quality claims should be prepared to
have those claims independently verified.
• Generic MT (irrespective of technology used) can be easily
and consistently beaten by a specialized custom engine
trained with in-domain data, even when the quantity of data
is quite limited.
• Lack of control over terminology, vocabulary and simple
business rules is a limiting factor for generic MT, and
specialized MT vendors vary in their support for features
that allow such control.
• There are many MT providers that are doing little more than
putting a “pretty” interface on top of open MT toolkits such
as OpenNMT and Moses. This allows people to make MT
engines, but not necessarily high-quality MT engines, as
many other features such as data manufacturing and
synthesis are not available. Do-It-Yourself MT suffers the
same fate – it implies you know how to do it yourself.
Specialist skills and technologies are necessary in order to
deliver optimal quality.
• More data is not the solution to high quality MT. The right
data that is focused on solving a problem is the solution. This
data can be provided by a customer or
manufactured/synthesized.
Ultimately, while machines may produce MT, humans are going to
decide how useful it is. The ultimate definition of translation quality
is “whatever the customer says it is”. The MT must be suitable for the
customer’s purpose. Haggling over whether one generic MT is better
than another ignores the most important factor: the suitability of the
MT for the purpose desired by the consumer of the MT. Meeting
customer quality requirements comes from understanding that purpose
and customizing an engine to meet it.
ABOUT OMNISCIEN TECHNOLOGIES
Omniscien Technologies is a leading global supplier of high-
performance and secure high-quality Language Processing, Machine
Translation (MT) and Machine Learning technologies and services for
content-intensive applications. Our wide range of solutions serves
clientele from various industries including the Localization Industry,
Online Research Services, Publishing, E-Commerce, Media and
Entertainment, Online Travel, Technology, Enterprise and
Government.
Omniscien Technologies has gained a reputation for cutting edge
solutions with its Language Studio™, E-Commerce Studio™ and
Media Studio™ platforms. Depending upon customers’
requirements, the platforms can be deployed in a variety of ways to
integrate with in-house data processing and translation
management systems for the localization industry as well as other
systems. The platforms offer unparalleled levels of customization
and control including feature rich pre- and post-processing, enabling
customers with even the most complex data to achieve both high
quality and high-volume output to satisfy every use case. Omniscien
Technologies has by far the most comprehensive and feature-rich
systems in the market today.
Covering more than 550 global language pairs and with a number of
industry specific solutions, Omniscien Technologies remains the
partner of choice for customers with complex, high-volume bespoke
data processing and machine translation needs.
For further information on Omniscien Technologies please visit
www.omniscien.com or contact sales@omniscien.com