Mirai Translate - TAUS Tokyo 2015

Posted on 15-Jul-2015

© 2015 Mirai Translate, Inc. All rights reserved.

Mirai Translate, Inc.

“Impossible only means that you have still screwed up the solution.”

-Mick Etoh

Number of Inbound Visitors in 2014

13,413,567

JPY 2,030,500,000,000 (EUR 15,522,200,000)

Translation Total Addressable Market (2014)

USD 2.1B

MT market: USD 10M

Unforeseen Challenges Ahead

Translation Speed (1/cost)

Quality

LSP Solutions

IP

Publication

Reports

CAT

Speech Translator

Google Translate

Web

Crowd Sourcing Solutions

SOHO SOHO

MT+Post Editing Solutions

MT Real Time Solutions

Unforeseen New Market Frontier

New frontier opened up by improved performance

72% of Japanese don’t speak English.

Vision: To realize a society in which everyone can interact freely across language barriers with the use of machine translation technology, and thereby contribute to invigoration and innovation in businesses.

Mirai Translate, Inc.

Mirai Translate as Joint Venture

Mobile Platform Leader
ASR & MT Solution Provider
Multilingual Enterprise MT developer

NLP and MT technology leader
Multilingual SMT technology leader

Technology Transfer

Our Competence

• Multiple Translation Engines from Systran and NICT
• MT Training Tools from Systran
• NLP Tools: Named Entity Extraction, Pre-Ordering, …
• NL Data Assets: Corpus from Systran and NTT DOCOMO + JPN Ontology Dictionary
• Strong Technical Team: Experience in AWS, Data Mining, and MT, working toward our own original MT systems.

Siri

Big-Data, Big-Server, and Fat-Pipe Solution

“Shabette-Concier” Voice agent service

• Launched Mar. 1, 2012

• Over 40 services in it

• Including chatting

• 10 million users

Shabette = Voice

Concier = Concierge

“How may I help you?”

Touch the Concier.
“Tell me how to make a pizza.”
View a list of pizza recipes.
You can check a detailed pizza recipe.
“Tell me Italian restaurants nearby.”
View a list of Italian restaurants.
You can check detailed information on the restaurants.

Touch the Concier.
Q: “What is the height of Mt. Fuji?”
A: “3,776m!”
Q: “When will the Tokyo Olympic Games be held?”
A: “They will be held in 2020.”

Basic Architecture 2010

Components: Logging; Fuetrek Voice Recognition; DOCOMO Task Recognition; Service Providers’ DB; Text to speech

Data exchanged: voice, text, contents

Infrastructure: Fat-Pipe, Big-Servers

Mirai Architecture 2015

Components: Logging; Fuetrek Voice Recognition; Mirai MT Engines; Client Dictionary Corpus DB; Text to speech

Data exchanged: voice, text, contents
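The 2015 architecture slide can be read as a chain of stages: voice recognition, machine translation backed by a client dictionary, and text to speech, with logging alongside. The sketch below shows that chain in miniature. Every function name and the toy dictionary are hypothetical stand-ins for illustration; none of them is a Fuetrek or Mirai Translate API.

```python
from typing import Dict, List

# Hypothetical client dictionary; a real deployment would use trained MT engines.
CLIENT_DICTIONARY: Dict[str, str] = {
    "hello": "こんにちは",
    "thank you": "ありがとう",
}


def recognize_speech(audio: bytes) -> str:
    """Stand-in for voice recognition: the 'audio' is just UTF-8 text."""
    return audio.decode("utf-8").strip().lower()


def translate(text: str) -> str:
    """Stand-in for the MT engine: look the phrase up, else pass it through."""
    return CLIENT_DICTIONARY.get(text, text)


def synthesize(text: str) -> bytes:
    """Stand-in for text to speech: return the text as bytes."""
    return text.encode("utf-8")


def run_pipeline(audio: bytes, log: List[str]) -> bytes:
    """Voice in, voice out, logging the intermediate text at each stage."""
    text = recognize_speech(audio)
    log.append(f"asr: {text}")
    translated = translate(text)
    log.append(f"mt: {translated}")
    return synthesize(translated)
```

Keeping each stage behind a plain function boundary is one way to let a task-recognition stage (as on the 2010 slide) be swapped for MT engines without touching the recognition or speech stages.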

We are Cloud Natives who believe our cloud solution is scalable and safer!

System components

SYSTRAN Hybrid Architecture

SYSTRAN 7 HYBRID ENGINE

Source → Translation

Processing steps: Pre-Filter; Formatting, Normalization, Segmentation; Entity Recognition; Translation Memory and User Dictionary Match; Post-Processing (Formatting, Normalization); Post-Filter

Engines: Rules-Based MT; Statistical MT; Statistical Post-Edition

Linguistic resources: Main Dictionaries; Linguistic Rules; User Entities; Bilingual User Dictionaries; Source Normalization Dictionaries; Target Normalization Dictionaries; Translation Memories; Spell Check; Homographs

Statistical models: Bilingual Translation Models; Target Language Models; Source Language Models

Training data: Target Monolingual Corpus; Monolingual Source Corpus; Bilingual Corpus or Translation Memories

Training processes: Self-Training; Source Adaptation; Bilingual Terminology Extraction

a Commercial SMT Engine
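To make the hybrid idea on this slide concrete, here is a toy sketch of one possible ordering of the steps: normalize the source, try a translation-memory match, otherwise produce a rule-based draft and pass it through a post-editing step. All dictionaries and function names are invented for illustration, and the lookup table only stands in for a trained statistical post-edition model. It is not SYSTRAN's implementation.

```python
from typing import Dict

# Invented toy resources (English to French) for illustration only.
TRANSLATION_MEMORY: Dict[str, str] = {"good morning": "bonjour"}
MAIN_DICTIONARY: Dict[str, str] = {
    "the": "le", "cat": "chat", "sleeps": "dort", "house": "maison",
}
# Stand-in for a statistical post-edition model: fixes seen in past corrections.
POST_EDIT_TABLE: Dict[str, str] = {"le maison": "la maison"}


def pre_filter(source: str) -> str:
    """Normalize case and whitespace before matching."""
    return " ".join(source.lower().split())


def rule_based_mt(segment: str) -> str:
    """Word-by-word dictionary lookup; unknown words pass through."""
    return " ".join(MAIN_DICTIONARY.get(word, word) for word in segment.split())


def statistical_post_edit(draft: str) -> str:
    """Apply learned corrections to the rule-based draft."""
    for wrong, right in POST_EDIT_TABLE.items():
        draft = draft.replace(wrong, right)
    return draft


def translate(source: str) -> str:
    segment = pre_filter(source)
    if segment in TRANSLATION_MEMORY:      # exact translation-memory match wins
        return TRANSLATION_MEMORY[segment]
    return statistical_post_edit(rule_based_mt(segment))
```

In this arrangement the rule-based engine guarantees coverage and predictable terminology, while the post-editing step absorbs corrections learned from a bilingual corpus.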

NTT Technology for JPN <-> EN

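He saw a cat with a long tail

English: this is Keiko Tanaka .
Pre-ordered: this _va0 Keiko Tanaka is .
Japanese: 田中 恵子 と 申し ます

English: i used to jog every morning .
Pre-ordered: i _va0 every morning jog to used .
Japanese: 毎朝 ジョギング し た もの です 。

English: she was wearing a sweater and high heels .
Pre-ordered: she _va0 sweater and high heels _va2 wearing was .
Japanese: セーター を 着 て 、 ハイヒール を はい て い まし た 。

[Diagram: the words He, saw, cat, with, long tail rearranged, with the particles が and を inserted]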

Post-Positional Particles
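The three examples above show English tokens rearranged toward a verb-final order, with placeholder tokens such as _va0 and _va2 that appear to mark where post-positional particles belong. The sketch below reproduces that rearrangement for hand-segmented clauses. The `Clause` structure, the article-dropping rule, and the placement of the placeholders are assumptions inferred from the slide's examples; a real pre-ordering system would work from a syntactic parse, and this is not NTT's algorithm.

```python
from dataclasses import dataclass, field
from typing import List

ARTICLES = {"a", "an", "the"}


@dataclass
class Clause:
    """A hand-segmented English clause (hypothetical structure)."""
    subject: List[str]
    verbs: List[str]                               # verb chain in English order
    rest: List[str] = field(default_factory=list)  # object, complement or adverbial
    rest_is_object: bool = False                   # True adds the _va2 placeholder


def pre_order(clause: Clause) -> str:
    """Emit subject, placeholder, remaining material, then the verb chain reversed."""
    rest = [word for word in clause.rest if word not in ARTICLES]
    tokens = clause.subject + ["_va0"] + rest
    if clause.rest_is_object:
        tokens.append("_va2")
    tokens += list(reversed(clause.verbs))
    return " ".join(tokens + ["."])


example = Clause(["i"], ["used", "to", "jog"], ["every", "morning"])
print(pre_order(example))  # i _va0 every morning jog to used .
```

Reordering the source before translation lets a phrase-based engine translate mostly monotonically, which is the motivation usually given for pre-ordering between languages with very different word order.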

Corpus is the king, but it must be decent and well-structured.

Not only Size (Coverage) but also Fitness.

Diagram labels: Written Language Corpus Variation; Spoken Language Corpus Variation; Generic Corpus; Commerce; Patent Application; Finance; Travel; Public Patents; Ideal Corpus Data

SYSTRAN Training Server ‒ Main components

• Corpus Manager
  • Mono/bilingual corpus
  • Txt, html, doc, docx, rtf, xlsx, pptx, pdf, tmx
  • Virtual file management (aggregation, split)
  • Content Management Database (TU: Translation Units)

• Training Manager
  • Baseline Evaluation (Quality metrics: GTM, BLEU, TER)
  • Hybrid Model Training (SPE: Statistical Post-Edition)
  • Statistical Model Training (SMT: Statistical Machine Translation)

• Dictionary creation (UD) with bilingual terminology extraction
• Dictionary validation (UD) against a bilingual corpus (TMX)
• Translation Memory creation (TM) with document aligner
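Of the quality metrics named for baseline evaluation, BLEU is the simplest to show in a few lines. The sketch below is a minimal, unsmoothed, single-reference, sentence-level BLEU (clipped n-gram precision up to 4-grams with a brevity penalty). It illustrates the metric only; it is not the scoring code of the SYSTRAN Training Server, and production tools add tokenization rules, smoothing and corpus-level aggregation.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against a single reference."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precision_sum += math.log(overlap / total)
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum / max_n)
```

For example, scoring "the cat sat on the mat" against the reference "the cat sat on a mat" gives n-gram precisions of 5/6, 3/5, 2/4 and 1/3, so the score is the fourth root of their product, 1/12.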

Training Methodology

Collect Data → Run Training → Evaluate → Publish to Pilot/Production

• Collect training data
  • Define the domain
  • Collect bilingual corpus (translation memories, documents and translations)
  • Collect monolingual corpus (text, content relevant to the domain)
  • Collect terminology if any (bilingual dictionaries, glossaries)

• Run initial training

• Evaluate

• Perform incremental cycles
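The methodology above is a loop: add data, retrain, evaluate, and repeat until the engine is good enough to publish. A minimal sketch of that loop follows. The `train` and `evaluate` callables, the batch structure and the `target_score` threshold are all hypothetical; the slide does not say how the stopping decision is made.

```python
from typing import Any, Callable, List, Tuple


def run_training_cycles(
    train: Callable[[List[str]], Any],
    evaluate: Callable[[Any], float],
    data_batches: List[List[str]],
    target_score: float,
) -> Tuple[Any, float, int]:
    """Retrain on a growing corpus until the score reaches target_score
    or the collected data runs out. Returns (best model, best score, cycles)."""
    corpus: List[str] = []
    best_model: Any = None
    best_score = float("-inf")
    cycles = 0
    for batch in data_batches:
        corpus.extend(batch)            # collect data
        model = train(list(corpus))     # run training on everything so far
        score = evaluate(model)         # evaluate
        cycles += 1
        if score > best_score:
            best_model, best_score = model, score
        if best_score >= target_score:
            break                       # ready to publish to pilot/production
    return best_model, best_score, cycles
```

Keeping the best model seen so far, rather than the latest one, guards against a cycle in which newly added data lowers the score.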

V.S.

Our Business Targets
• 3 main markets

Localization
• Customers: Translation Agencies & Corporations
• Business case: Reduce costs and timelines for translation projects
• Usages & Applications: Multilingual Web Site, Technical Translation Project, Translation Workflow Integration

Multilingual Communication
• Customers: Corporations & Public Organizations
• Business case: Help and secure information communication
• Usages & Applications: Collaboration Tools, Intranet Translation Portal, Web & Mobile Apps, Customer Service Portal

Big Data by HPC
• Customers: Defense & Securities & Legal Organizations
• Business case: Detect critical information within large scale foreign data
• Usages & Applications: Market Intelligence, Cyber-security, Forensic & eDiscovery Apps, Text Mining & Analytics

Multilingual MT: JP, EN, CN, KR + ASEAN

Enterprise Solutions

Consumer Services

We are an engineering company…

MT APIs, TMS

“It always seems impossible until it’s done.” - Nelson Mandela

As part of the Tomorrow television series produced by CBS for MIT's Centennial in 1961

Their dreams are coming true.

Mirai Translate, Inc.

@mickbean