-
Multilingual Digital Libraries Multilingual and Crosslingual Retrieval and Access
Sudeshna Sarkar
IIT Kharagpur
UNESCO-NDL INDIA INTERNATIONAL WORKSHOP ON KNOWLEDGE ENGINEERING FOR DIGITAL LIBRARY DESIGN
-
Many languages
World
• 6909 living languages
India
• India has 122 major languages (22 are constitutionally recognized) and 2371 dialects (Census 2001)
• Multiple scripts
• Population of literates
– 20% of India understand English
– 80% cannot
• Rich collection of books and materials in different Indian languages
• The Divide
– The availability of resources, training data, and benchmarks in English leads to a disproportionate focus on a few languages and a neglect of many widely spoken languages
-
Multilingualism and Universal Access
• Promotion and use of multilingualism
• Promotion of linguistic and cultural diversity
• Universal access to cyberspace
• Not all languages are equally represented. The universal digital library should ideally bridge this gap, letting users find objects in languages different from their native one.
-
Diversity of languages in India (percentage of speakers)
• Hindi: 41.03
• Bengali: 8.11
• Telugu: 7.19
• Marathi: 6.99
• Tamil: 5.91
• Urdu: 5.01
• Gujarati: 4.48
• Kannada: 3.69
• Malayalam: 3.21
• Oriya: 3.21
• Punjabi: 2.83
• Assamese: 1.28
• Maithili: 1.18
• Santali: 0.63
• Kashmiri: 0.54
• Nepali: 0.28
• Sindhi: 0.25
• Konkani: 0.24
• Dogri: 0.22
• Manipuri: 0.14
• Bodo: 0.13
• Sanskrit: negligible
• Non-scheduled languages: 3.44
Motto of NDL: Inclusivity Reach every Indian
-
Roles of Language in Digital Libraries
• Language associated with objects
– Metadata
– Content
• Language of interactions
– Such as query
• Interface language
• A multilingual digital library is a digital library that has all functions implemented simultaneously in as many languages as desired, and whose search and retrieval functions are language independent.
• Features
– The user can choose the language of the interfaces
– The user can choose the interaction language
– The user can access metadata and content in any desired language
-
Multilingual Search and Access
• Cross-Language Information Retrieval (CLIR)
– Querying multilingual collections in one language in order to retrieve documents in another language
• Multilingual Information Retrieval (MLIR)
– Process information (queries, documents, both) in multiple languages
• Multilingual Information Access (MLIA)
– Query, retrieval and presentation of information in any language
[Diagram: in CLIR, a query in the source language (SL) retrieves documents and returns results in a target language (TL); in MLIR, a query in language L_i is matched against document collections in many languages (L_1, L_2, …) and returns results across them.]
-
The ultimate multilingual search system
“Given a query in any medium and any language, select relevant items from a multilingual multimedia collection which can be in any medium and any language, and present them in the style or order most likely to be useful to the querier, with identical or near identical objects in different media or languages appropriately identified.”
Douglas W. Oard and David Hull, AAAI Symposium on Cross-Language IR, Spring 1997, Stanford, USA
-
Enabling Multilinguality of Data
Data = content plus metadata
• Language Identification
– Associate language tag with data.
– Automatic language identification.
• Multilingual Indexing and Presentation
– controlled vocabularies for metadata fields such as subject with different values for different languages
– Mapping of words across languages
– Automatic translation of metadata fields such as title and abstract
– Automatic translation of content
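Automatic language identification is often done with simple character n-gram statistics. A minimal, hypothetical sketch (the sample sentences and language tags below are invented for illustration):

```python
# Character-trigram profiles for automatic language identification.
from collections import Counter

def char_ngrams(text, n=3):
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def train_profiles(samples):
    # samples: {language_tag: [sentence, ...]}
    return {lang: sum((char_ngrams(s) for s in sents), Counter())
            for lang, sents in samples.items()}

def identify(text, profiles):
    query = char_ngrams(text)
    # score = number of shared trigram occurrences with each language profile
    def overlap(lang):
        return sum(min(c, profiles[lang][g]) for g, c in query.items())
    return max(profiles, key=overlap)

samples = {
    "en": ["the library holds many books", "search the collection"],
    "fr": ["la bibliothèque contient des livres", "chercher la collection"],
}
profiles = train_profiles(samples)
```

Real identifiers use far more training text and smoothing, but the profile-overlap idea is the same.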
-
Enabling Effective User Interactions
• Enable querying in different languages
• Assist in query formation, query translation and query selection
• User feedback on result examination
• Assistance in Browsing and Data Translation
-
Language Technology
enabling Multilingual Digital Library
-
Normalization and spelling variation
• अगँरेजी, अगँरेजी, अगेँजी, अगेँजी, अगंरेजी, अगंरेजी, अगेंजी, अगेंजी
• अनतरराषरीय, अनतरराषरीय, अनतराररषरय, अनतरारषरीय, अतंरराषरीय, अतंरारषरीय, अतंरराषरीय, अनतराि षरय, अनतराषरीय
• পাখি, পািী • কে াা, কো
Stemming and Lemmatization
• Stemming: different morphological variations of the same word map to the same canonical form; adequate for monolingual search
• Lemmatization: map words to their root dictionary form; needed for effective cross-lingual and multilingual search
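To make the contrast concrete, here is a toy sketch: a crude suffix stripper standing in for a stemmer, and a small lookup table standing in for a lemmatizer. Real systems would use a Snowball-style stemmer or a morphological analyser; all words and rules here are illustrative.

```python
# Crude stemmer: strip a known suffix if enough of the word remains.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

# Toy lemmatizer: dictionary lookup, falling back to the stem.
LEMMA_TABLE = {"went": "go", "better": "good", "libraries": "library"}

def lemmatize(word):
    return LEMMA_TABLE.get(word, stem(word))
```

Note that the stemmer maps "libraries" to the non-word "librar", while the lemmatizer returns the dictionary form "library", which is what cross-lingual matching needs.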
Tokenization, Script rendering
Indexing, Query formation, retrieval, ranking
-
Multilingual Digital Library: A Reality?
• The Divide
– Resource-rich languages: lots of resources, annotated resources
– Resource-poor languages: little or no resources; annotated resources difficult to create
• MT is not yet good enough to translate large texts or to handle specialized technical domains
• But technology and NLP have made huge strides!
– High-quality tools have pushed the accuracy of MT systems
– Tremendous growth in data and resources
-
Neural Networks in NLP
• Neural networks and deep learning are the state of the art in many NLP problems
– Entity and Relation finding
– Parsing
– Machine Translation
– Entity linking
• Transfer learning from resource rich to resource poor languages. Cross-lingual transfer learning enablers
– Shared representation
– Bilingual and multilingual resources
-
What is required for building Multilingual Libraries
Retrieval
• Indexing objects in multiple languages
• Enabling access through query in any language
• Challenges: dictionary coverage, synonymy and polysemy, OOV words
• Language independent indexing
Access
• Enable access in language of familiarity of user
– Machine Translation
-
Representation: Word Embedding
• Continuous vector space representation: a dense vector x_i ∈ ℝ^d is assigned to each word
• Embeddings are learned so that similar words are nearby in the space
• They capture regularities and relationships between words
A key secret sauce for the success of many NLP systems across tasks
Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al., NIPS 2013
-
Multilingual Embeddings
[Figure: English and French words in one shared embedding space, with translation pairs lying close together: children/enfants, money/argent, law/loi, life/vie, world/monde, country/pays, war/guerre, peace/paix, energy/énergie, market/marché.]
• Learn a shared embedding space between words in all languages.
• Many benefits:
– Transfer learning
– Cross lingual information retrieval
– MT
– Parallel corpus extraction
Mikolov et al., 2013; Faruqui & Dyer, 2014; etc.
-
Multilingual word embedding
• Resources:
– Word aligned data
– Sentence aligned data
– Dictionary
– Document Aligned Corpora
Methods:
• Monolingual mapping
– Train monolingual word embeddings and learn linear mapping or CCA
• Pseudo-cross-lingual:
– Train on pseudo-cross-lingual corpus by mixing contexts of different languages
• Cross lingual training
– Train on parallel corpus
• Joint Optimization
– jointly optimise a combination of monolingual and cross-lingual losses.
-
Learning from Parallel Corpora
• Uses monolingual datasets to learn monolingual features
• Uses a sampled bag-of-words from each parallel sentence as the cross-lingual objective
• Learns representations that model the individual languages well
• Encourages representations to be similar for words that are related across the two languages
-
Linear Projection
Learn monolingual embeddings, then project to a common space using a dictionary by solving

  min_W Σ_{i=1}^{n} ‖W h_i − e_i‖²

Final embedding of a source word h_i = W h_i

¹Mikolov et al., Exploiting Similarities among Languages for Machine Translation
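The projection step can be sketched with synthetic data: solve min_W Σ_i ‖W h_i − e_i‖² by ordinary least squares, then map source vectors into the target space. The matrices below are random stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.standard_normal((4, 4))   # hidden "true" mapping
H = rng.standard_normal((50, 4))       # source-language embeddings h_i
E = H @ W_true.T                       # their target-language translations e_i

# Least squares: H @ W.T ~= E, so W.T = lstsq(H, E)
W = np.linalg.lstsq(H, E, rcond=None)[0].T

def project(h):
    # final embedding of a source word h is W h
    return W @ h
```

With real, noisy dictionaries the fit is approximate rather than exact, but the solve is identical.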
-
Cross-lingual word vector projection using CCA
• Take subsets Σ′ ⊂ Σ, Σ′ ∈ ℝ^{n×d1} and Ω′ ⊂ Ω, Ω′ ∈ ℝ^{n×d2} of word vectors whose words are dictionary translations of each other
• Compute Canonical Correlation Analysis (CCA) on Σ′ and Ω′
• CCA projects these vectors to a third space where they are maximally correlated
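A bare-bones version of this CCA step, written directly with numpy for illustration (a production system would use a tested implementation such as scikit-learn's CCA); the "dictionary translation" vectors below are synthetic:

```python
import numpy as np

def inv_sqrt(C, eps=1e-12):
    # symmetric inverse square root via eigendecomposition
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def cca(X, Y, k):
    # centre both views
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx, Cyy, Cxy = Xc.T @ Xc, Yc.T @ Yc, Xc.T @ Yc
    Sx, Sy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # SVD of the whitened cross-covariance gives the canonical directions
    U, s, Vt = np.linalg.svd(Sx @ Cxy @ Sy)
    return Xc @ (Sx @ U[:, :k]), Yc @ (Sy @ Vt.T[:, :k])

# synthetic pairs: the second view is a rotation of the first
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Px, Py = cca(X, X @ R, k=2)
```

Since the two views are exactly linearly related here, the first canonical pair is almost perfectly correlated.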
-
Learning Bilingual Word Embeddings
• Input:
– e_SL (source language embeddings)
– e_TL (target language embeddings)
– D (seed dictionary)
• Repeat
– W ← LearnMapping(e_SL, e_TL, D)
– D ← LearnDictionary(e_SL, e_TL, W)
• Until stopping criterion is met
Learning bilingual word embeddings with (almost) no bilingual data, Mikel Artetxe, Gorka Labaka, Eneko Agirre, ACL 2017
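The self-learning loop above can be sketched on synthetic embeddings: LearnMapping is orthogonal Procrustes over the current dictionary, and LearnDictionary re-induces translations by cosine nearest neighbour under the mapping. The sizes, seed pairs, and the hidden "true" rotation below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
E_src = rng.standard_normal((n, d))            # source-language embeddings
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
E_tgt = E_src @ R                              # word i translates to word i

def learn_mapping(E_src, E_tgt, D):
    # orthogonal Procrustes on the current dictionary pairs (i, j)
    X = E_src[[i for i, j in D]]
    Z = E_tgt[[j for i, j in D]]
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

def learn_dictionary(E_src, E_tgt, W):
    # induce a new dictionary by cosine nearest neighbour under W
    X = E_src @ W
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Z = E_tgt / np.linalg.norm(E_tgt, axis=1, keepdims=True)
    sims = X @ Z.T
    return [(i, int(sims[i].argmax())) for i in range(len(E_src))]

D = [(i, i) for i in range(6)]                 # tiny seed dictionary
for _ in range(5):
    W = learn_mapping(E_src, E_tgt, D)
    D = learn_dictionary(E_src, E_tgt, W)
```

On this clean toy data the loop recovers the full dictionary; real embeddings are only approximately isometric, which is why the iteration and careful initialization matter.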
-
Learning from Document-Aligned Corpora
English-Hindi-Bengali Document Aligned articles from Wikipedia
• Merge the aligned documents to get a “pseudo-trilingual document”
• Remove the sentence boundaries
• Randomly shuffle the words of the pseudo-trilingual document
• Intuition: each word w, regardless of its actual language, obtains word collocates from both vocabularies
• Train word2vec (skip-gram model) on this document
³Vulić et al., SIGIR 2015
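The pseudo-document construction can be sketched as follows (word2vec training itself is omitted, and the sample sentences are invented):

```python
import random

def pseudo_multilingual(docs):
    # docs: aligned documents, one per language, each a list of sentences
    tokens = [tok for doc in docs for sent in doc for tok in sent.split()]
    random.shuffle(tokens)   # erase sentence boundaries and language order
    return " ".join(tokens)

en = ["the parliament met today"]
hi = ["संसद आज मिली"]
doc = pseudo_multilingual([en, hi])
```

Training skip-gram on `doc` then gives every word context windows drawn from both vocabularies.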
-
Sample Results
-
How to incorporate embeddings for IR
1. Expand the query using embeddings (followed by non-neural IR): add words similar to the query
2. Use IR models that work in the embedding space: centroid distance, word mover's distance
• Data
– Explicit relevance judgments
– Implicit user behaviour, e.g., click-through data
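Option 1, query expansion, might look like this minimal sketch; the embedding table is made up, and the expanded term list would be handed to a conventional retrieval engine:

```python
import numpy as np

# toy embedding table (invented vectors)
emb = {
    "cancer":     np.array([0.90, 0.10, 0.00]),
    "disease":    np.array([0.80, 0.20, 0.10]),
    "tumour":     np.array([0.85, 0.15, 0.05]),
    "parliament": np.array([0.00, 0.90, 0.40]),
}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand(query_terms, k=1):
    # append each term's k nearest embedding neighbours to the query
    expanded = list(query_terms)
    for q in query_terms:
        if q not in emb:
            continue
        sims = sorted(((cos(emb[q], v), w) for w, v in emb.items() if w != q),
                      reverse=True)
        expanded += [w for _, w in sims[:k]]
    return expanded
```

Out-of-vocabulary terms are simply passed through unexpanded, which mirrors the OOV behaviour discussed later.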
-
Application to Multilingual IR
4Bhattacharya et. al., CICLing, 2016
• For OOV words:
– कैंसर → cancer, disease
– स्पीकर → speaker, parliament
• Example query translations:
– भारतीय संसद में आतंकवादी हमला → Indian Parliament constitutional terrorist assault
– आईफ़ोन आईपैड डिज़ाइन लोकप्रियता लांच → iPhone iPad popularity unveiled
• Representation-learning-based models can estimate relevance based on semantic matches with the query
-
Multilingual Clustering using Word Embeddings
• Construct a graph G = (V, E)
– V: vertices are words from the languages
– E: an edge exists if the similarity between two vertices is above a particular threshold
– Edges are weighted by cosine similarity
[Figure: a similarity graph over mixed Hindi and English vocabulary (e.g. throw, hurl, फेंकना, झपटा, wash, सफाईकमी, dart, bags, buckets, splash, toilet, गमले, पॉलिथीन, जूता, कचरों) partitioned into clusters.]
• Use Louvain [Blondel et al., 2008] for cluster detection
– Performs hard clustering
– Runs in O(n log n)
Cluster Formation
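A dependency-free sketch of the graph construction above: cosine similarities above a threshold create edges. Connected components stand in here for Louvain community detection (which the slide actually uses), and the toy vectors are invented.

```python
import numpy as np

# toy cross-lingual embeddings (invented)
words = {
    "throw":  np.array([1.00, 0.10]),
    "hurl":   np.array([0.95, 0.15]),
    "फेंकना":  np.array([0.90, 0.20]),   # "to throw"
    "wash":   np.array([0.10, 1.00]),
    "सफाई":   np.array([0.15, 0.95]),   # "cleaning"
}

def build_graph(words, threshold=0.9):
    # adjacency sets; an edge only if cosine similarity exceeds the threshold
    edges = {}
    keys = list(words)
    for i, u in enumerate(keys):
        for v in keys[i + 1:]:
            a, b = words[u], words[v]
            sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            if sim > threshold:
                edges.setdefault(u, set()).add(v)
                edges.setdefault(v, set()).add(u)
    return edges

def components(words, edges):
    # hard clustering via connected components (stand-in for Louvain)
    seen, clusters = set(), []
    for w in words:
        if w in seen:
            continue
        stack, comp = [w], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(edges.get(x, ()))
        seen |= comp
        clusters.append(comp)
    return clusters
```

On realistic graphs Louvain (e.g. the python-louvain package or networkx community routines) gives finer-grained, modularity-driven clusters than plain components.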
-
Application to Cross-Language IR
• Original query Q in the source language: q1 q2 … qn
• Map each qi to its cluster Ck
• Pick target-language words from Ck
• The query is thus translated to the target language
[Figure: example multilingual clusters, e.g. {flag, emblem, ध्वज, झंडा, वन्दे मातरम, পতাকা, প্রতীক}, {president, secretary, CEO, chairperson, चेयरमैन, सीईओ, अध्यक्ष}, {currency, economics, inflation, prices, money, cost, आर्थिक, मुनाफा, মুদ্রাস্ফীতি, অর্থনীতি, টাকা, খরচ}.]
-
Results of the Clustering Approach
• In this query, “Euro” is related to sports, not economics.
• The no-cluster method wrongly predicts the context and suggests words like “banknotes”.
• Pairwise clustering, on the other hand, understands that “cup” is related to a sport, football to be more specific.
• Multilingual clustering restricts the expansion to a shorter query and hence translates only to “trophy” and “cup”.
-
From Word Embedding to Query/Document Embedding
• Bag of embedded words: sum or average of word vectors
– Effective only for short text
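The bag-of-embedded-words representation is just an average; a tiny sketch with an invented embedding table:

```python
import numpy as np

# toy word embeddings (invented)
emb = {"digital": np.array([1.0, 0.0]),
       "library": np.array([0.0, 1.0]),
       "access":  np.array([1.0, 1.0])}

def text_embedding(text):
    # mean of the word vectors of the in-vocabulary words
    vecs = [emb[w] for w in text.split() if w in emb]
    return np.mean(vecs, axis=0)
```

Averaging washes out word order and long-range structure, which is why it works acceptably for short queries but poorly for long documents.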
-
DSSM: Deep Structured Semantic Model
• Also called Deep Semantic Similarity Model
• Represent query q and document d in a continuous semantic space
• Model semantic similarity between q and d using cosine similarity
• Force the representations
– of relevant (q, d+) pairs to be close in the latent space
– of irrelevant (q, d−) pairs to be far apart in the latent space
-
Deep Structured Semantic Model (DSSM) [Huang et al., 2013]
-
Role of Knowledge Graphs in Information Retrieval
• Knowledge graphs for semantic search – Entities, attributes, types, relations, etc.
• Popular (semi-)structured data sources
– DBpedia
– Freebase
– YAGO
– ConceptNet
– Knowledge graphs of Google and Microsoft
• Use the graph structure for relevance computation
Utilizing Knowledge Bases in Text-centric Information Retrieval, tutorial by Laura Dietz, Alexander Kotov, and Edgar Meij, WSDM 2017
-
Multilingual Knowledge Graph
• Knowledge Graph connects words and phrases of Natural Language with labeled edges.
• Knowledge Graph combined with Word Embedding
• [Speer17] use ConceptNet to build semantic spaces that are more effective than distributional semantics alone.
ConceptNet 5.5: An Open Multilingual Graph of General Knowledge Robert Speer, Joshua Chin, Catherine Havasi, AAAI 2017
-
Access in desired language
-
Access
• Translation from language of object to desired language
• Shortcomings of MT
– Great progress in sentence-level MT
– Does not capture discourse-level translation
– May not work on specific domain verticals not represented in the training set
– Less-resourced language pairs are at a disadvantage
-
Neural Machine Translation
• An encoder processes the source sentence and creates a compact representation
• This representation is the input to the decoder which generates a sequence in the target language
• Both encoder and decoder are RNNs
[Figure: encoder-decoder model: the source sentence enters the encoder; the decoder emits the target sentence.]
• End-to-end training: All parameters are simultaneously optimized
• Distributed representations share strength
• Better exploitation of context
-
Annotation vector h_j = [h→_j ; h←_j], the concatenation of the forward and backward encoder states, giving a set of annotation vectors {h_1, h_2, …, h_T}.
For each target word y_t:
1. Compute an alignment score for each source position: α_{t,j} = f(y_{t−1}, h_j, s_{t−1})
2. Get a context vector: c_t = Σ_j α_{t,j} h_j
[Figure: bidirectional encoder states h_1 … h_T over inputs x_1 … x_T are combined with attention weights α_{t,1} … α_{t,T}, the previous decoder state s_{t−1} and output y_{t−1}, to produce s_t and y_t.]
Attention Model for MT: Learning to Align and Translate jointly
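The two attention steps can be traced numerically; the annotation vectors and decoder state below are made up, and a plain dot product stands in for the learned alignment model f:

```python
import numpy as np

H = np.array([[1.0, 0.0],      # annotation vectors h_1 .. h_3
              [0.0, 1.0],
              [1.0, 1.0]])
s_prev = np.array([1.0, 0.0])  # previous decoder state s_{t-1}

scores = H @ s_prev                           # alignment scores (toy f)
alpha = np.exp(scores) / np.exp(scores).sum() # softmax -> weights alpha_{t,j}
c_t = alpha @ H                               # context c_t = sum_j alpha_{t,j} h_j
```

Positions whose annotations align with the decoder state (here h_1 and h_3) receive larger weights, so the context vector leans toward them.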
-
NMT
-
Continuous Vector Space
• Similar sentences are close in this space
• Multiple dimensions of similarity encoded.
-
Multilingual Machine Translation
• Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation; Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean; Transactions of the Association for Computational Linguistics, Volume 5, 2017
• Multi-Way, Multilingual Neural Machine Translation; Orhan Firat, Kyunghyun Cho, Baskaran Sankaran, Fatos T. Yarman Vural, Yoshua Bengio; Computer Speech and Language, Volume 45, September 2017
-
GNMT
-
How to utilize more data sources
• Multi-lingual: learn from many language pairs?
• SMT-inspired: utilize monolingual data?
• Multi-task: combine seq2seq tasks?
-
Multilingual Translation
• Number of parameters grows linearly w.r.t. number of languages
• Multi-source translation
-
Multilingual Translation with Shared Alignment
• One encoder per source language
• One decoder per target language
• Shared attention mechanism
– Target hidden state + source context vector → attention weight
• No multi-way parallel corpus assumed
– Bilingual sentence pairs only
– Each sentence pair activates/updates one encoder, one decoder, and the shared attention
-
Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System
• Google Neural Machine Translation (GNMT), an end-to-end learning framework that learns from millions of examples – Sep 2016
• GNMT was extended to allow a single system to translate between multiple languages.
• It uses an additional “token” at the beginning of the input sentence to specify the required target language.
• In addition to improving translation quality, this method also enables “Zero-Shot Translation”: translation between language pairs never seen explicitly by the system.
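The token trick is purely a data transformation; a sketch (the token spelling below is illustrative, not Google's exact format):

```python
def add_target_token(sentence, target_lang):
    # prefix an artificial token naming the desired output language;
    # the translation model itself is otherwise unchanged
    return f"<2{target_lang}> {sentence}"
```

Training on many language pairs tagged this way is what lets the shared model generalize to pairs it never saw directly.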
-
Zero-shot
-
One model to learn them all
• Multi-modal, multi-task: Text, speech, image… all converging to a common paradigm.
• If you know how to build a neural MT system, you may easily learn how to build a speech-to-text recognition system...
• Or you may train them together to achieve zero-shot AI.
– Translation without any direct parallel resource
-
Multimodal
-
Multiple Encoder/Decoder Framework
• Use several encoders and decoders for
– different language pairs
– other seq2seq tasks (speech)
– sentence classification tasks (sequence-to-category)
– image captioning (image-to-sequence)
• Force the representation to be identical across all encoders
-
Multilingual systems are currently used to serve 10 of the 16 recently launched language pairs in Google Translate
-
Applications of Multilingual Embeddings: Mine for Parallel Data
• Extract parallel data from large monolingual collections
• Approach
– Embed billions of sentences in same space
– For each sentence in one language, search the k-closest ones in another language
– Decide which sentences are possible translations based on distance: simple threshold, classifier
• NMT Marathon project
– Multilingual embeddings for sentence-level parallel corpora
– Open and highly scalable implementation customized to retrieve nearby sentences
– Target: 17 PB of web crawl data (Internet Archive)
– Start with 55 TB of CommonCrawl
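The mining recipe above can be sketched with toy sentence embeddings: nearest-neighbour search in the shared space plus a simple similarity threshold. All vectors and the threshold value are illustrative.

```python
import numpy as np

# toy sentence embeddings in a shared space (invented)
src = np.array([[0.90, 0.10], [0.10, 0.90]])            # source sentences
tgt = np.array([[0.88, 0.12], [0.50, 0.50], [0.12, 0.88]])  # target sentences

def mine_pairs(src, tgt, threshold=0.98):
    # cosine-normalize both sides
    s = src / np.linalg.norm(src, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = s @ t.T
    pairs = []
    for i in range(len(src)):
        j = int(sims[i].argmax())            # k=1 nearest neighbour
        if sims[i, j] >= threshold:          # keep only confident matches
            pairs.append((i, j))
    return pairs
```

At web-crawl scale the exhaustive similarity matrix is replaced by approximate nearest-neighbour indexing, and the threshold decision may be replaced by a trained classifier.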
-
The Way Forward
• Multilingual Digital Libraries are required to enable Universal Access
• Need for high-quality, high-coverage, robust Language Technologies: translation, text mining, interfaces for Indian languages
• Scarcity of resources for many languages.
• CLIR/MLIA performance depends on the availability of high-quality translation resources and language processing tools
• Finding ways to acquire, maintain and update language tools and resources is a necessity
-
The Way Forward
• Creation of Multilingual knowledge bases and knowledge graphs
• Semantic Web, ontologies, linked data, interoperability
-
A note of Caution
• Managing Expectations for Automatic Processing
-
Thank You !!