-
Multilingual Digital Libraries Multilingual and Crosslingual Retrieval and Access
Sudeshna Sarkar
IIT Kharagpur
UNESCO-NDL INDIA INTERNATIONAL WORKSHOP ON KNOWLEDGE ENGINEERING FOR DIGITAL LIBRARY DESIGN
-
Many languages
World
• 6909 living languages
India
• India has 122 major languages (22 are constitutionally recognized) and 2371 dialects (Census 2001)
• Multiple scripts
• Population of literates
– 20% of India understand English
– 80% cannot
• Rich collection of books and materials in different Indian languages
• The Divide
– The availability of resources, training data, and benchmarks in English leads to a disproportionate focus on a few languages and a neglect of many widely spoken languages
-
Multilingualism and Universal Access
• Promotion and use of multilingualism
• Promotion of linguistic and cultural diversity
• Universal access to cyberspace
• Not all languages are equally represented. The universal digital library should ideally bridge this gap, letting users find objects in languages different from their native one.
-
Diversity of languages in India (percentage of speakers)
• Hindi: 41.03
• Bengali: 8.11
• Telugu: 7.19
• Marathi: 6.99
• Tamil: 5.91
• Urdu: 5.01
• Gujarati: 4.48
• Kannada: 3.69
• Malayalam: 3.21
• Oriya: 3.21
• Punjabi: 2.83
• Assamese: 1.28
• Maithili: 1.18
• Santali: 0.63
• Kashmiri: 0.54
• Nepali: 0.28
• Sindhi: 0.25
• Konkani: 0.24
• Dogri: 0.22
• Manipuri: 0.14
• Bodo: 0.13
• Sanskrit: negligible
• Non-scheduled languages: 3.44
Motto of NDL: Inclusivity Reach every Indian
-
Roles of Language in Digital Libraries
• Language associated with objects
– Metadata
– Content
• Language of interactions
– Such as query
• Interface language
• A multilingual digital library is a digital library that has all functions implemented simultaneously in as many languages as desired, and whose search and retrieval functions are language independent.
• Features
– The user can choose the language of the interfaces
– The user can choose the interaction language
– The user can access metadata and content in any desired language
-
Multilingual Search and Access
• Cross-Language Information Retrieval (CLIR)
– Querying multilingual collections in one language in order to retrieve documents in another language
• Multilingual Information Retrieval (MLIR)
– Process information (queries, documents, both) in multiple languages
• Multilingual Information Access (MLIA)
– Query, retrieval and presentation of information in any language
[Diagram: in CLIR, a query in the source language (SL) retrieves documents and returns results in a target language (TL); in MLIR, a query in language L_i is matched against document collections in many languages (L_1, L_2, …) and returns results across them.]
-
The ultimate multilingual search system
“Given a query in any medium and any language, select relevant items from a multilingual multimedia collection which can be in any medium and any language, and present them in the style or order most likely to be useful to the querier, with identical or near identical objects in different media or languages appropriately identified.”
Douglas W. Oard and David Hull, AAAI Symposium on Cross-Language IR, Spring 1997, Stanford, USA
-
Enabling Multilinguality of Data
Data = content plus metadata
• Language Identification
– Associate language tag with data.
– Automatic language identification.
• Multilingual Indexing and Presentation
– controlled vocabularies for metadata fields such as subject with different values for different languages
– Mapping of words across languages
– Automatic translation of metadata fields such as title and abstract
– Automatic translation of content
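Automatic language identification is often done with simple character n-gram statistics. A minimal, hypothetical sketch (the sample sentences and language tags below are invented for illustration):

```python
# Character-trigram profiles for automatic language identification.
from collections import Counter

def char_ngrams(text, n=3):
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def train_profiles(samples):
    # samples: {language_tag: [sentence, ...]}
    return {lang: sum((char_ngrams(s) for s in sents), Counter())
            for lang, sents in samples.items()}

def identify(text, profiles):
    query = char_ngrams(text)
    # score = number of shared trigram occurrences with each language profile
    def overlap(lang):
        return sum(min(c, profiles[lang][g]) for g, c in query.items())
    return max(profiles, key=overlap)

samples = {
    "en": ["the library holds many books", "search the collection"],
    "fr": ["la bibliothèque contient des livres", "chercher la collection"],
}
profiles = train_profiles(samples)
```

Real identifiers use far more training text and smoothing, but the profile-overlap idea is the same.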
-
Enabling Effective User Interactions
• Enable querying in different languages
• Assist in query formation, query translation and query selection
• User feedback on result examination
• Assistance in Browsing and Data Translation
-
Language Technology
enabling Multilingual Digital Library
-
Normalization and spelling variation
• अगँरेजी, अगँरेजी, अगेँजी, अगेँजी, अगंरेजी, अगंरेजी, अगेंजी, अगेंजी
• अनतरराषरीय, अनतरराषरीय, अनतराररषरय, अनतरारषरीय, अतंरराषरीय, अतंरारषरीय, अतंरराषरीय, अनतराि षरय, अनतराषरीय
• পাখি, পািী • কে াা, কো
Stemming and Lemmatization
• Stemming: different morphological variations of the same word map to the same canonical form; adequate for monolingual search
• Lemmatization: map words to their root dictionary form; needed for effective cross-lingual and multilingual search
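To make the contrast concrete, here is a toy sketch: a crude suffix stripper standing in for a stemmer, and a small lookup table standing in for a lemmatizer. Real systems would use a Snowball-style stemmer or a morphological analyser; all words and rules here are illustrative.

```python
# Crude stemmer: strip a known suffix if enough of the word remains.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

# Toy lemmatizer: dictionary lookup, falling back to the stem.
LEMMA_TABLE = {"went": "go", "better": "good", "libraries": "library"}

def lemmatize(word):
    return LEMMA_TABLE.get(word, stem(word))
```

Note that the stemmer maps "libraries" to the non-word "librar", while the lemmatizer returns the dictionary form "library", which is what cross-lingual matching needs.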
Tokenization, Script rendering
Indexing, Query formation, retrieval, ranking
-
Multilingual Digital Library: A Reality?
• The Divide
– Resource-rich languages: lots of resources, annotated resources
– Resource-poor languages: little or no resources; annotated resources difficult to create
• MT is not yet good enough to translate large texts or to handle specialized technical domains
• But technology and NLP have made huge strides!
– High-quality tools have pushed the accuracy of MT systems
– Tremendous growth in data and resources
-
Neural Networks in NLP
• Neural networks and deep learning are the state of the art in many NLP problems
– Entity and Relation finding
– Parsing
– Machine Translation
– Entity linking
• Transfer learning from resource rich to resource poor languages. Cross-lingual transfer learning enablers
– Shared representation
– Bilingual and multilingual resources
-
What is required for building Multilingual Libraries
Retrieval
• Indexing objects in multiple languages
• Enabling access through query in any language
• Challenges: dictionary coverage, synonymy and polysemy, OOV words
• Language independent indexing
Access
• Enable access in language of familiarity of user
– Machine Translation
-
Representation: Word Embedding
• Continuous vector space representation: a dense vector x_i ∈ ℝ^d is assigned to each word
• Embeddings are learned so that similar words are nearby in the space
• They capture regularities and relationships between words
A key secret sauce for the success of many NLP systems across tasks
Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al., NIPS 2013
-
Multilingual Embeddings
[Figure: English and French words in one shared embedding space, with translation pairs lying close together: children/enfants, money/argent, law/loi, life/vie, world/monde, country/pays, war/guerre, peace/paix, energy/énergie, market/marché.]
• Learn a shared embedding space between words in all languages.
• Many benefits:
– Transfer learning
– Cross lingual information retrieval
– MT
– Parallel corpus extraction
Mikolov et al., 2013; Faruqui & Dyer, 2014; etc.
-
Multilingual word embedding
• Resources:
– Word aligned data
– Sentence aligned data
– Dictionary
– Document Aligned Corpora
Methods:
• Monolingual mapping
– Train monolingual word embeddings and learn linear mapping or CCA
• Pseudo-cross-lingual:
– Train on pseudo-cross-lingual corpus by mixing contexts of different languages
• Cross lingual training
– Train on parallel corpus
• Joint Optimization
– jointly optimise a combination of monolingual and cross-lingual losses.
-
Learning from Parallel Corpora
• Uses monolingual datasets to learn monolingual features
• Uses a sampled bag-of-words from each parallel sentence as the cross-lingual objective
• Learns representations that model the individual languages well
• Encourages representations to be similar for words that are related across the two languages
-
Linear Projection
Learn monolingual embeddings, then project to a common space using a dictionary by solving

  min_W Σ_{i=1}^{n} ‖W h_i − e_i‖²

Final embedding of a source word h_i = W h_i

¹Mikolov et al., Exploiting Similarities among Languages for Machine Translation
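The projection step can be sketched with synthetic data: solve min_W Σ_i ‖W h_i − e_i‖² by ordinary least squares, then map source vectors into the target space. The matrices below are random stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.standard_normal((4, 4))   # hidden "true" mapping
H = rng.standard_normal((50, 4))       # source-language embeddings h_i
E = H @ W_true.T                       # their target-language translations e_i

# Least squares: H @ W.T ~= E, so W.T = lstsq(H, E)
W = np.linalg.lstsq(H, E, rcond=None)[0].T

def project(h):
    # final embedding of a source word h is W h
    return W @ h
```

With real, noisy dictionaries the fit is approximate rather than exact, but the solve is identical.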
-
Cross-lingual word vector projection using CCA
• Take subsets Σ′ ⊂ Σ, Σ′ ∈ ℝ^{n×d1} and Ω′ ⊂ Ω, Ω′ ∈ ℝ^{n×d2} of word vectors whose words are dictionary translations of each other
• Compute Canonical Correlation Analysis (CCA) on Σ′ and Ω′
• CCA projects these vectors to a third space where they are maximally correlated
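A bare-bones version of this CCA step, written directly with numpy for illustration (a production system would use a tested implementation such as scikit-learn's CCA); the "dictionary translation" vectors below are synthetic:

```python
import numpy as np

def inv_sqrt(C, eps=1e-12):
    # symmetric inverse square root via eigendecomposition
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def cca(X, Y, k):
    # centre both views
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx, Cyy, Cxy = Xc.T @ Xc, Yc.T @ Yc, Xc.T @ Yc
    Sx, Sy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # SVD of the whitened cross-covariance gives the canonical directions
    U, s, Vt = np.linalg.svd(Sx @ Cxy @ Sy)
    return Xc @ (Sx @ U[:, :k]), Yc @ (Sy @ Vt.T[:, :k])

# synthetic pairs: the second view is a rotation of the first
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Px, Py = cca(X, X @ R, k=2)
```

Since the two views are exactly linearly related here, the first canonical pair is almost perfectly correlated.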
-
Learning Bilingual Word Embeddings
• Input:
– e_SL (source language embeddings)
– e_TL (target language embeddings)
– D (seed dictionary)
• Repeat
– W ← LearnMapping(e_SL, e_TL, D)
– D ← LearnDictionary(e_SL, e_TL, W)
• Until stopping criterion is met
Learning bilingual word embeddings with (almost) no bilingual data, Mikel Artetxe, Gorka Labaka, Eneko Agirre, ACL 2017
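The self-learning loop above can be sketched on synthetic embeddings: LearnMapping is orthogonal Procrustes over the current dictionary, and LearnDictionary re-induces translations by cosine nearest neighbour under the mapping. The sizes, seed pairs, and the hidden "true" rotation below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
E_src = rng.standard_normal((n, d))            # source-language embeddings
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
E_tgt = E_src @ R                              # word i translates to word i

def learn_mapping(E_src, E_tgt, D):
    # orthogonal Procrustes on the current dictionary pairs (i, j)
    X = E_src[[i for i, j in D]]
    Z = E_tgt[[j for i, j in D]]
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

def learn_dictionary(E_src, E_tgt, W):
    # induce a new dictionary by cosine nearest neighbour under W
    X = E_src @ W
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Z = E_tgt / np.linalg.norm(E_tgt, axis=1, keepdims=True)
    sims = X @ Z.T
    return [(i, int(sims[i].argmax())) for i in range(len(E_src))]

D = [(i, i) for i in range(6)]                 # tiny seed dictionary
for _ in range(5):
    W = learn_mapping(E_src, E_tgt, D)
    D = learn_dictionary(E_src, E_tgt, W)
```

On this clean toy data the loop recovers the full dictionary; real embeddings are only approximately isometric, which is why the iteration and careful initialization matter.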
-
Learning from Document-Aligned Corpora
English-Hindi-Bengali Document Aligned articles from Wikipedia
• Merge the aligned documents to get a “pseudo-trilingual document”
• Remove the sentence boundaries
• Randomly shuffle the words of the pseudo-trilingual document
• Intuition: each word w, regardless of its actual language, obtains word collocates from both vocabularies
• Train word2vec (skip-gram model) on this document
³Vulić et al., SIGIR 2015
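The pseudo-document construction can be sketched as follows (word2vec training itself is omitted, and the sample sentences are invented):

```python
import random

def pseudo_multilingual(docs):
    # docs: aligned documents, one per language, each a list of sentences
    tokens = [tok for doc in docs for sent in doc for tok in sent.split()]
    random.shuffle(tokens)   # erase sentence boundaries and language order
    return " ".join(tokens)

en = ["the parliament met today"]
hi = ["संसद आज मिली"]
doc = pseudo_multilingual([en, hi])
```

Training skip-gram on `doc` then gives every word context windows drawn from both vocabularies.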
-
Sample Results
-
How to incorporate embeddings for IR
1. Expand the query using embeddings (followed by non-neural IR): add words similar to the query
2. Use IR models that work in the embedding space: centroid distance, word mover's distance
• Data
– Explicit relevance judgments
– Implicit user behaviour, e.g., click-through data
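Option 1, query expansion, might look like this minimal sketch; the embedding table is made up, and the expanded term list would be handed to a conventional retrieval engine:

```python
import numpy as np

# toy embedding table (invented vectors)
emb = {
    "cancer":     np.array([0.90, 0.10, 0.00]),
    "disease":    np.array([0.80, 0.20, 0.10]),
    "tumour":     np.array([0.85, 0.15, 0.05]),
    "parliament": np.array([0.00, 0.90, 0.40]),
}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand(query_terms, k=1):
    # append each term's k nearest embedding neighbours to the query
    expanded = list(query_terms)
    for q in query_terms:
        if q not in emb:
            continue
        sims = sorted(((cos(emb[q], v), w) for w, v in emb.items() if w != q),
                      reverse=True)
        expanded += [w for _, w in sims[:k]]
    return expanded
```

Out-of-vocabulary terms are simply passed through unexpanded, which mirrors the OOV behaviour discussed later.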
-
Application to Multilingual IR
4Bhattacharya et. al., CICLing, 2016
• For OOV words:
– कैंसर → cancer, disease
– स्पीकर → speaker, parliament
• Example query translations:
– भारतीय संसद में आतंकवादी हमला → Indian Parliament constitutional terrorist assault
– आईफ़ोन आईपैड डिज़ाइन लोकप्रियता लांच → iPhone iPad popularity unveiled
• Representation-learning-based models can estimate relevance based on semantic matches with the query
-
Multilingual Clustering using Word Embeddings
• Construct a graph G = (V, E)
– V: vertices are words from the languages
– E: an edge exists if the similarity between two vertices is above a particular threshold
– Edges are weighted by cosine similarity
[Figure: a similarity graph over mixed Hindi and English vocabulary (e.g. throw, hurl, फेंकना, झपटा, wash, सफाईकमी, dart, bags, buckets, splash, toilet, गमले, पॉलिथीन, जूता, कचरों) partitioned into clusters.]
• Use Louvain [Blondel et al., 2008] for cluster detection
– Performs hard clustering
– Runs in O(n log n)
Cluster Formation
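A dependency-free sketch of the graph construction above: cosine similarities above a threshold create edges. Connected components stand in here for Louvain community detection (which the slide actually uses), and the toy vectors are invented.

```python
import numpy as np

# toy cross-lingual embeddings (invented)
words = {
    "throw":  np.array([1.00, 0.10]),
    "hurl":   np.array([0.95, 0.15]),
    "फेंकना":  np.array([0.90, 0.20]),   # "to throw"
    "wash":   np.array([0.10, 1.00]),
    "सफाई":   np.array([0.15, 0.95]),   # "cleaning"
}

def build_graph(words, threshold=0.9):
    # adjacency sets; an edge only if cosine similarity exceeds the threshold
    edges = {}
    keys = list(words)
    for i, u in enumerate(keys):
        for v in keys[i + 1:]:
            a, b = words[u], words[v]
            sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            if sim > threshold:
                edges.setdefault(u, set()).add(v)
                edges.setdefault(v, set()).add(u)
    return edges

def components(words, edges):
    # hard clustering via connected components (stand-in for Louvain)
    seen, clusters = set(), []
    for w in words:
        if w in seen:
            continue
        stack, comp = [w], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(edges.get(x, ()))
        seen |= comp
        clusters.append(comp)
    return clusters
```

On realistic graphs Louvain (e.g. the python-louvain package or networkx community routines) gives finer-grained, modularity-driven clusters than plain components.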
-
Application to Cross-Language IR
• Original query Q in the source language: q1 q2 … qn
• Map each qi to its cluster Ck
• Pick target-language words from Ck
• The query is thus translated to the target language
[Figure: example multilingual clusters, e.g. {flag, emblem, ध्वज, झंडा, वन्दे मातरम, পতাকা, প্রতীক}, {president, secretary, CEO, chairperson, चेयरमैन, सीईओ, अध्यक्ष}, {currency, economics, inflation, prices, money, cost, आर्थिक, मुनाफा, মুদ্রাস্ফীতি, অর্থনীতি, টাকা, খরচ}.]
-
Results of the Clustering Approach
• In this query, “Euro” is related to sports, not economics.
• The no-cluster method wrongly predicts the context and suggests words like “banknotes”.
• Pairwise clustering, on the other hand, understands that “cup” is related to a sport, football to be more specific.
• Multilingual clustering restricts the expansion to a shorter query and hence translates only to “trophy” and “cup”.
-
From Word Embedding to Query/Document Embedding
• Bag of embedded words: sum or average of word vectors
– Effective only for short text
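The bag-of-embedded-words representation is just an average; a tiny sketch with an invented embedding table:

```python
import numpy as np

# toy word embeddings (invented)
emb = {"digital": np.array([1.0, 0.0]),
       "library": np.array([0.0, 1.0]),
       "access":  np.array([1.0, 1.0])}

def text_embedding(text):
    # mean of the word vectors of the in-vocabulary words
    vecs = [emb[w] for w in text.split() if w in emb]
    return np.mean(vecs, axis=0)
```

Averaging washes out word order and long-range structure, which is why it works acceptably for short queries but poorly for long documents.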
-
DSSM: Deep Structured Semantic Model
• Also called Deep Semantic Similarity Model
• Represent query q and document d in a continuous semantic space
• Model semantic similarity between q and d using cosine similarity
• Force the representations
– of relevant (q, d+) pairs to be close in the latent space
– of irrelevant (q, d−) pairs to be far apart in the latent space
-
Deep Structured Semantic Model (DSSM) [Huang et al., 2013]
-
Role of Knowledge Graphs in Information Retrieval
• Knowledge graphs for semantic search – Entities, attributes, types, relations, etc.
• Popular (semi-)structured data sources
– DBpedia
– Freebase
– YAGO
– ConceptNet
– Knowledge graphs of Google and Microsoft
• Use the graph structure for relevance computation
Utilizing Knowledge Bases in Text-centric Information Retrieval, tutorial by Laura Dietz, Alexander Kotov, and Edgar Meij, WSDM 2017
-
Multilingual Knowledge Graph
• Knowledge Graph connects words and phrases of Natural Language with labeled edges.
• Knowledge Graph combined with Word Embedding
• [Speer17] use ConceptNet to build semantic spaces that are more effective than distributional semantics alone.
ConceptNet 5.5: An Open Multilingual Graph of General Knowledge Robert Speer, Joshua Chin, Catherine Havasi, AAAI 2017
-
Access in desired language
-
Access
• Translation from language of object to desired language
• Shortcomings of MT
– Great progress in sentence-level MT
– Does not capture discourse-level translation
– May not work on specific domain verticals not represented in the training set
– Less-resourced language pairs are at a disadvantage
-
Neural Machine Translation
• An encoder processes the source sentence and creates a compact representation
• This representation is the input to the decoder which generates a sequence in the target language
• Both encoder and decoder are RNNs
[Figure: encoder-decoder model: the source sentence enters the encoder; the decoder emits the target sentence.]
• End-to-end training: All parameters are simultaneously optimized
• Distributed representations share strength
• Better exploitation of context
-
Annotation vector h_j = [h→_j ; h←_j], the concatenation of the forward and backward encoder states, giving a set of annotation vectors {h_1, h_2, …, h_T}.
For each target word y_t:
1. Compute an alignment score for each source position: α_{t,j} = f(y_{t−1}, h_j, s_{t−1})
2. Get a context vector: c_t = Σ_j α_{t,j} h_j
[Figure: bidirectional encoder states h_1 … h_T over inputs x_1 … x_T are combined with attention weights α_{t,1} … α_{t,T}, the previous decoder state s_{t−1} and output y_{t−1}, to produce s_t and y_t.]
Attention Model for MT: Learning to Align and Translate jointly
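The two attention steps can be traced numerically; the annotation vectors and decoder state below are made up, and a plain dot product stands in for the learned alignment model f:

```python
import numpy as np

H = np.array([[1.0, 0.0],      # annotation vectors h_1 .. h_3
              [0.0, 1.0],
              [1.0, 1.0]])
s_prev = np.array([1.0, 0.0])  # previous decoder state s_{t-1}

scores = H @ s_prev                           # alignment scores (toy f)
alpha = np.exp(scores) / np.exp(scores).sum() # softmax -> weights alpha_{t,j}
c_t = alpha @ H                               # context c_t = sum_j alpha_{t,j} h_j
```

Positions whose annotations align with the decoder state (here h_1 and h_3) receive larger weights, so the context vector leans toward them.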
-
NMT
-
Continuous Vector Space
• Similar sentences are close in this space
• Multiple dimensions of similarity encoded.
-
Multilingual Machine Translation
• Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation; Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean; Transactions of the Association for Computational Linguistics, Volume 5, 2017
• Multi-Way, Multilingual Neural Machine Translation; Orhan Firat, Kyunghyun Cho, Baskaran Sankaran, Fatos T. Yarman Vural, Yoshua Bengio; Computer Speech and Language, Volume 45, September 2017
-
GNMT
-
How to utilize more data sources
• Multi-lingual: learn from many language pairs?
• SMT-inspired: utilize monolingual data?
• Multi-task: combine seq2seq tasks?
-
Multilingual Translation
• Number of parameters grows linearly w.r.t. number of languages
• Multi-source translation
-
Multilingual Translation with Shared Alignment
• One encoder per source language
• One decoder per target language
• Shared attention mechanism
– Target hidden state + source context vector → attention weight
• No multi-way parallel corpus assumed
– Bilingual sentence pairs only
– Each sentence pair activates/updates one encoder, one decoder, and the shared attention
-
Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System
• Google Neural Machine Translation (GNMT), an end-to-end learning framework that learns from millions of examples – Sep 2016
• GNMT was extended to allow a single system to translate between multiple languages.
• It uses an additional “token” at the beginning of the input sentence to specify the required target language.
• In addition to improving translation quality, this method also enables “Zero-Shot Translation”: translation between language pairs never seen explicitly by the system.
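The token trick is purely a data transformation; a sketch (the token spelling below is illustrative, not Google's exact format):

```python
def add_target_token(sentence, target_lang):
    # prefix an artificial token naming the desired output language;
    # the translation model itself is otherwise unchanged
    return f"<2{target_lang}> {sentence}"
```

Training on many language pairs tagged this way is what lets the shared model generalize to pairs it never saw directly.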
-
Zero-shot
-
One model to learn them all
• Multi-modal, multi-task: Text, speech, image… all converging to a common paradigm.
• If you know how to build a neural MT system, you may easily learn how to build a speech-to-text recognition system...
• Or you may train them together to achieve zero-shot AI.
– Translation without any direct parallel resource
-
Multimodal
-
Multiple Encoder/Decoder Framework
• Use several encoders and decoders for
– different language pairs
– other seq2seq tasks (speech)
– sentence classification tasks (sequence-to-category)
– image captioning (image-to-sequence)
• Force the representation to be identical across all encoders
-
Multilingual systems are currently used to serve 10 of the 16 recently launched language pairs in Google Translate
-
Applications of Multilingual Embeddings: Mine for Parallel Data
• Extract parallel data from large monolingual collections
• Approach
– Embed billions of sentences in same space
– For each sentence in one language, search the k-closest ones in another language
– Decide which sentences are possible translations based on distance: simple threshold, classifier
• NMT Marathon project
– Multilingual embeddings for sentence-level parallel corpora
– Open and highly scalable implementation customized to retrieve nearby sentences
– Target: 17 PB of web crawl data (Internet Archive)
– Start with 55 TB of CommonCrawl
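The mining recipe above can be sketched with toy sentence embeddings: nearest-neighbour search in the shared space plus a simple similarity threshold. All vectors and the threshold value are illustrative.

```python
import numpy as np

# toy sentence embeddings in a shared space (invented)
src = np.array([[0.90, 0.10], [0.10, 0.90]])            # source sentences
tgt = np.array([[0.88, 0.12], [0.50, 0.50], [0.12, 0.88]])  # target sentences

def mine_pairs(src, tgt, threshold=0.98):
    # cosine-normalize both sides
    s = src / np.linalg.norm(src, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = s @ t.T
    pairs = []
    for i in range(len(src)):
        j = int(sims[i].argmax())            # k=1 nearest neighbour
        if sims[i, j] >= threshold:          # keep only confident matches
            pairs.append((i, j))
    return pairs
```

At web-crawl scale the exhaustive similarity matrix is replaced by approximate nearest-neighbour indexing, and the threshold decision may be replaced by a trained classifier.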
-
The Way Forward
• Multilingual Digital Libraries are required to enable Universal Access
• Need for high-quality, high-coverage, robust Language Technologies: translation, text mining, interfaces for Indian languages
• Scarcity of resources for many languages.
• CLIR/MLIA performance depends on the availability of high-quality translation resources and language processing tools
• Finding ways to acquire, maintain and update language tools and resources is a necessity
-
The Way Forward
• Creation of Multilingual knowledge bases and knowledge graphs
• Semantic Web, ontologies, linked data, interoperability
-
A note of Caution
• Managing Expectations for Automatic Processing
-
Thank You !!