Towards Automatic Normalization of the Moroccan Dialectal User Generated Text · 2019. 10. 28.

Ridouane Tachicart & Karim Bouzoubaa, Arabic Language Engineering and Learning Modeling (ALELM) Lab, Mohammed V University in Rabat, Morocco. ICALP 2019.

Page 1

Towards Automatic Normalization of the Moroccan Dialectal User Generated Text

Ridouane Tachicart & Karim Bouzoubaa

Arabic Language Engineering and Learning Modeling – ALELM Lab

Mohammed V University in Rabat - Morocco

ICALP 2019

Page 2: Motivation

❑ Rich information: social media provides rich information about human interaction and collective behavior

❑ Social media use: usage is increasing, with 17M active users in Morocco

❑ User time: users spend much time on social media platforms (3 hours per day on average)

❑ Huge amount of text: a huge amount of user-generated text (UGT) is shared

❑ Moroccan Arabic text: 73% of Moroccan UGT is written in Moroccan Arabic (chart labels: Moroccan Arabic text, others)

❑ Market trend: based on the content of online news articles, sentiments, and events

NLP opportunities: several opportunities to understand consumers through text analysis (promoting products, reaching potential consumers, …)

Page 3: Problem Description

How can Moroccan social media text be examined automatically in order to generate new and useful information?

[Diagram labels: social media text; Preprocessing; NLP apps (machine translation, sentiment analysis, morphological analysis, topic detection)]

Page 4: Problems

❑ Noisy data *

❑ Spelling normalization

❑ Arabizi

* 37% of Moroccan user-generated text is noisy. Tachicart & Bouzoubaa, 2019. An Empirical Analysis of Moroccan Dialectal User-Generated Text. ICCCI, Hendaye 2019.

Page 5: Existing spelling normalization works

System | Authors | Description | Claimed accuracy
CODA | Habash et al. | Writing standards for Arabic dialects | -
CODAFy | Eskander et al. | (tool) converts EGY to CODA | -
Tuni CODA | Boujelbane et al. | (tool) converts TUN to CODA | 86%
MADARi | Obeid et al. | (tool) annotation & spelling correction of GULF | -
UGT | Afli et al. | (tool) error correction system for Arabic UGT machine translation | 68%

Different solutions have been proposed for spelling inconsistency.

They cannot be extended to Moroccan Arabic.

Moroccan Arabic has not been targeted yet.

Page 6: Proposed Approach

[Pipeline diagram labels: Language Identification; MDA? / NO; Preprocessing; Spelling Normalization; Moroccan vocabulary; Morphological Analysis; Rule based Approach; Normalized Text; downstream applications (machine translation, sentiment analysis, morphological disambiguation, topic detection)]

Page 7: Moroccan Arabic Morphology

Word structure: Proclitic + Prefix + Lemma + Suffix + Enclitic

Example word: وماكانخدموهاش ("We do not process it")

[Diagram labels: Stem, Word]

Page 8: Concatenative Morphology and Templatic Morphology

Concatenative morphology: morpheme كان + lemma خدم + morpheme وها

Templatic morphology: root خدم + pattern كانفعلوها = كانخدموها
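The two constructions on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `concatenate` and `apply_pattern` are hypothetical, and the sketch assumes that the placeholder letters ف ع ل stand for the three root consonants and do not occur as literal letters elsewhere in the pattern.

```python
def concatenate(leading_morpheme, lemma, trailing_morpheme):
    """Concatenative view: the word is the plain concatenation of its parts."""
    return leading_morpheme + lemma + trailing_morpheme


def apply_pattern(root, pattern, slots="فعل"):
    """Templatic view: substitute the root consonants for the slot letters.

    Assumption: every occurrence of a slot letter in the pattern is a
    placeholder, never a literal letter of the pattern itself.
    """
    if len(root) != len(slots):
        raise ValueError("root and slots must have the same length")
    mapping = dict(zip(slots, root))
    return "".join(mapping.get(ch, ch) for ch in pattern)


# Values taken from the slide.
concatenated = concatenate("كان", "خدم", "وها")
templated = apply_pattern("خدم", "كانفعلوها")
print(concatenated == templated)
```

Both calls build the same surface form from the slide, which is why the same word can be described under either view.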

Page 9: Proposed Approach

[Diagram: the processing pipeline from Page 6, repeated, shown alongside the concatenative morphology example: morpheme كان + lemma خدم + morpheme وها]

Page 10: Moroccan Reference Vocabulary

Inputs: affixes + clitics, lexicon of lemmas

Rules: concatenation, orthography, compatibility

Generator output: MRV (Moroccan Reference Vocabulary), 4.5M words
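A minimal sketch of how such a generator might combine a lemma lexicon with clitics under compatibility rules is given below. It is an assumption-laden toy, not the authors' generator: the names `generate_vocabulary` and `toy_compatible`, the three-entry clitic lists, and the two-lemma lexicon are all hypothetical, prefixes and suffixes are left out for brevity, and the orthography rule defaults to the identity function.

```python
from itertools import product


def generate_vocabulary(lemmas, proclitics, enclitics, compatible,
                        orthography=lambda form: form):
    """Build surface forms as proclitic + lemma + enclitic.

    lemmas: dict mapping each lemma to its part of speech.
    compatible: predicate deciding whether a combination is allowed.
    orthography: hook for spelling adjustments (identity by default).
    """
    vocabulary = set()
    for (lemma, pos), proclitic, enclitic in product(
            lemmas.items(), proclitics, enclitics):
        if compatible(proclitic, pos, enclitic):
            vocabulary.add(orthography(proclitic + lemma + enclitic))
    return vocabulary


def toy_compatible(proclitic, pos, enclitic):
    """Toy rule: the definite article attaches only to nouns and
    excludes a pronoun enclitic."""
    if proclitic == "ال":
        return pos == "noun" and enclitic == ""
    return True


# Hypothetical toy lexicon and clitic lists ("" means no clitic).
toy_lemmas = {"خدم": "verb", "كتاب": "noun"}
toy_vocabulary = generate_vocabulary(
    toy_lemmas, ["", "و", "ال"], ["", "ha".replace("ha", "ها")], toy_compatible)
print(len(toy_vocabulary))
```

The compatibility predicate is where most of the linguistic knowledge would live; scaling the same loop to a full lexicon and full affix and clitic inventories is what makes a multi-million-word vocabulary plausible.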

Page 11: Proposed Approach

[Diagram: the processing pipeline from Page 6, repeated]

Page 12: Comparison & Results

[Diagram labels: UGT (700k words); vocabularies 1 (Moroccan vocabulary) and 2 (Moroccan + MSA + NE vocabulary); settings A, B, C; automatic noise removal; manual normalization]

Page 13: Spelling Normalization

[Flowchart labels: Input word; Noise removal; Cleaned word; Lookup algorithm against the Moroccan + MSA + NE vocabulary; decision (yes / no); Levenshtein distance measure; Candidate words]

Spelling Normalizer web interface
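The flowchart can be read as the following sketch, under stated assumptions: that the Levenshtein step runs only when the lookup fails, that candidates are ranked by edit distance, and that noise removal means stripping tatweel, collapsing character elongation, and dropping non-letter characters. The function names, the `max_distance` threshold, and the toy vocabulary are hypothetical and are not taken from the slides.

```python
import re


def remove_noise(word):
    """Hypothetical cleaning step: strip tatweel, collapse runs of three or
    more identical characters, and drop non-letter characters."""
    word = word.replace("\u0640", "")           # tatweel (kashida)
    word = re.sub(r"(.)\1{2,}", r"\1", word)    # elongation such as "aaaa"
    return re.sub(r"[\W\d_]+", "", word)        # punctuation, digits, symbols


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def propose_candidates(word, vocabulary, max_distance=1):
    """Return the cleaned word if it is in the vocabulary, otherwise the
    vocabulary entries within max_distance, closest first."""
    cleaned = remove_noise(word)
    if cleaned in vocabulary:
        return [cleaned]
    scored = sorted((levenshtein(cleaned, entry), entry) for entry in vocabulary)
    return [entry for distance, entry in scored if distance <= max_distance]


toy_vocabulary = {"كتاب", "مكتب", "كاتب"}      # hypothetical toy vocabulary
print(propose_candidates("كتااااب!!", toy_vocabulary))
print(propose_candidates("كتب", toy_vocabulary))
```

A linear scan over the vocabulary is fine for a toy set; against a vocabulary of millions of entries, an index (for example by word length or a trie) would be needed to keep the distance step tractable.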

Page 14: Spelling Normalization Evaluation

Test corpus: 400 sentences = 3682 words

Task | Recall | Precision | F-measure
Proposing candidates | 50% | 69% | 58%
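As a quick consistency check on the reported figures, the sketch below recomputes the F-measure from the precision and recall on the slide. It assumes the slide reports the balanced F-measure (F1, the harmonic mean of precision and recall); the function name `f_measure` is ours.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision <= 0 and recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


reported_precision = 0.69   # from the slide
reported_recall = 0.50      # from the slide
f1 = f_measure(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")
```

The result rounds to 58%, which matches the F-measure in the table.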

Page 15: Summary

❑ Investigated the problem of text normalization

❑ Formalized the problem as a task of UGT standardization that involves the building of a reference vocabulary

❑ Proposed an approach to spelling normalization

❑ Spelling Normalization is a crucial stage towards processing the MDA UGT


Page 16: Perspectives

[Diagram labels: Language Identification; MDA? / NO; Preprocessing; Spelling Normalization (Arabic + Arabizi); Moroccan vocabulary; Morphological Analysis and disambiguation; Morphological Analysis; Normalized Text; Resources (Normalized corpus)]

Page 17: Questions

Page 18: Demo


Page 21: Concatenation constraints

Page 22: Affixes & clitics

Page 24

Word | Gloss
كتب | to write
كتاب | a book
مكتب | an office
كاتب | a writer
كتب | books

affixes + clitics


Page 26: Vocabulary Evaluation

Metric | Verbs | Nouns | Particles
# words | 932 | 1453 | 615
OOV | 11% | 23% | 7%
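For readers unfamiliar with the metric, an out-of-vocabulary (OOV) rate of this kind could be computed as sketched below. This is an assumption about the procedure, not a description of the authors' evaluation: the function name `oov_rate` is hypothetical, the rate is taken over the listed test words, and the word list and vocabulary are toy data.

```python
def oov_rate(words, vocabulary):
    """Share of test words that are missing from the reference vocabulary."""
    if not words:
        raise ValueError("words must not be empty")
    missing = sum(1 for word in words if word not in vocabulary)
    return missing / len(words)


# Hypothetical toy data, not the evaluation sets from the slide.
toy_vocabulary = {"كتب", "كتاب", "مكتب"}
toy_words = ["كتب", "كتاب", "مكتب", "كاتب"]
toy_rate = oov_rate(toy_words, toy_vocabulary)
print(f"{toy_rate:.0%}")
```

Running the same function separately on verbs, nouns, and particles would give one rate per category, as in the table.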