A call to action. Dr. Vukosi Marivate ... - USAf
![Page 1: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/1.jpg)
A call to action. Using data science in the advancement of African
Languages.
Dr. Vukosi Marivate & Collaborators
ABSA UP Chair of Data Science
Dept. of Computer Science, UP
![Page 2: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/2.jpg)
Overview
▪ Setting the scene [Why do we care?]
▪ How did we get here?
▪ How can we tackle these challenges?
▪ Some results and future work
![Page 3: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/3.jpg)
Language, AI and ML
Artificial Intelligence (AI) & Machine Learning (ML)
● Text and language are a rich interface to share information and interact with machines.
● We need to ask ourselves a few questions.
  ○ How do machines process language information?
  ○ Why is local language important?
![Page 4: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/4.jpg)
Brief overview of AI/ML
Artificial Intelligence (AI) & Machine Learning (ML)
Russell and Norvig; Manning
● AI
  ○ Machine
  ○ In an environment (can perceive it)
  ○ Perform Actions
  ○ Reach a Goal
● ML
  ○ Learning patterns from data
● Natural Language Processing (NLP)
  ○ Learning language tasks
![Page 5: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/5.jpg)
Challenges with African Languages
Use Case: South Africa
● Lack of sufficient language resources.
● Inequality in data availability.
● Rare to find annotated datasets (for different NLP tasks) publicly.
● How can we innovate in collection, curation, annotation and classification?
![Page 6: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/6.jpg)
Why Data is Important
Where is the data?
![Page 7: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/7.jpg)
A framework to understand the challenge
Martinus and Abbott 2019 https://arxiv.org/pdf/1906.05685.pdf
● Low Availability
● Discoverability
● Focus
● Reproducibility and Benchmarks
Languages world map [Wikimedia]
![Page 8: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/8.jpg)
Importance of indigenous languages
UNESCO and DSI reporting on indigenous languages
● What does language capture?
  ○ Indigenous knowledge
  ○ Culture
● How did we get here?
  ○ Inequality of language
  ○ Colonial legacies
  ○ Move to monolingualism
  ○ Lack of Data
  ○ Who develops the systems
![Page 9: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/9.jpg)
Importance of indigenous languages
UNESCO and DSI reporting on indigenous languages
● Internet is becoming more and more monolingual.
● How do we increase access to local populations?
● How
![Page 10: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/10.jpg)
So what are we going to do about it?
Some thoughts on
Expand the current community of practice across the African Continent and Global South
Innovate on what has come before!
● We have great new tools in ML/DL, exploit them.
● Expand data availability and gathering.
![Page 11: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/11.jpg)
Questioning the Status Quo
● Civil Disobedience?
● Can we afford to wait longer?
● Tapping into our youth.
![Page 12: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/12.jpg)
How do we move forward
Martinus and Abbott 2019
● Better public understanding
● Collect, collate and annotate data
● Expanding practice and skill - Building community
![Page 13: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/13.jpg)
Current and Future Directions
▪ More data collection, curation and annotation
▪ See AI4D and Lacuna Data Set Challenges
▪ Ongoing research at DSFSI
  ▪ Enhancement of ML pipelines for low resource scenarios
  ▪ New models for augmentation
  ▪ Dataset Curation & Pretrained Models (Masakhane isiZulu)
  ▪ Masakhane Web Tools
  ▪ Teaching NLP https://dsfsi.github.io/cos802/
▪ Building on our foundations
  ▪ Masakhane Community - https://www.masakhane.io/
  ▪ Sauti-Yetu Unconference - 10 October 2020 - https://sites.google.com/view/sautiyetu-nlp/
![Page 14: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/14.jpg)
Low resource language dataset creation, curation
and classification: Setswana and Sepedi
Vukosi Marivate Tshephisho Sefara Vongani Chabalala
Keamogetswe Makhaya Tumisho Mokgonyane
Rethabile Mokoena Abiodun Modupe
Moseli Motsoeli
Masakhane Community
![Page 15: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/15.jpg)
Idea 1: Use National Broadcaster as Resource
South African Broadcasting Corporation [SABC]
● SABC is South Africa's state broadcaster
  ○ 19 Radio Stations
  ○ 5 TV Channels
  ○ Online digital news.
● Currently publishes digital news only in English.
● Radio stations in all 11 official languages [scripts exist, not public]
● Idea: Get headlines from Radio Facebook Pages, annotated for category classification
SEPEDI [nso]
SETSWANA [tn]
![Page 16: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/16.jpg)
Idea 1: Use National Broadcaster as Resource
SEPEDI [nso]
SETSWANA [tn]
Example Setswana Data and Annotations
Datasets available: https://zenodo.org/record/3668495
![Page 17: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/17.jpg)
Idea 1: Use National Broadcaster as Resource
SEPEDI [nso]
SETSWANA [tn]
Datasets available: https://zenodo.org/record/3668495
![Page 18: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/18.jpg)
Idea 2: Train pre-trained vectorisers
We can get some data for all 11 South African Languages
Embeddings available: https://zenodo.org/record/3668481 (Being updated)
● Sources of local language data
  ○ Wikipedia
  ○ JW300
  ○ Bible
  ○ South African Constitution
  ○ SADiLaR
![Page 19: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/19.jpg)
Idea 2: Train pre-trained vectorisers
We can get some data for all 11 South African Languages
Embeddings available: https://zenodo.org/record/3668481 (Being updated)
● Sources of local language data
  ○ Wikipedia
  ○ JW300
  ○ Bible
  ○ South African Constitution
  ○ SADiLaR
● Train traditional vectorisers
● Train word embeddings [Word2Vec]
● Train sentence embeddings [Doc2Vec]
● Useful for downstream tasks
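As a rough illustration of what a "traditional vectoriser" does, here is a minimal TF-IDF sketch in plain Python. It is not the code behind the released embeddings: the whitespace tokenisation, the smoothed IDF formula, the function names `fit_idf` and `tfidf_vector`, and the English placeholder headlines are all assumptions made for this example.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every token."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document.
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: log((1 + N) / (1 + df)) + 1
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1
            for tok, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Map one document to a sparse {token: weight} TF-IDF vector."""
    # Tokens never seen while fitting are dropped.
    counts = Counter(tok for tok in doc.lower().split() if tok in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: (n / total) * idf[tok] for tok, n in counts.items()}


# Placeholder English headlines, invented for illustration only.
corpus = [
    "president opens new clinic",
    "team wins cup final",
    "new clinic opens in town",
]
idf = fit_idf(corpus)
vec = tfidf_vector("new clinic wins award", idf)
```

In `vec`, the unseen token "award" is dropped, and "wins" (which appears in one headline) gets a higher weight than "new" (which appears in two). Vectors like these can then feed a downstream classifier.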
![Page 20: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/20.jpg)
Idea 3: Text Augmentation with Quality Check
Increase data sizes with data augmentation
● Build robust classifiers with data augmentation for text.
● Contextual augmentation using word2vec “synonyms”
● Novel: Added a quality check using doc2vec [Algorithm 1]
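The idea of augmenting with embedding "synonyms" and then filtering by a document-level check can be sketched as follows. This is not Algorithm 1 from the paper: the toy 2-d vectors, the averaged-vector stand-in for doc2vec, the `threshold` value and all function names are assumptions made for illustration.

```python
import math

# Toy 2-d word vectors, invented for illustration; real ones would come
# from a trained word2vec model.
VECTORS = {
    "good": (1.0, 0.1),
    "great": (0.9, 0.2),
    "bad": (-1.0, 0.1),
    "news": (0.1, 1.0),
    "report": (0.2, 0.9),
}


def cosine(a, b):
    """Cosine similarity between two 2-d vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def nearest_word(word):
    """Return the most similar other word in the vocabulary, or None."""
    if word not in VECTORS:
        return None
    others = [w for w in VECTORS if w != word]
    return max(others, key=lambda w: cosine(VECTORS[word], VECTORS[w]))


def doc_vector(tokens):
    """Average the known word vectors (a stand-in for a doc2vec embedding)."""
    known = [VECTORS[t] for t in tokens if t in VECTORS]
    if not known:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(known) for dim in zip(*known))


def augment(sentence, position, threshold=0.9):
    """Swap the token at `position` for its nearest neighbour.

    Quality check: return the new sentence only if its document vector
    has cosine similarity of at least `threshold` with the original;
    otherwise return None.
    """
    tokens = sentence.split()
    replacement = nearest_word(tokens[position])
    if replacement is None:
        return None
    candidate = tokens[:position] + [replacement] + tokens[position + 1:]
    similarity = cosine(doc_vector(tokens), doc_vector(candidate))
    return " ".join(candidate) if similarity >= threshold else None
```

With these toy vectors, `augment("good news", 0)` yields "great news", while `augment("bad", 0)` is rejected because the nearest neighbour of "bad" points in an unrelated direction. The check exists to discard replacements that drift too far from the original meaning.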
![Page 21: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/21.jpg)
Initial Results: Benchmarks
Ran through
Full paper, from the RAIL workshop at LREC: https://arxiv.org/abs/2004.04813
Models:
● Logistic Regression
● Support Vector Classifier
● XGBoost
● MLP
● Comparisons
  ○ TF, TFIDF, W2V
  ○ With and without augmentation
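One way to picture the comparison space is to enumerate it as a grid. The sketch below only lists configurations; it trains nothing. It assumes the models, feature types and augmentation settings are fully crossed, which the slide does not state, and the function name `experiment_grid` is hypothetical.

```python
from itertools import product

# Names taken from the slide; crossing them fully is an assumption.
MODELS = ["Logistic Regression", "Support Vector Classifier", "XGBoost", "MLP"]
FEATURES = ["TF", "TFIDF", "W2V"]
AUGMENTATION = [False, True]


def experiment_grid():
    """List every (model, features, augmented) combination as a dict."""
    return [
        {"model": model, "features": features, "augmented": augmented}
        for model, features, augmented in product(MODELS, FEATURES, AUGMENTATION)
    ]


grid = experiment_grid()
for config in grid[:3]:
    print(config)
```

Under that assumption the grid has 4 x 3 x 2 = 24 configurations per dataset.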
![Page 22: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/22.jpg)
What's Next
![Page 23: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/23.jpg)
One More Thing
Text Augment Library
We release this library as part of this paper
▪ https://github.com/dsfsi/textaugment
![Page 24: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/24.jpg)
Masakhane
Why African Natural Language Processing Now
Dr. Vukosi Marivate & Masakhane Collaborators
ABSA UP Chair of Data Science
Dept. of Computer Science, UP
Masakhane
Machine Translation for Africa
![Page 25: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/25.jpg)
MASAKHANE is a research effort for machine translation for African languages that is
OPEN SOURCE
CONTINENT-WIDE
DISTRIBUTED
ONLINE
![Page 26: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/26.jpg)
Online and Accessible
![Page 27: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/27.jpg)
Masakhane: The Reach
https://masakhane.io
![Page 28: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/28.jpg)
Masakhane: The Reach
https://masakhane.io
2020 so far [Workshop papers]
● ∀. "Masakhane - Machine Translation for Africa." (2020).
● Dossou, Bonaventure FP, and Chris C. Emezue. "FFR V1.0: Fon-French Neural Machine Translation." (2020).
● Orife, Iroro. "Towards Neural Machine Translation for Edoid Languages." (2020).
● Orife, Iroro, et al. "Improving Yorùbá Diacritic Restoration." (2020).
● Marivate, Vukosi, et al. "Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi." (2020).
● Ahia, Orevaoghene, and Kelechi Ogueji. "Towards Supervised and Unsupervised Neural Machine Translation Baselines for Nigerian Pidgin." (2020).
● Van Biljon, Elan, Arnu Pretorius, and Julia Kreutzer. "On optimal transformer depth for low-resource language translation." (2020).
● Öktem, Alp, Mirko Plitt, and Grace Tang. "Tigrinya Neural Machine Translation with Transfer Learning for Humanitarian Response." (2020).
● Martinus, Laura, et al. "Neural Machine Translation for South Africa's Official Languages." (2020).
![Page 29: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/29.jpg)
Masakhane: Impact
EMNLP 2020
![Page 30: A call to action. Dr. Vukosi Marivate ... - USAf](https://reader035.fdocuments.in/reader035/viewer/2022070104/62bc6d573898ce4fa644ac5f/html5/thumbnails/30.jpg)
Thank You
It takes a village (literally)
Keep in touch
Join our research group newsletter https://tinyletter.com/datascience-up/
Made with ❤ in Tshwane
Dr. Vukosi Marivate
[email protected]
https://dsfsi.github.io
@vukosi