NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa...
![Page 1: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/1.jpg)
NLP for low-resourced languages
Teresa Lynn, PhD
Research Fellow
ADAPT Centre
Dublin City University
The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
![Page 2: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/2.jpg)
AI Challenges for
Low-resourced Languages
• Overview of The Irish Language
• NLP with few resources
• Addressing the Lack of Irish Data
• The Future?
![Page 3: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/3.jpg)
Irish language - status
Census (2016): Pop. 4,761,865
Ability to speak: 1,761,420
Daily usage: 73,803
First Official Language
National Language
![Page 4: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/4.jpg)
EU Language status
Official EU Language
Minority Language (low-resourced)
Derogation on official translations (until 2021)
![Page 5: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/5.jpg)
Morphology/ Inflection
LENITION
sa cheantar ‘in the area’
airgead a thuillfeadh sé ‘money he would earn’
a dheartháir ‘his brother’
ECLIPSIS
Tír na nÓg ‘Land of the Youth’
i mBéarla ‘in English’
ar an mbord ‘on the table’
VOWEL HARMONY
Caithim ‘I spend’
Casaim ‘I turn’
Rithfinn ‘I would run’
D’íosfainn ‘I would eat’
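One way to see why these initial mutations matter for data-driven NLP is that each mutated form looks like a new word to a system that only sees character strings. The sketch below is a rough illustration, not a method from the talk: the function name `demutate` and its rule tables are assumptions, and it only covers the lenition and eclipsis patterns shown in the examples above. It ignores other prefixes and words that are always lenited.

```python
# Hypothetical, simplified rules for undoing Irish initial mutations.
# Eclipsis prefixes a letter (or "bh") before the original initial consonant.
ECLIPSIS = {"mb": "b", "gc": "c", "nd": "d", "bhf": "f", "ng": "g", "bp": "p", "dt": "t"}
# Consonants that can take lenition (written as a following "h").
LENITABLE = set("bcdfgmpst")
VOWELS = set("aeiouáéíóú")


def demutate(word):
    """Return the word with a leading lenition or eclipsis removed, if one is found."""
    lower = word.lower()
    # Eclipsis on consonants: keep only the original initial, e.g. mbord -> bord.
    for mutated in ECLIPSIS:
        if lower.startswith(mutated):
            return word[len(mutated) - 1:]
    # Eclipsis on vowels: "n-" before a lowercase vowel, bare "n" before a capital.
    if word.startswith("n-") and len(word) > 2 and lower[2] in VOWELS:
        return word[2:]
    if len(word) > 1 and word[0] == "n" and word[1].isupper() and lower[1] in VOWELS:
        return word[1:]
    # Lenition: drop the "h" after a lenitable initial, e.g. cheantar -> ceantar.
    if len(word) > 1 and lower[0] in LENITABLE and lower[1] == "h":
        return word[0] + word[2:]
    return word


for form in ["cheantar", "dheartháir", "mBéarla", "mbord", "nÓg"]:
    print(form, "->", demutate(form))
```

Without some normalisation of this kind, `bord` and `mbord` are counted as unrelated vocabulary items, which spreads already scarce training data even thinner.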
![Page 6: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/6.jpg)
Inflected Prepositions
le – with: liom ‘with me’, leat ‘with you’
ag – at: agam ‘at me’, agat ‘at you’
faoi – about/under: fúm ‘about/under me’, fút ‘about/under you’
ó – from: uaim ‘from me’, uait ‘from you’
do – to: dom ‘to me’, duit ‘to you’
ar – on: orm ‘on me’, ort ‘on you’
![Page 7: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/7.jpg)
Word Order
VSO
English: ‘I saw the boy’
Irish: Chonaic mé an buachaill
Gloss: Saw I the boy
![Page 8: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/8.jpg)
Irish language technology
31 EU languages
Language resources and technologies
META-NET white paper series (Judge et al., 2012)
EU-led survey
![Page 9: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/9.jpg)
www.adaptcentre.ie
MT
excellent: none
good: English
moderate: French, Spanish
fragmentary: Catalan, Dutch, German, Hungarian, Italian, Polish, Romanian
weak or no support: Basque, Bulgarian, Croatian, Czech, Danish, Estonian, Finnish, Galician, Greek, Icelandic, Irish, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Serbian, Slovak, Slovene, Swedish, Welsh

Speech
excellent: none
good: English
moderate: Czech, Dutch, Finnish, French, German, Italian, Portuguese, Spanish
fragmentary: Basque, Bulgarian, Catalan, Danish, Estonian, Galician, Greek, Hungarian, Irish, Norwegian, Polish, Serbian, Slovak, Slovene, Swedish
weak or no support: Croatian, Icelandic, Latvian, Lithuanian, Maltese, Romanian, Welsh

Text Analysis
excellent: none
good: English
moderate: Dutch, French, German, Italian, Spanish
fragmentary: Basque, Bulgarian, Catalan, Czech, Danish, Finnish, Galician, Greek, Hungarian, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish
weak or no support: Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian, Welsh

Resources
excellent: none
good: English
moderate: Czech, Dutch, French, German, Hungarian, Italian, Polish, Spanish, Swedish
fragmentary: Basque, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Galician, Greek, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
weak or no support: Icelandic, Irish, Latvian, Lithuanian, Maltese, Welsh
![Page 10: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/10.jpg)
Risk of Digital Extinction
“Printing Press resulted in the extinction of many minority and regional languages”
Will technology have the same impact on Irish?
![Page 11: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/11.jpg)
Risk of Digital Extinction
Need to ensure continuing language usage through technology
o Edutainment packages/CALL
o Multi-platform word processing tools
o Automated translation
o Search engines
o Games
o Social media/online data mining
o Text generation (e.g. weather reports)
o Automatic subtitling
o …
![Page 12: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/12.jpg)
Why do we need NLP?
TEXT SUMMARISATION
SENTIMENT ANALYSIS
INFORMATION RETRIEVAL
TEXT MINING
MACHINE TRANSLATION
QUESTION-ANSWERING SYSTEMS
GRAMMAR CHECKING
LANGUAGE LEARNING APPS
RECOMMENDER SYSTEMS
VIDEO SUMMARISATION
![Page 13: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/13.jpg)
• Overview of The Irish Language
• NLP with few resources
• Addressing the Lack of Irish Data
• The Future?
![Page 14: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/14.jpg)
Why is NLP a hard task?
One word/sentence may have many meanings
Many ways of saying the same thing
Meaning depends on context
Literal and figurative language (metaphor)
Language and culture
(different ways of conceptualising the same thing)
![Page 15: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/15.jpg)
Ambiguous Headlines
Syntactic Ambiguity
EYE DROPS OFF SHELF
SQUAD HELPS DOG BITE VICTIM
ENRAGED COW INJURES FARMER WITH AXE
STOLEN PAINTING FOUND BY TREE
PANDA MATING FAILS; VETERINARIAN TAKES OVER
SAFETY EXPERTS SAY SCHOOL BUS PASSENGERS SHOULD BE BELTED
POLICE BEGIN CAMPAIGN TO RUN DOWN JAYWALKERS
Semantic Ambiguity
Source: http://www.alta.asn.au/events/altss_w2003_proc/altss/courses/somers/headlines.htm
![Page 16: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/16.jpg)
What does a machine know about language?
![Page 17: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/17.jpg)
What does a machine know about language?
Sentence = a string/sequence of characters:
“The man saw the boy with the telescope”
![Page 18: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/18.jpg)
SYNTACTIC PARSING 101
Who is doing what? Who has the telescope?
Part of Speech Tagging
“The man saw the boy with the telescope”
DET NOUN VERB DET NOUN PREP DET NOUN
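The tagging step for this sentence can be sketched as a plain dictionary lookup. This is a toy illustration only: `TOY_LEXICON` and `tag_sentence` are hypothetical names, the lexicon covers just this one sentence, and a real tagger would be trained on labelled data so it can resolve words that take more than one tag (for example "saw" as a noun or a verb).

```python
# Hypothetical hand-written lexicon covering only the example sentence.
TOY_LEXICON = {
    "the": "DET",
    "man": "NOUN",
    "saw": "VERB",   # assumed reading; "saw" can also be a noun
    "boy": "NOUN",
    "with": "PREP",
    "telescope": "NOUN",
}


def tag_sentence(sentence):
    """Pair each whitespace-separated token with its tag, or UNK if unseen."""
    return [(token, TOY_LEXICON.get(token.lower(), "UNK")) for token in sentence.split()]


tagged = tag_sentence("The man saw the boy with the telescope")
print(" ".join(label for _, label in tagged))
# prints: DET NOUN VERB DET NOUN PREP DET NOUN
```

The tags say nothing about who has the telescope. That attachment decision is left to the parser.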
![Page 19: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/19.jpg)
“Traditional” Parsing
S ➔ NP VP
S ➔ NP VP PP
NP ➔ Noun | Pronoun
VP ➔ Verb NP | Verb PP
PP ➔ Preposition Noun
Noun ➔ ‘ice-cream’ | ‘summer’
Pronoun ➔ ‘I’
Verb ➔ ‘like’
Preposition ➔ ‘in’
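The rules above are small enough to run directly. The following is a minimal sketch of a top-down, backtracking parser over exactly this grammar. The function names (`parse`, `match_sequence`, `parse_sentence`) and the tuple-based tree format are assumptions made for illustration, not something taken from the slides.

```python
# The grammar from the slide, written as Python data.
GRAMMAR = {
    "S": [["NP", "VP"], ["NP", "VP", "PP"]],
    "NP": [["Noun"], ["Pronoun"]],
    "VP": [["Verb", "NP"], ["Verb", "PP"]],
    "PP": [["Preposition", "Noun"]],
}
LEXICON = {
    "Noun": {"ice-cream", "summer"},
    "Pronoun": {"I"},
    "Verb": {"like"},
    "Preposition": {"in"},
}


def parse(symbol, tokens, start):
    """Yield (tree, end) for each way `symbol` can cover tokens[start:end]."""
    if symbol in LEXICON:
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            yield (symbol, tokens[start]), start + 1
        return
    for rule in GRAMMAR[symbol]:
        for children, end in match_sequence(rule, tokens, start):
            yield (symbol, *children), end


def match_sequence(symbols, tokens, start):
    """Yield (trees, end) for each way the symbol sequence matches from `start`."""
    if not symbols:
        yield [], start
        return
    for tree, middle in parse(symbols[0], tokens, start):
        for trees, end in match_sequence(symbols[1:], tokens, middle):
            yield [tree] + trees, end


def parse_sentence(sentence):
    """Return every parse tree for S that consumes the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


for tree in parse_sentence("I like ice-cream in summer"):
    print(tree)
```

Every word and every construction has to be written into the grammar by hand. A sentence with a single unlisted word yields no parse at all.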
![Page 20: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/20.jpg)
STATISTICAL PARSING
TEXT TEXT TEXT TEXT
![Page 21: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/21.jpg)
Machine Learning in NLP (data-driven approaches)
STRUCTURED DATA
LABELLED DATA
RELIABLE DATA
![Page 22: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/22.jpg)
Machine Learning – data sparsity
![Page 23: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/23.jpg)
Data Envy
![Page 24: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/24.jpg)
![Page 25: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/25.jpg)
Irish Data Sparsity
FUNDING
NUMBER OF SPEAKERS
MORPHOLOGY
SKILL SHORTAGE
![Page 26: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/26.jpg)
• Overview of The Irish Language
• NLP with few resources
• Addressing the Lack of Irish Data
• The Future?
![Page 27: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/27.jpg)
Addressing the lack of data
BOOT-STRAPPING
TRAIN MORE EXPERTS
CROSS-LINGUAL TRANSFER
SYNTHETIC DATA
![Page 28: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/28.jpg)
CROSS-LINGUAL TRANSFER
UNIVERSAL DEPENDENCIES
MULTI-WORD EXPRESSIONS
• Using data from one language to help build a system for another
![Page 29: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/29.jpg)
BOOTSTRAPPING
PASSIVE LEARNING ACTIVE LEARNING
• Using limited data to train a sub-standard system that pre-annotates further data (humans correct its output rather than annotating from scratch)
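The bootstrapping loop described above can be sketched with a deliberately weak tagger. Everything here is an assumption made for illustration: the most-frequent-tag model, the function names, the tag labels, and the use of the unseen-word ratio as a stand-in for model uncertainty. Passive learning would hand the annotator the next sentences in order. Active learning, as sketched here, picks the sentences the current model is least sure about.

```python
from collections import Counter, defaultdict


def train_tagger(annotated):
    """Learn the most frequent tag per word from a small seed set."""
    counts = defaultdict(Counter)
    for sentence in annotated:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def pre_annotate(model, sentence):
    """Produce a draft annotation for a human to correct."""
    return [(word, model.get(word.lower(), "UNK")) for word in sentence.split()]


def uncertainty(model, sentence):
    """Assumed proxy for uncertainty: share of words the model has never seen."""
    words = sentence.split()
    if not words:
        return 0.0
    return sum(word.lower() not in model for word in words) / len(words)


def select_for_correction(model, pool, k):
    """Active learning step: pick the k sentences the model is least sure about."""
    return sorted(pool, key=lambda s: uncertainty(model, s), reverse=True)[:k]


# Seed annotation: the word-order example sentence, with assumed tags.
seed = [[("Chonaic", "VERB"), ("mé", "PRON"), ("an", "DET"), ("buachaill", "NOUN")]]
model = train_tagger(seed)

pool = ["Chonaic mé an bord", "ar an mbord"]
for sentence in select_for_correction(model, pool, k=1):
    print(pre_annotate(model, sentence))
```

After the annotator corrects the draft, the corrected sentence joins the seed set and the tagger is retrained, so each round should start from a slightly better draft.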
![Page 30: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/30.jpg)
SYNTHETIC DATA
e.g. Back Translation for Machine Translation
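Back translation can be sketched as follows for an English-to-Irish system. Monolingual Irish text is translated into English by a reverse (Irish-to-English) model, and the machine-made English is paired with the authentic Irish as extra training data. The word-for-word `toy_reverse_model` below is a hypothetical stand-in for a real reverse MT system, and all names are assumptions.

```python
# Hypothetical word-for-word dictionary standing in for a trained Irish->English model.
IRISH_TO_ENGLISH = {"chonaic": "saw", "mé": "I", "an": "the", "buachaill": "boy"}


def toy_reverse_model(irish_sentence):
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(IRISH_TO_ENGLISH.get(word.lower(), word) for word in irish_sentence.split())


def back_translate(monolingual_target, reverse_model):
    """Create (synthetic source, authentic target) pairs from target-language text."""
    return [(reverse_model(sentence), sentence) for sentence in monolingual_target]


# Authentic parallel data: (English source, Irish target).
authentic = [("I saw the boy", "Chonaic mé an buachaill")]
# Irish-only text with no human translation available.
monolingual_irish = ["Chonaic an buachaill mé"]

synthetic = back_translate(monolingual_irish, toy_reverse_model)
training_data = authentic + synthetic
print(synthetic)
# prints: [('saw the boy I', 'Chonaic an buachaill mé')]
```

The synthetic English side is noisy, but the Irish side is genuine, so the forward system still learns from well-formed output sentences.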
![Page 31: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/31.jpg)
On that MT note…
Tapadóir SMT system (BLEU 46)
SMT vs NMT (NMT BLEU 40)
Domain-tuning, linguistic features (hybrid)
Increasing data collection (European Language Resource Coordination)
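For readers unfamiliar with the BLEU figures quoted above, the metric scores n-gram overlap between system output and a reference translation, on a scale reported here as 0 to 100. The sketch below is a simplified sentence-level version with one reference and no smoothing. Published scores are normally computed over a whole test corpus with a standard tool, so the function name `simple_bleu` and its details are illustrative assumptions only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision gives a zero score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * penalty * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat on the mat", "the cat sat on a mat"))
```

Because the score rewards exact word-sequence matches, a highly inflected language can be penalised for output that differs from the reference only in an ending or an initial mutation.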
![Page 32: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/32.jpg)
• Overview of The Irish Language
• NLP with few resources
• Addressing the Lack of Irish Data
• The Future?
![Page 33: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/33.jpg)
Linguistic Resources
Corpora
Knowledge Bases
NLP Tools
NLG Tools
Speech Models
Speech Synthesis
Speech Recognition
Spoken Dialogue Systems
Machine Translation
Information Retrieval
State and Public Use
CALL
Disability and Access
Synergies (Industry and Public)
Digital Strategy for the Irish Language 2019
![Page 34: NLP for low-resourced languages - Ml Dublin.Github.Io · NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded](https://reader034.fdocuments.in/reader034/viewer/2022052611/5f04583f7e708231d40d84ce/html5/thumbnails/34.jpg)
TRAINING MORE EXPERTS
Machine Translation
Irish Twitter Analysis
Processing Irish Multiword Expressions
Irish Syntactic Parsing