Readability Assessment and Text Simplification for Basque ...

Readability Assessment and Text Simplification for Basque in the Ixa Group

Itziar Gonzalez-Dios
Supervisors: María Jesús Aranzabe and Arantza Díaz de Ilarraza

IXA NLP Group, University of the Basque Country (UPV/EHU)
ixa.eus/Ixa
@IxaGroup

Pisa, 2015

Outline

1 Ixa Group

2 Basque language

3 Readability Assessment

4 Text Simplification
  Lexical Simplification
  Syntactic Simplification

5 Current and near future work

Ixa Group

Research group at the University of the Basque Country (UPV/EHU)

Since 1988

64 members

10 subgroups

Computer Science Faculty of Donostia-San Sebastian

Our philosophy

Bottom-up approach (progressive development)

Reuse of resources and tools

Open source: Ixa pipes http://ixa2.si.ehu.es/ixa-pipes/

Research lines

Creation of basic resources (linguistic resources and processors):
  Corpora, dictionaries, ontologies
  Computational lexicography, morphology, syntax, semantics, pragmatics and discourse

Operational aspects (integration of language tools):
  Corpus processing
  Parallel processing
  Corpus annotation

Language technology applications:
  Information extraction and question answering
  Machine translation
  Language teaching/learning

Research lines

Projects (now):
  European: 4
  National: 4
  Regional: 3

PhD theses:
  In progress: 19
  Done: 38

Languages:
  Mainly Basque
  English, Spanish
  Quechua

Basque language

Origin:
  Pre-Indo-European language
  Language isolate

Today, 5 dialects + standard (+ 2 almost lost, + another one documented)

Geographical domain:

Sociolinguistic info

800,000 native speakers; 1 million understand or speak some Basque

Official: Araba, Bizkaia and Gipuzkoa (the Autonomous Community of the Basque Country); the north of Navarre

Not official: Lapurdi, Behe-Nafarroa and Zuberoa (together with Béarn, in Pyrénées-Atlantiques); Navarre

Typology

Agglutinative; Case system: ergative-absolutive; 18 case endings

Head final; Free word order at sentence level

6 vowels, 25 consonants

Example sentences

(1) Mutilak  sagarraø   jan      du.
    boy-erg  apple-abs  eat-prf  aux-3sgerg3sgabs.prs.ind

    ’The boy has eaten an apple.’

(2) Sagardoaø  dastatzeko     prestø     dagoenean
    cider-abs  taste-ven.adn  ready-abs  stare-3sgabs.prs.comp.loc

    irekitzen  dira           sagardotegiakø,
    open-ipf   be-3plabs.prs  cider-house-pl.abs

    normalean   urtarrilaren  20tik   Aste Santura.
    normal-loc  january-gen   20-abl  Easter-adl

    ’Cider houses open when the cider is ready to taste, usually from the 20th of January to Easter.’

Readability Assessment

IAS: essay scoring system

ErreXail: simple vs. complex

Ion Madrazo’s work: B1, B2, C1, C2

IAS

Idazlanen Autoebaloaziorako Sistema (IAS): auto-evaluation of essays (Castro-Castro et al., 2008)

  Number of clauses per sentence
  Types of sentences (questions, negations...)
  Clause types (temporal, causal...)
  PoS types
  Number of lemmas

ErreXail

Readability assessment system (Gonzalez-Dios et al., 2014)

  Measures 96 ratios based on linguistic information
  Uses machine learning techniques
  Two corpora of popular science texts were collected, one for adults and one for children

ErreXail: Linguistic features

Global features: sentence length, word length, sentence number (3 ratios)

Lexical features: PoS, lemmas, named entities... (39 ratios)

Morphological features: case markers, verb types, verb morphology... (24 ratios)

Morphosyntactic features: noun phrases, verb phrases, appositions (5 ratios)

Syntactic features: subordinate clauses (10 ratios)

Pragmatic features: types of connectors and conjunctions (12 ratios)
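
To make the notion of a ratio-based feature concrete, here is a minimal sketch that computes two such ratios from a PoS-tagged token list. The tag names (`PROPN`, `NOUN`, ...), the hand-assigned lemmas and the helper names `safe_ratio` and `ratio_features` are assumptions made for illustration; they are not ErreXail's actual tagset or code.

```python
from collections import Counter


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ratio_features(tagged_tokens):
    """Compute two illustrative ratios from (word, lemma, pos) triples."""
    pos_counts = Counter(pos for _, _, pos in tagged_tokens)
    lemmas = [lemma for _, lemma, _ in tagged_tokens]
    return {
        # Lexical ratio: proper nouns over common nouns
        "proper_per_common_nouns": safe_ratio(pos_counts["PROPN"], pos_counts["NOUN"]),
        # Lexical ratio: distinct lemmas over all lemmas
        "unique_lemmas_per_lemmas": safe_ratio(len(set(lemmas)), len(lemmas)),
    }


# Hand-tagged toy input (tags and lemmas are illustrative assumptions)
tokens = [
    ("Mutilak", "mutil", "NOUN"),
    ("sagarra", "sagar", "NOUN"),
    ("jan", "jan", "VERB"),
    ("du", "*edun", "AUX"),
]
features = ratio_features(tokens)
```

In a setup like this, each document would yield one vector of such ratios, which a classifier can then consume.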

ErreXail: Classification results

Experiments                       Accuracy (%)
All features                      89.50
Lexical features                  90.75
Lex + Morph + Morpho-synt + Synt  93.50

Table: Classification results with SMO and 10-fold cross-validation

ErreXail: Most predictive features

Features and groups                                             Relevance (InfoGain)
Proper nouns / common nouns ratio (Lex.)                        0.2744
Appositions / noun phrases ratio (Morpho-synt.)                 0.2529
Appositions / all phrases ratio (Morpho-synt.)                  0.2529
Named entities / common nouns ratio (Lex.)                      0.2436
Unique lemmas / all the lemmas ratio (Lex.)                     0.2394
Acronyms / all the words ratio (Lex.)                           0.2376
Causative verbs / all the verbs ratio (Lex.)                    0.2099
Modal-temporal clauses / subordinate clauses ratio (Synt.)      0.2056
Destinative case endings / all the case endings ratio (Morph.)  0.1968
Connectors of clarification / all the connectors ratio (Prag.)  0.1957

Table: Most predictive features
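
For readers unfamiliar with the relevance measure in the table, the following is a minimal sketch of information gain for a single numeric feature. It assumes the feature is binarised at a fixed, hand-picked threshold, which is a simplification: the slides do not say how numeric ratios were discretised. The toy values and the function names are invented for illustration and are not taken from the ErreXail corpora.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting the labels on value <= threshold."""
    low = [lab for val, lab in zip(values, labels) if val <= threshold]
    high = [lab for val, lab in zip(values, labels) if val > threshold]
    # Weighted entropy of the two partitions (empty partitions contribute nothing)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in (low, high) if part)
    return entropy(labels) - remainder


# Toy, made-up values of one ratio feature for four documents
ratios = [0.05, 0.10, 0.40, 0.55]
labels = ["simple", "simple", "complex", "complex"]
gain = information_gain(ratios, labels, threshold=0.25)
```

A threshold that separates the two classes perfectly gives the maximum gain for a balanced two-class problem (1 bit); a less informative split gives a smaller value.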

Ion Madrazo’s master’s thesis (2014)

More linguistic features:
  Dependencies
  Depth of the syntactic tree
  N-grams at PoS and dependency level
  Use of synonyms
  Latent Semantic Analysis

Other ML techniques:
  Algorithms to choose the features (Information Gain and Correlation Feature Selection)
  Meta-algorithms for classification (Ordinal Classification and Cost-Sensitive Learning)

Ion Madrazo’s master’s thesis: Results

Most significant features for each level (B1, B2, C1, C2)

Best results with multinomial Naive Bayes: 61.69% accuracy

State-of-the-art results

Similar results with meta algorithms

Lexical simplification

Begun in 2015

Maria Eguimendia’s work for her master’s thesis

Resources:
  A lemma frequency list from the Corpus Lexikoaren Behatokia (41,773,391 words)
  Basque WordNet
  UKB (Word Sense Disambiguation)
  NAF as input (multilingual)

Syntactic Simplification

Begun in 2011

Two main lines:
  Linguistic analysis of complex sentences to propose simplification rules
  Developing or adapting the tools to perform the automatic simplification (architecture of the EuTS system)

Linguistic analysis: Resources

Corpora:
  Reference Corpus for the Processing of Basque
  Consumer Corpus (used in Machine Translation)
  Wikipedia
  Elhuyar Corpus (popular science)

Grammar:
  Descriptive Grammar of Basque by Euskaltzaindia (Academy)

Linguistic analysis: Tasks

Analysis of each clause type to propose simplification rules

Define a simplification process

Analysis of the frequency and position of each adverbial structure found in the grammar

Check if the proposed rules are also valid in other domains

Linguistic analysis: Simplification process

Splitting: Make as many new sentences as there are clauses in the original

Reconstruction: Two operations take place:
  Removing morphological features that are no longer needed
  Adding adverbs or phrases to maintain the meaning

Reordering: Reorder the elements in the new sentences, and order the sentences in the text

Correction: Correct possible grammar and spelling mistakes, and fix punctuation and capitalisation

Example

Simplification proposal of concessive clauses

(3) a. Hasiera batean aste honetan partidurik ez jokatzea aurreikusita zegoen arren, azken orduan ostiralean partidu bat jokatu nahi izan du Athleticek.
       (Although it was not foreseen to play a match this week, at the last moment Athletic Bilbao has decided to play one on Friday.)

    b. i. Hasiera batean aste honetan partidarik ez jokatzea aurreikusita zegoen.
          (It was not foreseen to play a match this week.)

       ii. Hala ere, azken orduan ostiralean partida bat jokatu nahi izan du Athleticek.
           (However, at the last moment Athletic Bilbao has decided to play one on Friday.)
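
The concessive proposal can be mimicked with a toy string rule, sketched below. This is only an illustration under strong assumptions: it looks for the literal marker " arren, " instead of using clause boundaries from a syntactic analysis, and it performs no morphological generation, so it is not how EuTS is implemented. The function name `split_concessive` and its parameters are hypothetical.

```python
def split_concessive(sentence, marker=" arren, ", connector="Hala ere, "):
    """Split 'X arren, Y.' into 'X.' and 'Hala ere, Y.' (toy rule)."""
    if marker not in sentence:
        return [sentence]  # the rule does not apply; leave the sentence untouched
    subordinate, main = sentence.split(marker, 1)
    # Splitting: one new sentence per clause.
    # Reconstruction: the concessive marker is dropped and a connector is added.
    # Correction: the first sentence is closed with a full stop.
    return [subordinate + ".", connector + main]


original = (
    "Hasiera batean aste honetan partidurik ez jokatzea aurreikusita "
    "zegoen arren, azken orduan ostiralean partidu bat jokatu nahi "
    "izan du Athleticek."
)
simplified = split_concessive(original)
```

On this input the rule yields the two-sentence structure shown in (3b). In general the verb of the subordinate clause would also need its subordination morphology removed, which this sketch does not attempt.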

Linguistic analysis: Simplification levels

1 Syntactic Substitution Simplification (SSS): Frequency-based simplification of syntactic structures

2 Natural Simplification (NS): Compound and complex sentences with finite verbs are simplified following the simplification process, together with the SSS

3 Strong or absolute simplification (AS): Everything is simplified (finite and non-finite verbs + SSS)

4 Tailored or customised simplification (CS): Only the phenomena that are needed or required are simplified

Syntactic Substitution Simplification (SSS)

-tzearren ‘(in order) to’: 1.86% (15 instances)

-tzeko ‘(in order) to’: 88.38% (791 instances)

SSS of non-finite purpose clauses

(4) a. Abuztuaren amaieran beste goi bilera bat egitea aztertzen ari dira Israel eta PAN Palestinako Aginte Nazionala, Ekialde Erdiko bake prozesua suspertzearren.
       (Israel and the PNA, Palestinian National Authority, are studying whether to hold another summit at the end of August to promote the peace process in the Middle East.)

    b. Abuztuaren amaieran beste goi bilera bat egitea aztertzen ari dira Israel eta PAN Palestinako Aginte Nazionala, Ekialde Erdiko bake prozesua suspertzeko.
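
The substitution in example (4) can be sketched as a frequency-motivated suffix replacement. The code below is a toy under stated assumptions: it matches the suffix as a plain string at the end of a word, whereas a real system would presumably rely on morphological analysis. The mapping table and the function name `substitute_structures` are hypothetical; only the -tzearren / -tzeko pair comes from the slide.

```python
import re

# Rarer structure -> more frequent equivalent, as in example (4)
SUBSTITUTIONS = {"tzearren": "tzeko"}


def substitute_structures(sentence, substitutions=SUBSTITUTIONS):
    """Replace each rare word-final suffix with its more frequent counterpart."""
    for rare, frequent in substitutions.items():
        # \b restricts the match to the end of a word
        sentence = re.sub(rf"{re.escape(rare)}\b", frequent, sentence)
    return sentence


simplified = substitute_structures("Ekialde Erdiko bake prozesua suspertzearren.")
```

Keeping the pairs in a table means further substitutions could be added without touching the function, provided each pair is backed by corpus frequencies.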

Architecture of the EuTS system

Architecture of the EuTS system: Developed or adapted tools

Improved the clause boundary detection grammar

Developed an apposition detector

Developed a readability assessment system, ErreXail

Implemented a splitting algorithm (and reconstruction for the relative clauses)

Developed Biografix, a multilingual tool that simplifies biographical data

SSS (Syntactic Substitution Simplification)

Biografix: Example

Living people (original)

(5) Karlos Argiñano Urkiola, nazioartean Karlos Arguiñano grafiaz ezagunagoa, (Beasain, Gipuzkoa, 1948ko irailaren 6a) sukaldari, aktore eta enpresaburu euskalduna da.
    ’Karlos Argiñano Urkiola, internationally better known by the spelling Karlos Arguiñano, (Beasain, Gipuzkoa, 6th September, 1948) is a Basque chef, actor and businessman.’

Biografix: Example

Living people (simplified)

(6) a. Karlos Argiñano Urkiola, nazioartean Karlos Arguiñano grafiaz ezagunagoa, sukaldari, aktore eta enpresaburu euskalduna da.
       ’Karlos Argiñano Urkiola, internationally better known by the spelling Karlos Arguiñano, is a Basque chef, actor and businessman.’

    b. Karlos Argiñano 1948ko irailaren 6an Beasainen jaio zen.
       ’Karlos Argiñano was born on the 6th of September, 1948 in Beasain.’

    c. Beasain Gipuzkoan dago.
       ’Beasain is in Gipuzkoa.’

Available at:
  http://ixa.si.ehu.es/Ixa/Produktuak/1403535629
  https://github.com/itziargd/Biografix
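
The first step of this kind of simplification, pulling the parenthetical biographical data out of the sentence, can be sketched as follows. This is not Biografix's code: the function name `extract_parenthetical` and the comma-splitting of the fields are assumptions for illustration. Generating sentences such as (6b) and (6c) additionally requires case inflection (6an, Beasainen, Gipuzkoan), which this sketch does not attempt.

```python
import re


def extract_parenthetical(sentence):
    """Remove the first '(...)' and return (shortened sentence, list of fields)."""
    match = re.search(r"\(([^()]*)\)\s*", sentence)
    if match is None:
        return sentence, []
    # Fields inside the parentheses are assumed to be comma-separated
    fields = [part.strip() for part in match.group(1).split(",")]
    shortened = sentence[:match.start()] + sentence[match.end():]
    return shortened, fields


original = (
    "Karlos Argiñano Urkiola, nazioartean Karlos Arguiñano grafiaz "
    "ezagunagoa, (Beasain, Gipuzkoa, 1948ko irailaren 6a) sukaldari, "
    "aktore eta enpresaburu euskalduna da."
)
shortened, fields = extract_parenthetical(original)
```

Here `shortened` corresponds to sentence (6a), and `fields` holds the place, province and date from which the new sentences would be built.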

Evaluation

Extrinsic and manual evaluation of Biografix

Manual evaluation of SSS

Planned evaluations:
  Compare our rules to various approaches of simplification (Corpus of Simplified Text)
  Extrinsic evaluation through machine translation (which translator?)
  Comprehension tests (crowdsourcing platforms)

Corpus of Simplified Text

First phase:
  3 texts of popular science (medicine, technology and history)
  3 annotators (different backgrounds):
    A court translator with no background in simplification
    A teacher of Basque as a foreign language
    A philosopher and writer who writes literature in easy Basque (intuitive)
  Which operations do they perform?
  Do they perform the same operations as one another?
  Are those operations similar to ours?

Second phase: other domains

Current and near future work

Implementation of EuTS:
  Adaptation of the analysis output for the morphology generator
  Formalisation of the rules written after the linguistic analysis

Waiting for the annotators of the Corpus of Simplified Text -> Analysis of the operations

Exploring the other evaluation possibilities
