Building Resources for Human and Computational Language Processing of Portuguese
Sílvio Cordeiro, Carlos Ramisch, Marco Idiart, Rodrigo Wilkens, Leonardo Zilio, Jorge Wagner, Aline Villavicencio
Federal University of Rio Grande do Sul (Brazil)
Aix Marseille Université, CNRS, LIF UMR 7279 (France)
University of Essex (UK)



MWE-aware processing with the mwetoolkit

Multiword Expressions in a Nutshell
•  A combination of words that must be treated as a unit at some level of linguistic processing (Calzolari et al., 2002)
  o  Compound nouns
  o  Verb-particle constructions
  o  Light-verb constructions
  o  Idioms

Multiword Expressions in a Nutshell
•  Lexical, syntactic, semantic, pragmatic, statistical idiosyncrasies
  o  ad hoc, wine and dine (Kim and Baldwin 2010)
•  Arbitrariness and institutionalisation
  o  salt and pepper, ?pepper and salt (Smadja, 1993)
•  Frequency
  o  Same order of magnitude as words in mental lexicon (Jackendoff, 1997)
•  Limited lexical, syntactic and semantic variability
  o  kick the bucket / ?pail / ?container (Sag et al., 2002)

MWEs and NLP
•  Real understanding requires MWE-aware corpus processing
  1.  Corpus processing
  2.  MWE discovery from corpus (to build MWE lexica)
  3.  MWE representation (in lexicon and grammar)
  4.  MWE token identification in corpus (to annotate MWEs)
  5.  MWE semantic processing
  6.  MWE integration in applications

mwetoolkit
•  Language-independent framework for MWE processing
•  Extracts MWEs from corpora
•  Annotates corpora with MWEs
•  Calculates association measures (AMs)
•  Pre-processes MWEs in corpora for DSM construction
•  Imports DSMs (word2vec, GloVe, PPMI)
•  Provides functions for vector combinations
•  Calculates compositionality
•  Evaluates against gold standard

Project CAPES-COFECUB (France-Brazil)

Overview of the mwetoolkit

MWE Type Discovery
•  Candidate extraction
  o  Pattern-based heuristics (e.g. noun-noun, verb-particle…)
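A minimal sketch of pattern-based candidate extraction, assuming the corpus is already POS-tagged. The tag names, the `PATTERNS` table and the `extract_candidates` function are hypothetical illustrations and are not the mwetoolkit's own pattern syntax or API.

```python
from collections import Counter

# Hypothetical POS patterns; the tag set is an assumption, not from the slides.
PATTERNS = {
    "noun-noun": ("NOUN", "NOUN"),
    "verb-particle": ("VERB", "PART"),
}

def extract_candidates(tagged_sentences, patterns=PATTERNS):
    """Count contiguous token sequences whose POS tags match a pattern."""
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for name, pattern in patterns.items():
            n = len(pattern)
            for i in range(len(sentence) - n + 1):
                if tuple(tags[i:i + n]) == pattern:
                    words = tuple(word.lower() for word, _ in sentence[i:i + n])
                    counts[(name, words)] += 1
    return counts

# Toy corpus of (word, POS) pairs, invented for illustration.
corpus = [
    [("the", "DET"), ("access", "NOUN"), ("road", "NOUN"), ("was", "AUX"), ("closed", "VERB")],
    [("they", "PRON"), ("gave", "VERB"), ("up", "PART")],
    [("an", "DET"), ("access", "NOUN"), ("road", "NOUN")],
]
candidates = extract_candidates(corpus)
```

The resulting counts are the kind of input an association measure would then rank to separate true MWEs from chance co-occurrences.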

MWE-aware corpus processing
•  MWE token identification
  o  Pattern-based heuristics
  o  Contiguous/gappy identification
  o  Shortest/longest/all match distances
  o  Projecting extracted MWE types back in source corpus
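The difference between contiguous and gappy identification can be illustrated with a small sketch. The function `find_mwe` and its `max_gap` parameter are hypothetical names chosen for this example. The matcher is greedy (it takes the nearest occurrence of each next word), which is only one of the possible match-distance policies listed above.

```python
def find_mwe(tokens, mwe, max_gap=0):
    """Return index tuples where the words of `mwe` occur in order.

    max_gap=0 gives contiguous matching; max_gap=k allows up to k
    intervening tokens between consecutive MWE words (gappy matching).
    Greedy: for each start, the nearest occurrence of each next word is taken.
    """
    matches = []
    for start, token in enumerate(tokens):
        if token != mwe[0]:
            continue
        positions = [start]
        for word in mwe[1:]:
            prev = positions[-1]
            window = range(prev + 1, min(prev + 2 + max_gap, len(tokens)))
            nxt = next((j for j in window if tokens[j] == word), None)
            if nxt is None:
                break
            positions.append(nxt)
        else:
            # Inner loop finished without break: every MWE word was found.
            matches.append(tuple(positions))
    return matches

tokens = "she took the whole situation into account".split()
contiguous = find_mwe(tokens, ("took", "into", "account"))
gappy = find_mwe(tokens, ("took", "into", "account"), max_gap=3)
```

Here the contiguous search finds nothing, while the gappy search recovers the expression across the three intervening tokens.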

Annotation Options
•  Corpus but no MWE list:
  o  Generate MWE list from corpus and project it back
    •  Corpus → MWE list → Annotated corpus
•  Corpus and MWE list:
  o  Annotation based on external lists of MWEs
    •  Corpus + MWE list → Annotated corpus

MWE semantic processing
•  Meaning of MWE may not be understood from meaning of individual words
  o  brick wall is a wall made of bricks
  o  cheese knife is not a knife made of cheese → knife for cutting cheese (Girju et al., 2005)
  o  loan shark is not a shark for loans but a person who offers loans at extremely high interest rates

How to detect compositionality?
•  Distributional Semantic Models (DSMs)
  o  Position words in multidimensional semantic space
    •  Each word/MWE represented as a vector in the semantic space
  o  Proximity in space indicates semantic relatedness

[Figure: compositionality vs. idiomaticity scale, with example compounds: cloud nine, access road, grandfather clock]

How to detect compositionality?
•  Cosine similarity between the MWE vector and the sum of the vectors of the component words
  o  The closer the two vectors are, the more compositional the MWE is (Reddy et al. 2011)
  o  cos(w1w2 vector, w1 vector + w2 vector)
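The score above can be sketched in a few lines of plain Python. The vectors are toy values invented for illustration, and the function names are hypothetical; in practice the vectors would come from a trained DSM.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def compositionality(mwe_vec, w1_vec, w2_vec):
    """cos(w1w2 vector, w1 vector + w2 vector)."""
    composed = [a + b for a, b in zip(w1_vec, w2_vec)]
    return cosine(mwe_vec, composed)

# Toy 3-dimensional vectors, invented for illustration only.
w1 = [1.0, 0.0, 0.0]
w2 = [0.0, 1.0, 0.0]
literal_mwe = [2.0, 2.0, 0.0]    # points the same way as w1 + w2
idiomatic_mwe = [0.0, 0.0, 3.0]  # orthogonal to w1 + w2

score_literal = compositionality(literal_mwe, w1, w2)
score_idiomatic = compositionality(idiomatic_mwe, w1, w2)
```

A score near 1 suggests the compound behaves like the sum of its parts; a score near 0 suggests an idiomatic reading.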

Distributional Semantic Models
•  Techniques and tools for constructing DSMs
  o  Dissect (Dinu et al., 2013), Minimantics (Ramisch et al. 2013), word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)

[Figure: tool logos: Minimantics, word2vec, dissect, LexVec (Lexical Vectors)]

Gold Standards for Evaluation
•  Roller et al. (2013): 244 German compounds
  o  around 30 judgments by crowdsourcing, scale from 1 to 7
•  Farahmand et al. (2015): 1,042 English compounds
  o  4 expert judges, binary scale for non-compositionality and conventionality
•  Reddy et al. (2011): 90 English compounds
  o  around 30 judgments by crowdsourcing, scale from 0 to 5
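The slides do not say how model scores are compared with these gold standards. One common choice, assumed here, is Spearman's rank correlation between the model's compositionality scores and the mean human judgments. The sketch below is a plain-Python illustration with invented numbers, not the mwetoolkit's evaluation code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

human = [4.8, 0.4, 3.1, 2.0]      # invented mean judgments on a 0 to 5 scale
model = [0.91, 0.10, 0.35, 0.40]  # invented cosine-based scores
rho = spearman(human, model)
```

Rank correlation is a natural fit because the human scale (0 to 5 or 1 to 7) and the cosine range are not directly comparable, while their orderings are.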

DSMs and Compositionality
•  Dataset of nominal compounds with human judgments about literality/compositionality
  o  180 compounds for English, French and Portuguese
  o  Resource freely available
    •  http://pageperso.lif.univ-mrs.fr/~carlos.ramisch/?page=downloads/compounds&lang=en

DSMs and Compositionality
•  Dataset of Lexical Substitution of Nominal Compounds in Portuguese (LexSubNC)
  o  180 compounds for Portuguese
  o  Resource freely available
    •  http://pageperso.lif.univ-mrs.fr/~carlos.ramisch/?page=downloads/compounds&lang=en

Collecting Human Judgments
•  Judgments with Likert scale (0 to 5)
  o  For compound
  o  For w1 and w2 separately
•  Agreement for Portuguese
  o  For subset of annotators
    •  α = .52 for head
    •  α = .36 for modifier
    •  α = .42 for compound
  o  Same annotator after 1 month: 0.59 for compound

Collecting Human Judgments - Agreement
•  Greater agreement between score for compound and head (or modifier) for extremes
  o  totally idiomatic and fully compositional
•  For PT and FR, compound score determined by score of the least literal word

Agreement
•  Most/least variation in scores (average ± σ score)
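As an illustration of the "average ± σ" summary, the sketch below computes the mean and standard deviation of the judgments per compound and sorts compounds from least to most variation. The compound names and scores are invented, and using the population standard deviation is an assumption; the slides do not say which estimator was used.

```python
from statistics import mean, pstdev

# Invented Likert judgments (0 to 5) from four hypothetical annotators.
judgments = {
    "compound_a": [5, 5, 4, 5],
    "compound_b": [0, 5, 2, 4],
    "compound_c": [1, 0, 0, 1],
}

# (average, sigma) per compound.
summary = {name: (mean(scores), pstdev(scores)) for name, scores in judgments.items()}

# Compounds ordered from least to most variation in scores.
by_variation = sorted(summary, key=lambda name: summary[name][1])
```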

The models
•  WaCky Corpora (Baroni et al., 2009):
  o  ukWaC for English (∼2 billion tokens)
  o  frWaC (∼1.6 billion tokens) for French
  o  brWaC (∼2.3 billion tokens) for Portuguese (Wagner Filho et al. 2016)
  o  Pre-processing
    •  surface+: the original corpus
    •  surface: with stopword removal
    •  lemma: stopword removal and lemmatization
    •  lemmaPOS: stopword removal, lemmatization and POS-tagging
  o  Context window size: 1, 4 and 8
  o  Dimension size: 250, 500, 750
•  DSMs
  o  PPMI models: positive PMI (Minimantics)
  o  GloVe (Pennington et al. 2014)
  o  word2vec (Mikolov et al. 2013): Skip-gram, CBOW
  o  LexVec

[Figure panels: French, Portuguese, English]
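To make the context window and PPMI weighting concrete, here is a minimal sketch that counts co-occurrences within a symmetric window and converts them to positive PMI. It is a toy illustration with invented sentences and hypothetical function names, not the Minimantics implementation, and it omits the dimensionality settings listed above.

```python
import math
from collections import Counter

def cooccurrences(sentences, window=1):
    """Count (target, context) pairs within `window` tokens on each side."""
    pairs = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(target, tokens[j])] += 1
    return pairs

def ppmi(pairs):
    """Positive PMI: max(0, log2(p(t, c) / (p(t) * p(c))))."""
    total = sum(pairs.values())
    target_totals, context_totals = Counter(), Counter()
    for (t, c), n in pairs.items():
        target_totals[t] += n
        context_totals[c] += n
    scores = {}
    for (t, c), n in pairs.items():
        pmi = math.log2(n * total / (target_totals[t] * context_totals[c]))
        scores[(t, c)] = max(0.0, pmi)  # negative associations are clipped
    return scores

# Invented two-word "sentences" so the counts are easy to check by hand.
sentences = [["red", "wine"]] * 3 + [["fast", "car"]] * 3 + [["red", "car"]]
weights = ppmi(cooccurrences(sentences, window=1))
```

In this toy corpus "red wine" gets a positive weight, while "red car" occurs less often than chance would predict and is clipped to zero.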

Resources for Text Simplification for Portuguese
Rodrigo Wilkens, Leonardo Zilio, Marco Idiart, Jorge Wagner Filho, Eduardo Ferreira, Luis Mollmann, Bianca Pasqualini, Aline Villavicencio
Federal University of Rio Grande do Sul (Brazil)

Text Simplification (TS)
•  TS quality dependent on resources
  o  English:
    •  Corpora:
      o  Simple English Wikipedia parallel corpus, with alignments between the Simple and the Standard English Wikipedia
      o  Penn Treebank, British National Corpus, ukWaC
    •  Resources for Lexical Substitutions: WordNet, Roget, Moby
    •  Gold Standard Substitution Lists: SemEval Lexical Substitution Task
  o  Portuguese:
    •  Corpora
      o  PorSimples (Aluísio et al.)
    •  Thesauri
      o  WordNet.Pt, OntoPT

Text Simplification
•  Two main tasks (Shardlow, 2014):
  o  lexical simplification (LS)
    •  replacing complex expressions with simpler synonyms
  o  syntactic simplification (SS)
    •  changing the structure of a sentence by using simpler syntactic constructions (Siddharthan, 2002)
  o  TS with MT techniques for monolingual translation
    •  learning alignments between simple and standard sentences

General Corpora
•  WaCky (Baroni et al. 2009)
  o  ukWaC (Baroni et al. 2009)
  o  brWaC (Boos et al. 2014, Wagner Filho et al. 2018)
•  Crawling, from medium frequency content words as seeds
  o  Linguateca Corpora Frequency List
•  Cleaning
  o  HTML and boilerplate stripping, using density metrics and shallow text features
•  Near-duplicate detection and removal
  o  pairwise comparison of all documents
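The slides only state that near-duplicates are found by pairwise comparison of all documents. As one possible reading, the sketch below compares word-trigram sets with Jaccard similarity. The shingle size, the 0.5 threshold, the function names and the example texts are all assumptions made for illustration, not details of the brWaC pipeline.

```python
from itertools import combinations

def shingles(text, n=3):
    """Set of word n-grams; short texts fall back to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicates(docs, threshold=0.5, n=3):
    """Return index pairs whose shingle sets overlap at least `threshold`."""
    sets = [shingles(d, n) for d in docs]
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if jaccard(sets[i], sets[j]) >= threshold]

# Invented documents; the first two differ only in the last word.
docs = [
    "o governo anunciou novas medidas para a economia hoje",
    "o governo anunciou novas medidas para a economia ontem",
    "a seleção venceu o jogo no domingo",
]
pairs = near_duplicates(docs)
```

Exhaustive pairwise comparison is quadratic in the number of documents, so at web-corpus scale it is usually combined with some form of hashing or blocking; that refinement is outside this sketch.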

Simple Corpora
•  For English
  o  Simple English Wikipedia aligned with English Wikipedia (Coster and Kauchak, 2011)
•  For Portuguese
  o  Coleção É Só o Começo (Wilkens et al. 2014)
    •  5 books manually simplified by linguists
  o  Caseli et al. 2009
    •  manually annotated corpus of syntactic and lexical simplifications
  o  WikiJunior
    •  illustrated books for children up to 12 years old
  o  Projeto PorPopular (Finatto et al. 2012)
    •  Tabloids for low literacy readers

Simple Corpora
•  Wikilivros Readability Corpus (WRC)
  o  Book library from Wikilivros
    •  L1: 33 books from 1st to 9th grades
    •  L2: 65 books from 10th to 12th grades
    •  L3: 21 books for college education
•  Readability Assessed WaC (RAW)
  o  readability assessment module (Wagner Filho et al. 2016)
    •  intermediate module of readability assessment
    •  several readability features used as features for classifier
  o  129,000 sentences from L1
    •  13.5 words per sentence
  o  236,000 sentences from L2
    •  15.2 words per sentence
  o  96,000 sentences from L3
    •  17.4 words per sentence

Multiword Expressions (MWEs)
•  For English
  o  NomLex, WordNet
  o  Verb-Particle Constructions (Baldwin 2005)
  o  Compound Nouns (Nakov 2010, Reddy et al. 2011, Yazdani et al. 2015, Ramisch et al. 2016)
•  For Portuguese
  o  Light Verb Constructions
    •  Duran et al. 2011
  o  Compound Nouns
    •  Noun Preposition Nouns from Europarl (Zilio et al. 2016)
      o  Parsing-based (FIPS) combined with Statistical (PMI)
    •  Noun Adjective (Cordeiro et al. 2016)

Simple Words Lists
•  Manually created lists
  o  For English
    •  Oxford 3000
  o  For Portuguese
    •  3,853 words from Oxford 3000 translation complemented with most frequent words in corpora (Finatto et al. 2013)

Lexical Substitution
•  Manual resources
  o  For English
    •  WordNet (Fellbaum 1998), Roget, Moby
  o  For Portuguese
    •  Onto.PT (Oliveira et al. 2010)
    •  OpenWN-PT (Paiva et al. 2012)
    •  MultiWordnet (Branco et al.)
    •  WordNet.PT (Marrafa, 2002)
    •  WordNet.Br (Dias da Silva et al., 2008)

Lexical Substitution
•  Distributional Semantic Models
  o  GloVe, word2vec, Minimantics, Dissect, LexVec
•  Quality Evaluation
  o  For English
    •  WordNet-Based Synonymy Test (WBST) (Freitag et al. 2005)
    •  Word similarity and analogy tasks
  o  For Portuguese
    •  BabelNet-Based Semantic Gold Standard (B2SG) (Wilkens et al. 2016)
      o  synonymy, antonymy and hypernymy for nouns and verbs

Semantic Role Labeling (SRL)
•  For word substitution in context
  o  For English
    •  FrameNet (Baker et al. 1998), PropBank (Kingsbury et al. 2002)
  o  For Portuguese
    •  PropBank.Br (Duran and Aluísio 2012), VerbNet.Br (Scarton 2013), and FrameNet Brasil (Salomao, 2009)
    •  VerbLexPor (Zilio et al. 2016)
      o  Cardiology papers vs. newspaper articles
        •  15,281 annotated arguments (4,192 in CARD and 11,089 in DG)

Resources in Numbers

Conclusion and Future Work
•  English and Portuguese
  o  Difference in magnitude and availability of manually constructed resources
•  Alternative: language-independent methods
  o  Extrapolate from manually created resources
  o  Corpora
    •  brWaC (Wagner Filho et al. 2016)
    •  Readability Assessed WaC (Wagner Filho et al. 2016, Wagner Filho et al. 2018)
  o  Distributional Semantic Models
    •  LexVec (Salle et al. 2016)
  o  Gold standards
    •  NC Compositionality Dataset (Cordeiro et al. 2016)
    •  NC LexSub (Cordeiro et al. 2017)
    •  BabelNet-Based Semantic Gold Standard (B2SG) (Wilkens et al. 2016)

Acknowledgments
•  This work has been funded by the
  •  Brazilian Agency CNPq (482520/2012-4 and 312114/2015-0)
  •  Projects “Simplificação Textual de Expressões Complexas”, sponsored by Samsung Eletrônica da Amazônia Ltda. under the terms of Brazilian federal law No. 8.248/91, and
  •  French Agence Nationale pour la Recherche through projects PARSEME-FR (ANR-14-CERA-0001) and ORFEO (ANR-12-CORP-0005), and by
  •  French-Brazilian cooperation projects CAMELEON (CAPES-COFECUB 707/11) and AIM-WEST (FAPERGS-INRIA 1706-2551/13-7).
