Building Resources for Human and Computational Language Processing of Portuguese
Sílvio Cordeiro, Carlos Ramisch, Marco Idiart, Rodrigo Wilkens, Leonardo Zilio, Jorge Wagner, Aline Villavicencio
Federal University of Rio Grande do Sul (Brazil)
Aix Marseille Université, CNRS, LIF UMR 7279 (France)
University of Essex (UK)
MWE-aware processing with the mwetoolkit
Multiword Expressions in a Nutshell
• A combination of words that must be treated as a unit at some level of linguistic processing (Calzolari et al., 2002)
o Compound nouns
o Verb-particle constructions
o Light-verb constructions
o Idioms
Multiword Expressions in a Nutshell
• Lexical, syntactic, semantic, pragmatic, statistical idiosyncrasies
o ad hoc, wine and dine (Kim and Baldwin 2010)
• Arbitrariness and institutionalisation
o salt and pepper, ?pepper and salt (Smadja, 1993)
• Frequency
o Same order of magnitude as words in mental lexicon (Jackendoff, 1997)
• Limited lexical, syntactic and semantic variability
o kick the bucket/?pail/?container (Sag et al., 2002)
MWEs and NLP
• Real understanding requires MWE-aware corpus processing
1. Corpus processing
2. MWE discovery from corpus (to build MWE lexica)
3. MWE representation (in lexicon and grammar)
4. MWE token identification in corpus (to annotate MWEs)
5. MWE semantic processing
6. MWE integration in applications
mwetoolkit
• Language independent framework for MWE processing
• Extracts MWEs from corpora
• Annotates corpora with MWEs
• Calculates AMs
• Pre-processes MWEs in corpora for DSM construction
• Imports DSMs (word2vec, GloVe, PPMI)
• Provides functions for vector combinations
• Calculates compositionality
• Evaluates against gold standard
Project CAPES-COFECUB (France-Brazil)
Overview of the mwetoolkit
MWE Type Discovery
• Candidate extraction
o Pattern-based heuristics (e.g. noun-noun, verb-particle…)
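To make the pattern-based heuristic concrete, here is a minimal sketch of candidate extraction over POS-tagged text. It is not the mwetoolkit's own interface: the function name `extract_candidates`, the tag labels and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter

def extract_candidates(tagged_sentences, pattern):
    """Count contiguous word sequences whose POS tags match `pattern`.

    `tagged_sentences` is a list of sentences, each a list of (word, tag) pairs.
    `pattern` is a tuple of tags, e.g. ("NOUN", "NOUN") for noun-noun compounds.
    """
    counts = Counter()
    n = len(pattern)
    for sentence in tagged_sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
                counts[tuple(word.lower() for word, _ in window)] += 1
    return counts

# Toy input with hypothetical tag labels.
sentences = [
    [("The", "DET"), ("access", "NOUN"), ("road", "NOUN"), ("is", "VERB"), ("closed", "ADJ")],
    [("An", "DET"), ("access", "NOUN"), ("road", "NOUN"), ("and", "CONJ"),
     ("a", "DET"), ("brick", "NOUN"), ("wall", "NOUN")],
]
candidates = extract_candidates(sentences, ("NOUN", "NOUN"))
```

The resulting counts are the kind of raw frequencies that association measures would then be computed from.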
MWE-aware corpus processing
• MWE token identification
o Pattern-based heuristics
o Contiguous/gappy identification
o Shortest/longest/all match distances
o Projecting extracted MWE types back in source corpus
Annotation Options
• Corpus but no MWE list:
o Generate MWE list from corpus and project it back
• Corpus → MWE list → Annotated Corpus
• Corpus and MWE list
o Annotation based on external lists of MWEs
• Corpus + MWE list → Annotated Corpus
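The "Corpus + MWE list → Annotated Corpus" path can be sketched as projecting a list of MWE types onto a tokenized sentence. The sketch below covers only contiguous occurrences with a longest-match policy; gappy matching and the shortest/all-match policies are left out. The function name `annotate_longest_match` and the example data are assumptions, not the mwetoolkit's actual implementation.

```python
def annotate_longest_match(tokens, mwe_list):
    """Return (start, end) token spans of contiguous MWE occurrences.

    At each position the longest MWE from `mwe_list` is preferred, and
    matched tokens are not reused for another match.
    """
    mwes = {tuple(word.lower() for word in mwe) for mwe in mwe_list}
    max_len = max((len(mwe) for mwe in mwes), default=0)
    lowered = [token.lower() for token in tokens]
    spans = []
    i = 0
    while i < len(lowered):
        match_end = None
        # Try the longest possible window first, down to two tokens.
        for n in range(min(max_len, len(lowered) - i), 1, -1):
            if tuple(lowered[i:i + n]) in mwes:
                match_end = i + n
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end
    return spans

tokens = "He met the loan shark near the access road".split()
mwe_list = [("loan", "shark"), ("access", "road")]
spans = annotate_longest_match(tokens, mwe_list)
```

Each span uses Python slice conventions, so `tokens[start:end]` recovers the annotated expression.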
MWE semantic processing
• Meaning of MWE may not be understood from meaning of individual words
o brick wall is a wall made of bricks
o cheese knife is not a knife made of cheese → knife for cutting cheese (Girju et al., 2005)
o loan shark is not a shark for loan but a person who offers loans at extremely high interest rates
How to detect compositionality?
• Distributional Semantic Models (DSMs)
o Position words in multidimensional semantic space
• Each word/MWE represented as a vector in the semantic space
o Proximity in space indicates semantic relatedness
[Figure: example compounds (cloud nine, access road, grandfather clock) placed on a scale from compositionality to idiomaticity]
How to detect compositionality?
• Cosine similarity between the MWE vector and the sum of the vectors of the component words
o The closer the vectors are, the more compositional they are (Reddy et al. 2011)
o cos(w1w2 vector, w1 vector + w2 vector)
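The score cos(w1w2 vector, w1 vector + w2 vector) can be sketched in a few lines of plain Python. The three-dimensional vectors below are invented toy values chosen to show the two extremes; real DSM vectors would have hundreds of dimensions, and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compositionality(mwe_vec, w1_vec, w2_vec):
    """cos(w1w2 vector, w1 vector + w2 vector)."""
    composed = [a + b for a, b in zip(w1_vec, w2_vec)]
    return cosine(mwe_vec, composed)

# Toy vectors (made up): the compound vector equals the sum of its parts.
compositional_score = compositionality([1.0, 1.0, 2.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0])

# Toy vectors (made up): the compound vector is orthogonal to both parts.
idiomatic_score = compositionality([0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

With these toy values the first score is at the top of the range and the second at zero, which mirrors the compositional and idiomatic ends of the scale.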
Distributional Semantic Models
• Techniques and tools for constructing DSMs
o Dissect (Dinu et al., 2013), Minimantics (Ramisch et al. 2013), word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).
Minimantics
word2vec
dissect
LexVec (Lexical Vectors)
Gold Standards for Evaluation
• Roller et al. (2013): 244 German compounds
o around 30 judgments by crowdsourcing, scale from 1 to 7
• Farahmand et al. (2015): 1,042 English compounds
o 4 expert judges, binary scale for non-compositionality and conventionality
• Reddy et al. (2011): 90 English compounds
o around 30 judgments by crowdsourcing, scale from 0 to 5
DSMs and Compositionality
• Dataset of nominal compounds with human judgments about literality/compositionality
o 180 compounds for English, French and Portuguese
o Resource freely available
• http://pageperso.lif.univ-mrs.fr/~carlos.ramisch/?page=downloads/compounds&lang=en
DSMs and Compositionality
• Dataset of Lexical Substitution of Nominal Compounds in Portuguese (LexSubNC)
o 180 compounds for Portuguese
o Resource freely available
• http://pageperso.lif.univ-mrs.fr/~carlos.ramisch/?page=downloads/compounds&lang=en
Collecting Human Judgments
• Judgments with Likert scale (0 to 5)
o For compound
o For w1 and w2 separately
• Agreement for Portuguese
o For subset of annotators
• α = .52 for head
• α = .36 for modifier
• α = .42 for compound
o Same annotator after 1 month: 0.59 for compound
Collecting Human Judgments - Agreement
• Greater agreement between score for compound and head (or modifier) for extremes
o totally idiomatic and fully compositional
• For PT and FR, compound score determined by score of the least literal word
Agreement
• Most/least variation in scores (average ± σ score)
The models
• WaCky Corpora (Baroni et al., 2009):
o ukWaC for English (∼2 billion tokens)
o frWaC (∼1.6 billion tokens) for French
o brWaC (∼2.3 billion tokens) for Portuguese (Wagner Filho et al. 2016)
o Pre-processing
• surface+: the original corpus
• surface: with stopword removal
• lemma: stopword removal and lemmatization
• lemmaPOS: stopword removal, lemmatization and POS-tagging
o Context window size: 1, 4 and 8
o Dimension size: 250, 500, 750
• DSMs
o PPMI models – positive PMI (Minimantics)
o GloVe (Pennington et al. 2014)
o word2vec (Mikolov et al 2013) Skipgram, CBOW
o LexVec
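As an illustration of what a PPMI model with a given context window computes, the sketch below counts co-occurrences inside a symmetric window and keeps only positive PMI values. It is a generic textbook formulation, not the Minimantics implementation; the function names and the toy corpus are assumptions.

```python
import math
from collections import Counter

def cooccurrences(sentences, window):
    """Count (target, context) pairs within `window` tokens on each side."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(target, tokens[j])] += 1
    return pair_counts

def ppmi(pair_counts):
    """Positive PMI: max(0, log2(P(t, c) / (P(t) * P(c))))."""
    total = sum(pair_counts.values())
    target_totals = Counter()
    context_totals = Counter()
    for (target, context), n in pair_counts.items():
        target_totals[target] += n
        context_totals[context] += n
    scores = {}
    for (target, context), n in pair_counts.items():
        pmi = math.log2(n * total / (target_totals[target] * context_totals[context]))
        scores[(target, context)] = max(0.0, pmi)
    return scores

# Toy corpus; window=1 corresponds to the smallest window size listed above.
corpus = [["salt", "and", "pepper"], ["salt", "and", "pepper"], ["wine", "and", "dine"]]
counts = cooccurrences(corpus, window=1)
scores = ppmi(counts)
```

Each target word's row of PPMI scores over all contexts is its vector; the dimension sizes listed above would apply after reducing or truncating those rows, a step this sketch does not cover.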
[Figure panels: French, Portuguese, English]
Resources for Text Simplification for Portuguese
Rodrigo Wilkens, Leonardo Zilio, Marco Idiart, Jorge Wagner Filho, Eduardo Ferreira, Luis Mollmann, Bianca Pasqualini, Aline Villavicencio
Federal University of Rio Grande do Sul (Brazil)
Text Simplification (TS)
• TS quality dependent on resources
o English:
• Corpora:
o Simple English Wikipedia parallel corpus, with alignments between the Simple and the Standard English Wikipedia
o Penn Treebank, British National Corpus, ukWaC
• Resources for Lexical Substitutions: WordNet, Roget, Moby
• Gold Standard Substitution Lists: SemEval Lexical Substitution Task
o Portuguese:
• Corpora
o PorSimples (Aluísio et al)
• Thesauri
o WordNet.Pt, OntoPT
Text Simplification
• Two main tasks (Shardlow, 2014):
o lexical simplification (LS)
• replacing complex expressions with simpler synonyms
o syntactic simplification (SS)
• changing the structure of a sentence by using simpler syntactic constructions (Siddharthan, 2002)
o TS with MT techniques for monolingual translation
• learning alignments between simple and standard sentences
General Corpora
• WaCky (Baroni et al. 2009)
o ukWaC (Baroni et al. 2009)
o brWaC (Boos et al. 2014, Wagner Filho et al. 2018)
• Crawling, from medium frequency content words as seeds
o Linguateca Corpora Frequency List
• Cleaning
o HTML and boilerplate stripping, using density metrics and shallow text features
• Near-duplicate detection and removal
o pairwise comparison of all documents
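The slides say only that near-duplicates are found by pairwise comparison of all documents. One common way to do such a comparison, shown here purely as an assumption, is Jaccard similarity over word n-gram shingles; the shingle size, the threshold and all function names below are illustrative choices, not the brWaC pipeline's settings.

```python
from itertools import combinations

def shingles(text, n=3):
    """Set of word n-grams; a text shorter than n words yields one short shingle."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def near_duplicates(docs, threshold=0.8, n=3):
    """Index pairs (i, j) of documents whose shingle sets overlap heavily."""
    sets = [shingles(doc, n) for doc in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]

docs = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog tonight",
    "a completely different page about portuguese corpora",
]
pairs = near_duplicates(docs, threshold=0.7)
```

Comparing every pair is quadratic in the number of documents, so at web-corpus scale an approximate scheme would normally replace the exhaustive loop.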
Simple Corpora
• For English
o Simple English Wikipedia aligned with English Wikipedia (Coster and Kauchak, 2011)
• For Portuguese
o Coleção É Só o Começo (Wilkens et al. 2014)
• 5 books manually simplified by linguists
o Caseli et al. 2009
• manually annotated corpus of syntactic and lexical simplifications
o WikiJunior
• illustrated books for children up to 12 years old
o Projeto PorPopular (Finatto et al. 2012)
• Tabloids for low literacy readers
Simple Corpora
• Wikilivros Readability Corpus (WRC)
o Book library from Wikilivros
• L1: 33 books from 1st to 9th grades
• L2: 65 books from 10th to 12th grades
• L3: 21 books for college education
• Readability Assessed WaC (RAW)
o readability assessment module (Wagner Filho et al. 2016)
• intermediate module of readability assessment
• several readability features used as features for classifier
o 129,000 sentences from L1
• 13.5 words per sentence
o 236,000 sentences from L2
• 15.2 words per sentence
o 96,000 sentences from L3
• 17.4 words per sentence
Multiword Expressions (MWEs)
• For English
• NomLex, WordNet
• Verb-Particle Constructions (Baldwin 2005)
• Compound Nouns (Nakov 2010, Reddy et al 2011, Yazdani et al 2015, Ramisch et al 2016)
• For Portuguese
• Light Verb Constructions
o Duran et al. 2011
• Compound Nouns
o Noun Preposition Nouns from Europarl (Zilio et al. 2016)
• Parsing-based (FIPS) combined with Statistical (PMI)
o Noun Adjective (Cordeiro et al. 2016)
Simple Words Lists
• Manually created lists
o For English
• Oxford 3000
o For Portuguese
• 3,853 words from Oxford 3000 translation complemented with most frequent words in corpora (Finatto et al. 2013)
Lexical Substitution
• Manual resources
o For English
• WordNet (Fellbaum 1998), Roget, Moby
o For Portuguese
• Onto.PT (Oliveira et al. 2010)
• OpenWN-PT (Paiva et al. 2012)
• MultiWordnet (Branco et al.)
• WordNet.PT (Marrafa, 2002)
• WordNet.Br (Dias da Silva et al., 2008)
Lexical Substitution
• Distributional Semantic Models
o GloVe, word2vec, Minimantics, Dissect, LexVec
• Quality Evaluation
o For English
• WordNet-Based Synonymy Test (WBST) (Freitag et al. 2005)
• Word similarity and analogy tasks
o For Portuguese
• BabelNet-Based Semantic Gold Standard (B2SG) (Wilkens et al. 2016)
o synonymy, antonymy and hypernymy for nouns and verbs
Semantic Role Labeling (SRL)
• For word substitution in context
o For English
• FrameNet (Baker et al. 1998), PropBank (Kingsbury et al. 2002)
o For Portuguese
• PropBank.Br (Duran and Aluísio 2012), VerbNet.Br (Scarton 2013), and FrameNet Brasil (Salomao, 2009).
• VerbLexPor (Zilio et al. 2016)
o Cardiology papers vs. newspaper articles
• 15,281 annotated arguments (4,192 in CARD and 11,089 in DG)
Resources in Numbers
Conclusion and Future Work
• English and Portuguese
o Difference in magnitude and availability of manually constructed resources
• Alternative: language independent methods
o Extrapolate from manually created resources
o Corpora
• brWaC (Wagner Filho et al. 2016)
• Readability Assessed WaC (Wagner Filho et al. 2016, Wagner Filho et al. 2018)
o Distributional Semantic Models
• LexVec (Salle et al. 2016)
o Gold standards
• NC Compositionality Dataset (Cordeiro et al. 2016)
• NC LexSub (Cordeiro et al. 2017)
• BabelNet-Based Semantic Gold Standard (B2SG) (Wilkens et al. 2016)
Acknowledgments
• This work has been funded by the
• Brazilian Agency CNPq (482520/2012-4 and 312114/2015-0)
• Projects “Simplificação Textual de Expressões Complexas”, sponsored by Samsung Eletrônica da Amazônia Ltda. under the terms of Brazilian federal law No. 8.248/91, and
• French Agence Nationale pour la Recherche through projects PARSEME-FR (ANR-14-CERA-0001) and ORFEO (ANR-12-CORP-0005), and by
• French-Brazilian cooperation projects CAMELEON (CAPES-COFECUB 707/11) and AIM-WEST (FAPERGS-INRIA 1706-2551/13-7).