Commonsense for Machine Intelligence: Text to Knowledge and Knowledge to Text

Part 2: Detecting and Correcting Odd Collocations in Text

Introduction to Collocations

• Correct native speaker expression in a given language
• Strong tea (not powerful tea)
• Clear sky (not pure sky)
• Go home (not go to home)
• Go to school (not go school)
• House arrest (not arrest house)
• Friend circle (not circle friend)

Collocation Errors or Odd Collocations

• Expressions that may be grammatically correct, but are not typical among native speakers
• Red meat & white meat are correct collocations in English
• Their literal translations are odd collocations in German
• Not usually used by German speakers
• Machine translation can often cause such collocation errors
• Can be due to lack of commonsense & world knowledge

Collocations and Idioms

• Some collocations are idiomatic expressions: “couch potato”
• Literal idiom translation may be totally absurd: “sofa potato”
• Note: Correct idiom usage & translation is harder
• Not all collocations are idioms, e.g., “fast cars” (vs “quick cars”)
• Yet, correct collocation usage is important in many situations

Motivation to Address Collocations – Daily Communication

• Tourist wants “black coffee” (regular coffee without milk) in a coffee shop
• Asks for “dark coffee” using online translation help
• Server brings coffee with milk, made with darkest coffee beans available
• This is not what the tourist intended…
• What if he is lactose intolerant?
• Note: “Coffee Shop” in Amsterdam might mean something completely different: a place for drugs!
• Important to address collocations with commonsense & world knowledge

Motivation to Address Collocations – Written Texts

• Classic Bible quote also in Shakespeare’s Hamlet
• Literal machine translation can yield different meaning!
• Collocations, e.g., “willing spirit” & “weak flesh”, must be translated with commonsense & reference to context

Motivation to Address Collocations – Search Engines

• Odd collocation “quick cars” returns fewer hits & less appropriate results
• Correct collocation “fast cars” shows better sites & images of cars as good search results
• Machine translation help for search engines should fix collocation errors

Techniques to Address Odd Collocations

• Treatment of Collocations
  • Different types of oddly collocated terms
  • Examples of each type with problems caused
• Linguistic Classification
  • Classifying terms as correct vs incorrect collocations
  • Considering associations / using source language
• Detection and Correction
  • Finding various incorrectly collocated terms using frequency etc.
  • Providing correct responses, similarity measures, ranking the suggestions

Treatment of Collocations

• Collocation errors are typically treated in four categories
  • Insertion Errors: adding a wrong term
  • Deletion Errors: omitting a required term
  • Transposition Errors: changing order of terms
  • Substitution Errors: using one term instead of another
• We briefly describe each type with examples and the problems they could cause

Insertion Errors

• These include adding a term not appropriate in a correct native speaker expression

“I went to home” vs “I went home”
“When will you return back from Singapore?” vs “When will you return from Singapore?”
“Take a break for the lunch” vs “Take a break for lunch”

• Article errors are quite common in this category (adding unnecessary articles)
• Many of these errors involve grammatical mistakes
• These types of errors create problems in
  • Fluency of speech, especially at formal events
  • Clarity of written documents

Deletion Errors

• These are the opposite of insertion errors & involve missing a term needed in an expression

“Einstein was scientist” vs “Einstein was a scientist”
“Hire someone to do job” vs “Hire someone to do the job”
“Let us wait her” vs “Let us wait for her”

• They also create similar problems with respect to fluency and clarity
• Many deletion errors also pertain to odd use of articles (omitting a necessary one)
• Approaches in the literature for article error treatment are applicable here
• These also often pertain to grammatical mistakes

Transposition Errors

• These errors occur when terms are not placed in the appropriate order
• They could be more problematic than insertion & deletion errors

“Don’t talk with your full mouth” vs “Don’t talk with your mouth full”
“How to make friendships close” vs “How to make close friendships”

• They might convey the wrong meaning, e.g., talking with your full mouth is different from talking with your mouth full
• Sometimes it’s almost the opposite meaning, e.g., close friendships vs friendships close
• Often, knowing native language of speaker / origin of the source text might help here

Substitution Errors

• These involve using an inappropriate term in an expression instead of a term in correct usage

“This actor does money” vs “This actor makes money”
“Where is the nearest quick food place?” vs “Where is the nearest fast food place?”

• Most common type of collocation error
• Often cause miscommunication problems while talking, writing, searching etc.
• Many approaches in the literature address mainly substitution errors
• They can be potentially applied to address the other types as well
• Incorporation of commonsense knowledge is particularly useful here
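The four categories can be illustrated by comparing a learner phrase with its correction token by token. The following is a minimal sketch, assuming whitespace tokenization and a hypothetical helper `classify_error`; it is not taken from any of the systems cited in these slides, and pairs that differ by more than one edit simply fall through to "other".

```python
from collections import Counter
from difflib import SequenceMatcher


def classify_error(wrong: str, right: str) -> str:
    """Label a (wrong, right) phrase pair with one of the four error categories."""
    w, r = wrong.lower().split(), right.lower().split()
    if w == r:
        return "none"
    # Same words in a different order: transposition.
    if Counter(w) == Counter(r):
        return "transposition"
    # Collect the non-matching edit operations between the two token sequences.
    matcher = SequenceMatcher(a=w, b=r, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if len(ops) != 1:
        return "other"
    tag = ops[0][0]
    if tag == "delete":    # the wrong phrase has extra tokens
        return "insertion"
    if tag == "insert":    # the wrong phrase lacks tokens
        return "deletion"
    return "substitution"  # tag == "replace"


if __name__ == "__main__":
    pairs = [
        ("I went to home", "I went home"),
        ("Einstein was scientist", "Einstein was a scientist"),
        ("How to make friendships close", "How to make close friendships"),
        ("This actor does money", "This actor makes money"),
    ]
    for wrong, right in pairs:
        print(f"{wrong!r}: {classify_error(wrong, right)}")
```

Note the naming: an "insertion error" shows up as a delete operation when moving from the wrong phrase to the corrected one, and vice versa.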

Addressing Odd Collocations by Linguistic Classification

• Some works focus on classifying collocation errors from a linguistic perspective
• Using collocation measures on syntactic patterns for lexical classification as correctly collocated term vs error [Futagi et al., 2008]
• Considering source language (of ESL learner or machine generated text) to classify collocations [Dahlmeier, 2011]

Collocation Measures on Syntactic Patterns

• This work addresses 7 aspects of lexical collocations
• Collocation errors lexically classified using candidate word strings
• POS tagging of texts is conducted followed by pattern matching

[Futagi et al.]

Collocation Measures on Syntactic Patterns (Contd.)

• After spell checking, variants of word strings built with articles, synonyms etc.
• Word strings looked up in a reference DB (RRDB) to find a match
• If no match found, it is classified as a collocation error

[Futagi et al.]
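The lookup step on this slide can be sketched as follows. This is only an illustration under stated assumptions: `REFERENCE_DB` is a toy in-memory set standing in for the reference database, `ARTICLES` and both function names are hypothetical, and only article variants are built. Synonym variants are left out because the slide does not spell out how they feed into the final decision.

```python
from __future__ import annotations

ARTICLES = ("a", "an", "the")  # assumption: articles used to build variants

# Assumption: toy stand-in for the reference database of native-speaker strings.
REFERENCE_DB = {"strong tea", "take a break", "clear sky"}


def article_variants(words: list[str]) -> set[str]:
    """Return the string itself plus versions with one article dropped or added."""
    variants = {" ".join(words)}
    # Drop one existing article.
    for i, word in enumerate(words):
        if word in ARTICLES:
            variants.add(" ".join(words[:i] + words[i + 1:]))
    # Insert one article before each word.
    for i in range(len(words)):
        for article in ARTICLES:
            variants.add(" ".join(words[:i] + [article] + words[i:]))
    return variants


def is_collocation_error(candidate: str, db: set[str] = REFERENCE_DB) -> bool:
    """Flag the candidate when neither it nor any article variant is in the DB."""
    return not (article_variants(candidate.lower().split()) & db)


if __name__ == "__main__":
    for phrase in ("strong tea", "powerful tea", "take break"):
        print(phrase, "->", "error" if is_collocation_error(phrase) else "match")
```

In this sketch "take break" is not flagged, because the variant "take a break" is found; the missing article would be handled as an article error rather than as a collocation error.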

Collocation Measures on Syntactic Patterns (Contd.)

• Measure of collocation strength
  • Rank ratio statistic
  • Computed from 1 billion words of native speaker texts
  • Incorporating commonsense knowledge
• When evaluated against a gold standard from native speakers, this work gives around 85% precision in classification
• This work does not provide correct suggestions as responses to collocation errors

[Futagi et al.]

Source Language to Classify Collocations

• Errors often caused by semantic similarity of words in source language
• The source language is called the L1 language
• Literal translation to destination language can cause collocation errors
• Thus, L1 induced paraphrases are proposed for classifying collocations

• Over a dozen English translations: look, see, watch, read etc.
• Possible translation from source: “I like to look movies” vs “I like to watch movies”

[Dahlmeier et al.]

Source Language to Classify Collocations (Contd.)

• NUCLE: Annotated 1m word corpus of 1400 essays by ESL university students
• Annotated with start & end offset, error type, gold standard correction
• Incorporates commonsense knowledge from professional English instructors
• They filter out preposition & article errors, focus on collocations involving semantics

Statistics of NUCLE Analysis [Dahlmeier et al.]

Source Language to Classify Collocations (Contd.)

• Detected errors classified as: Spelling, Homophone, Synonyms, L1-transfer
  • Spelling: Edit dist. (erroneous phrase, correction) < threshold
  • Homophone: (erroneous word, correction) have same pronunciation
  • Synonym: (erroneous word, correction) have similar meaning
  • L1-transfer: (erroneous phrase, correction) share a common translation

[Dahlmeier et al.]
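The four tests on this slide can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the lookup tables are tiny hand-made stand-ins for a pronunciation dictionary, a synonym resource and a bilingual phrase table, the L1 key `"f1"` is a placeholder rather than a real word, and the edit-distance threshold of 3 is arbitrary. The sketch returns every label that applies to a pair.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


# Toy resources (assumptions, not data from the cited work).
PRONUNCIATION = {"their": "DH EH R", "there": "DH EH R"}
SYNONYMS = {frozenset(("big", "large"))}
L1_TRANSLATIONS = {"look": {"f1"}, "watch": {"f1"}, "see": {"f1"}}


def classify(err: str, fix: str, threshold: int = 3) -> list:
    """Return all error labels that hold for an (erroneous, correction) pair."""
    labels = []
    if edit_distance(err, fix) < threshold:
        labels.append("spelling")
    pron = PRONUNCIATION.get(err)
    if pron is not None and pron == PRONUNCIATION.get(fix):
        labels.append("homophone")
    if frozenset((err, fix)) in SYNONYMS:
        labels.append("synonym")
    if L1_TRANSLATIONS.get(err, set()) & L1_TRANSLATIONS.get(fix, set()):
        labels.append("l1-transfer")
    return labels


if __name__ == "__main__":
    for pair in (("look", "watch"), ("their", "there"), ("big", "large")):
        print(pair, classify(*pair))
```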

Source Language to Classify Collocations (Contd.)

• Number of errors in L1-transfer > other types
• Extract English-L1, L1-English phrases max 3 words
• Phrase extraction heuristic:
• Here, f: foreign language phrase
• Translation probabilities p(e1|f), p(f|e2) predicted by max likelihood estimation
• Only keep phrases with probability > threshold (0.001 in this work)
• This serves as the basis for suggesting corrections
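One way to read the bullets above is as pivot-based paraphrasing, where the probability of English phrase e1 as a paraphrase of e2 is obtained by summing over shared foreign phrases: p(e1|e2) = Σ_f p(e1|f) · p(f|e2). The sketch below assumes that formula, assumes the 0.001 threshold is applied to the combined score, and uses made-up probability tables with placeholder foreign phrases `f1` and `f2`; the function name is hypothetical.

```python
from collections import defaultdict


def l1_paraphrases(p_e_given_f, p_f_given_e, e2, threshold=0.001):
    """Rank English phrases e1 by p(e1 | e2) = sum_f p(e1 | f) * p(f | e2)."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(e2, {}).items():
        for e1, p_e1 in p_e_given_f.get(f, {}).items():
            if e1 != e2:  # a phrase is not a correction candidate for itself
                scores[e1] += p_e1 * p_f
    kept = [(e1, p) for e1, p in scores.items() if p > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up translation tables for illustration only.
P_F_GIVEN_E = {"look": {"f1": 0.6, "f2": 0.4}}
P_E_GIVEN_F = {
    "f1": {"watch": 0.5, "look": 0.3, "see": 0.2},
    "f2": {"look": 0.9, "glance": 0.1},
}

if __name__ == "__main__":
    for phrase, prob in l1_paraphrases(P_E_GIVEN_F, P_F_GIVEN_E, "look"):
        print(f"{phrase}: {prob:.3f}")
```

With these toy tables, "watch" is ranked first for the input "look", which mirrors the “I like to look movies” example.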

[Dahlmeier et al.]

Analysis of Collocation Errors

Discussion

• These research works clearly focus more on lexical classification of collocation errors
• Linguistic perspectives are significant here
• Commonsense knowledge is included in collocation error classification using corpora from native speakers / English instructors
• These works provide an insight into the reasons for collocation errors and their grammatical placements
• Such research heads towards proposing corrective measures

Collocation Error Detection and Correction

• These approaches develop tools for the actual detection and correction of collocation errors
• AwkChecker: While a user writes a text document, flag collocation errors and suggest replacements that correspond closely to consensus using word-level statistical n-grams [Park et al., 2008]
• CollOrder: When a user enters a term in the tool, detect collocation errors and provide correctly ordered collocated responses as outputs using an ensemble of similarity measures [Varghese et al., 2015]

AwkChecker

• End-user tool to correct collocation errors in written documents
• Users write text, Awkward phrases are Checked by highlighting them
• Users can click awkward phrases to see suggested replacements
• 1st ever tool for collocation error correction

AwkChecker’s user interface: A) Flagged phrases in the composition window B) Suggested replacement for “powerful tea” [Park et al.]

AwkChecker (Contd.)

• Builds statistical n-grams (sequences of n words) from training corpus & records frequencies
• Analyzes user input against corpus to find if a phrase is a collocation error
• Flags error if there exist similar phrases with frequency > input frequency
• Generates replacements using n-gram frequency based approach
• Candidates with much higher frequency are potential replacements

[Park et al.]
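The frequency-based flagging described on this slide can be sketched as follows. This is not AwkChecker's code: the five-sentence corpus is made up, "similar" is assumed to mean that exactly one word differs, and "much higher frequency" is assumed to mean at least `ratio` times the input frequency. All function names are hypothetical.

```python
from collections import Counter


def ngram_counts(sentences, n=2):
    """Count word-level n-grams over an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def is_similar(a, b):
    """Assumption: two n-grams are 'similar' if exactly one word differs."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def check_phrase(phrase, counts, ratio=2.0):
    """Return (flagged, replacements) for a phrase with the same n as `counts`."""
    key = tuple(phrase.lower().split())
    freq = counts.get(key, 0)
    # Similar phrases that are more frequent than the input trigger a flag.
    higher = [(c, f) for c, f in counts.items() if is_similar(key, c) and f > freq]
    # Only candidates that are much more frequent are offered as replacements.
    strong = sorted((cf for cf in higher if cf[1] >= ratio * max(freq, 1)),
                    key=lambda cf: cf[1], reverse=True)
    return bool(higher), [" ".join(c) for c, _ in strong]


# Made-up training corpus for illustration.
CORPUS = [
    "I drink strong tea",
    "she likes strong tea",
    "strong tea is good",
    "he made powerful tea",
    "a powerful engine",
]
BIGRAMS = ngram_counts(CORPUS)

if __name__ == "__main__":
    print(check_phrase("powerful tea", BIGRAMS))
    print(check_phrase("strong tea", BIGRAMS))
```

A domain-specific corpus can be swapped in for `CORPUS` without changing the functions, which is the point of the frequency-based design.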

AwkChecker (Contd.)

• Statistical n-grams are used over relevant corpora including Wikipedia
• Helpful in capturing commonsense with domain-specific knowledge using frequency-based approach
• Example: Referring to a medical corpus to flag phrases awkward in medical research writing
• Assumption: Relevant corpora are correct more frequently than they are incorrect
• Evaluation reveals usefulness in collocation correction, but details of accuracy not discussed

[Park et al.]

CollOrder

• Detects & corrects collocation errors in terms input to the tool
• Outputs ranked responses of correctly collocated terms
• Correct collocations source: ANC/BNC (American/British National Corpus)
• Includes commonsense knowledge from native speakers’ writings
• Useful in Web queries, text documents, ESL translation etc.

Approach in the CollOrder tool [Varghese et al.]

CollOrder (Contd.)

• Ensemble of measures is used for similarity search and ranking
• Conditional Probability: Measures relative occurrence of terms A & B
• Jaccard’s Coefficient: Measures extent of semantic similarity between A & B
• WebJaccard: To reduce adverse effects of random co-occurrence (due to scale & noise in Web data) [Bollegala et al., 2009]

[Varghese et al.]
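The three measures can be written in terms of occurrence counts, as in the sketch below. It uses the usual textbook forms: conditional probability as the co-occurrence count divided by the count of B (the direction is an assumption), Jaccard as intersection over union, and WebJaccard as Jaccard set to zero when the co-occurrence count does not exceed a cutoff `c`. The cutoff value, the counts and the function names are illustrative assumptions, not CollOrder's actual settings.

```python
def conditional_probability(f_ab, f_b):
    """Estimate P(A | B) as count(A and B) / count(B)."""
    return f_ab / f_b if f_b else 0.0


def jaccard(f_a, f_b, f_ab):
    """Jaccard coefficient: co-occurrences over occurrences of either term."""
    union = f_a + f_b - f_ab
    return f_ab / union if union > 0 else 0.0


def web_jaccard(f_a, f_b, f_ab, c=5):
    """Jaccard that treats co-occurrence counts of at most c as random noise."""
    if f_ab <= c:
        return 0.0
    return jaccard(f_a, f_b, f_ab)


def rank_candidates(counts, c=5):
    """Order candidates by WebJaccard; counts maps candidate -> (f_a, f_b, f_ab)."""
    scores = {cand: web_jaccard(*triple, c=c) for cand, triple in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up counts: (count of modifier, count of "tea", count of both together).
TOY_COUNTS = {
    "powerful tea": (900, 800, 4),
    "strong tea": (1000, 800, 300),
}

if __name__ == "__main__":
    print(rank_candidates(TOY_COUNTS))
    print(conditional_probability(300, 800))
```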

CollOrder (Contd.)

• These & other measures (Frequency Normalized, Frequency Ratio) are used [Varghese et al., 2015]
• Different measures empirically yield good results in different scenarios
• Ensemble of measures with classifiers thus proposed to optimize performance
• Classifier used: JRIP, implementation of RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [Cohen, 1995]
• CollOrder evaluation with MTurk on native speakers: Average accuracy 92.44%

Example of ensemble learning by the classifier: “blue sky” is a valid suggestion, classified as “y”; “night sky” is not a valid suggestion, classified as “n” [Varghese et al.]

Other Related Works

• [Ramos et al., 2010] build annotation schema with 3D typology to classify collocations mainly in Spanish & English translation:
  • 1st dimension finds if error is for whole or part of collocation
  • 2nd dimension does language-oriented error analysis
  • 3rd dimension does interpretive error analysis
• [Li et al., 2009] use a probabilistic approach for collocation correction:
  • Use BNC and WordNet as language learning sources
  • Suggest corrections based on commonly used expressions
  • Do not develop a tool for collocation detection & correction

Discussion

• Collocation error correction tools in the literature are found useful by users
• Commonsense knowledge from native speakers is typically entailed in the source corpora used for learning
• Approaches in linguistic classification as well as in collocation correction rely heavily on frequency
• Thus, potential issues related to sparse data with correct collocations call for further research

Text to Knowledge and Knowledge to Text

• Collocation approaches start with text and extract knowledge from corpora
• Different methods used for knowledge extraction - probabilistic, ensemble
• Extracted knowledge used for linguistic classification, error correction
• Statistical text categorization occurs due to analysis in linguistic classification
• Correctly collocated text responses offered as suggestions in error correction
• Thus, extracted knowledge serves to provide text based outputs
• Commonsense knowledge plays a role mainly in source corpora from native speakers & expert writings
• This contributes to machine intelligence by providing better machine translation incorporating commonsense

References

• Bollegala, D., Matsuo, Y. and Ishizuka, M., Measuring the similarity between implicit semantic relations using web search engines, WSDM 2009, pp. 104–113.
• Cohen, W., Fast effective rule induction. In Proceedings of the International Conference on Machine Learning, ICML 1995, pp. 115–123.
• Dahlmeier, D. and Ng, H. T., Correcting semantic collocation errors with L1-induced paraphrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, pp. 107–117.
• Futagi, Y., Deane, P., Chodorow, M. and Tetreault, J., A computational approach to detecting collocation errors in the writing of non-native speakers of English, Computer Assisted Language Learning 2008, 21(4):353–367.
• Li-E, L. A., Wible, D. and Tsao, N-L., Automated suggestions for miscollocations, Proceedings of the 4th Workshop on Innovative Use of NLP for Building Educational Applications, 2009, pp. 47–50.
• Park, T., Lank, E., Poupart, P. and Terry, M., Is the sky pure today? AwkChecker: An assistive tool for detecting and correcting collocation errors, ACM Symposium on User Interface Software and Technology 2008, pp. 121–130.
• Ramos, M. A., Wanner, L., Vincze, O., del Bosque, G. C., Veiga, N. V., Suárez, E. M. and González, S. P., Towards a Motivated Annotation Schema of Collocation Errors in Learner Corpora, LREC 2010, pp. 3209–3214.
• Varghese, A., Varde, A., Peng, J. and Fitzpatrick, E., A framework for collocation error correction in Web pages and text documents, ACM SIGKDD Explorations 2015, 17(1):14–23.