Text Classification & Linear Models · Text Classification & Linear Models CMSC 723 / LING 723 /...

Post on 09-Jun-2020


Text Classification & Linear Models
CMSC 723 / LING 723 / INST 725

Marine Carpuat

Slides credit: Dan Jurafsky & James Martin, Jacob Eisenstein

Logistics/Reminders

• Homework 1 – due Thursday, Sep 7 by 12pm.

• Project 1 coming up

• Thursday lecture time: project set-up office hour in CSIC 1121

Recap: Word Meaning

2 core issues from an NLP perspective

• Semantic similarity: given two words, how similar are they in meaning?
  • Key concepts: vector semantics, PPMI and its variants, cosine similarity

• Word sense disambiguation: given a word that has more than one meaning, which one is used in a specific context?
  • Key concepts: word sense, WordNet and sense inventories, unsupervised disambiguation (Lesk), supervised disambiguation

Today

• Text classification problems
  • and their evaluation

• Linear classifiers
  • Features & Weights
  • Bag of words
  • Naïve Bayes

Text classification

Is this spam?

From: "Fabian Starr" <Patrick_Freeman@pamietaniepeerelu.pl>
Subject: Hey! Sofware for the funny prices!

Get the great discounts on popular software today for PC and Macintosh
http://iiled.org/Cj4Lmx
70-90% Discounts from retail price!!!
All sofware is instantly available to download - No Need Wait!

What is the subject of this article?

MEDLINE Article → ? → MeSH Subject Category Hierarchy

• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …

Text Classification

• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …

Text Classification: definition

• Input:
  • a document d
  • a fixed set of classes Y = {y1, y2, …, yJ}

• Output: a predicted class y ∈ Y

Classification Methods: Hand-coded rules

• Rules based on combinations of words or other features
  • spam: black-list-address OR (“dollars” AND “have been selected”)

• Accuracy can be high
  • If rules are carefully refined by an expert

• But building and maintaining these rules is expensive
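The spam rule above can be written out as a small Python function. This is only a sketch: the function name `is_spam` and the contents of the blacklist are hypothetical and chosen for illustration.

```python
# Hypothetical blacklist; in practice an expert would maintain this by hand.
BLACKLISTED_ADDRESSES = {"spammer@example.com"}

def is_spam(sender: str, body: str) -> bool:
    """Hand-coded rule: black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in BLACKLISTED_ADDRESSES
    has_both_phrases = "dollars" in text and "have been selected" in text
    return on_blacklist or has_both_phrases
```

Every new spam pattern requires another hand-written condition like these, which is where the maintenance cost comes from.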

Classification Methods: Supervised Machine Learning

• Input
  • a document d
  • a fixed set of classes Y = {y1, y2, …, yJ}
  • a training set of m hand-labeled documents (d1, y1), …, (dm, ym)

• Output
  • a learned classifier d → y

Aside: getting examples for supervised learning

• Human annotation
  • By experts or non-experts (crowdsourcing)
• Found data

• How do we know how good a classifier is?
  • Compare classifier predictions with human annotation
  • On held-out test examples
  • Evaluation metrics: accuracy, precision, recall

The 2-by-2 contingency table

               correct   not correct
selected       tp        fp
not selected   fn        tn

Precision and recall

• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

               correct   not correct
selected       tp        fp
not selected   fn        tn
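In terms of the table cells, precision is tp / (tp + fp) and recall is tp / (tp + fn). A minimal sketch of computing both from predicted and gold labels follows. The function name, the toy labels, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall(predicted, gold, positive):
    """Return (precision, recall) for the class `positive`."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == positive and g == positive)
    fp = sum(1 for p, g in pairs if p == positive and g != positive)
    fn = sum(1 for p, g in pairs if p != positive and g == positive)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy held-out set (made up): human annotation vs. classifier output.
gold      = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "spam", "ham", "spam"]
precision, recall = precision_recall(predicted, gold, positive="spam")
```

In this toy set there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.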

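A combined measure: F

• A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$$

• People usually use the balanced F1 measure
  • i.e., with β = 1 (that is, α = ½):

F = 2PR / (P + R)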
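As a sketch, the weighted F measure can be computed in its β form, F = (β² + 1)PR / (β²P + R), which reduces to 2PR / (P + R) when β = 1. The function name and the choice to return 0.0 when the denominator is zero are assumptions.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F_beta)."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    # Assumed convention: F is 0.0 when the denominator vanishes.
    if denominator == 0.0:
        return 0.0
    return (b2 + 1) * precision * recall / denominator
```

Values of β above 1 weight recall more heavily, and values below 1 weight precision more heavily.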

Linear Classifiers

Bag of words

Defining features

Linear classification

Linear Models for Classification

Feature function representation

Weights
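A minimal sketch of how these pieces might fit together, assuming the common formulation in which a feature function f(x, y) pairs each bag-of-words count with a candidate label and the classifier predicts the label with the highest linear score θ · f(x, y). The helper names, the whitespace tokenizer, and the hand-set weights are all hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Map a document to word counts, ignoring word order."""
    return Counter(text.lower().split())

def feature_function(text, label):
    """f(x, y): pair every word count with the candidate label."""
    return {(label, word): count for word, count in bag_of_words(text).items()}

def score(weights, text, label):
    """Linear score: dot product of the weights and f(x, y)."""
    features = feature_function(text, label)
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def predict(weights, text, labels):
    """Return the label with the highest linear score."""
    return max(labels, key=lambda label: score(weights, text, label))

# Hypothetical hand-set weights, for illustration only.
weights = {
    ("spam", "discounts"): 2.0,
    ("spam", "software"): 1.0,
    ("ham", "meeting"): 2.0,
}
label = predict(weights, "great discounts on software", ["spam", "ham"])
```

Here the spam score is 2.0 + 1.0 = 3.0 and the ham score is 0.0, so `label` is "spam".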

How can we learn weights?

• By hand

• Probability
  • e.g., Naïve Bayes

• Discriminative training
  • e.g., perceptron, support vector machines

Generative Story for Multinomial Naïve Bayes

• A hypothetical stochastic process describing how training examples are generated
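One way to make such a process concrete is to sample from it. The sketch below assumes the usual multinomial Naïve Bayes story: first draw a label from a prior over classes, then draw each word independently from that label's word distribution. The parameter values and the fixed document length are made up for illustration.

```python
import random

def generate_document(label_probs, word_probs, length, rng):
    """Sample (label, words): draw a label, then draw words given that label."""
    labels = list(label_probs)
    label = rng.choices(labels, weights=[label_probs[y] for y in labels])[0]
    vocabulary = list(word_probs[label])
    word_weights = [word_probs[label][w] for w in vocabulary]
    words = rng.choices(vocabulary, weights=word_weights, k=length)
    return label, words

# Hypothetical parameters, for illustration only.
label_probs = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": {"discounts": 0.7, "meeting": 0.1, "software": 0.2},
    "ham": {"discounts": 0.1, "meeting": 0.6, "software": 0.3},
}
rng = random.Random(0)
sampled_label, sampled_words = generate_document(label_probs, word_probs, 5, rng)
```

The words are drawn independently given the label, which is the "naïve" assumption of the model.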

Prediction with Naïve Bayes

Score(x, y)
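A minimal sketch of this scoring step, assuming the standard form Score(x, y) = log P(y) + Σ_w count(w) · log P(w | y) and predicting the label with the highest score. The parameter values are made up, and skipping words that have no stored probability is an assumption of this sketch.

```python
import math
from collections import Counter

def nb_score(words, label, log_prior, log_likelihood):
    """Score(x, y): log prior plus count-weighted log word probabilities."""
    counts = Counter(words)
    total = log_prior[label]
    for word, count in counts.items():
        # Assumption: words without a stored probability are skipped.
        if word in log_likelihood[label]:
            total += count * log_likelihood[label][word]
    return total

def nb_predict(words, log_prior, log_likelihood):
    """Return the label with the highest Naïve Bayes score."""
    return max(log_prior, key=lambda y: nb_score(words, y, log_prior, log_likelihood))

# Hypothetical parameters, for illustration only.
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}
log_likelihood = {
    "spam": {"discounts": math.log(0.6), "meeting": math.log(0.4)},
    "ham": {"discounts": math.log(0.1), "meeting": math.log(0.9)},
}
predicted = nb_predict(["discounts", "discounts", "meeting"], log_prior, log_likelihood)
```

The score is linear in the word counts, with the log probabilities playing the role of weights, which is why Naïve Bayes counts as a linear classifier.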

Parameter Estimation

• “count and normalize”
  • Parameters of a multinomial distribution

• Relative frequency estimator
  • Formally: this is the maximum likelihood estimate

• See CIML for derivation
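“Count and normalize” can be sketched as follows: count labels and per-label word occurrences in the training set, then divide by the totals to get relative frequencies. The function name and the toy training set are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_parameters(documents):
    """documents: list of (words, label) pairs. Returns (prior, likelihood)."""
    label_counts = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)
    for words, label in documents:
        word_counts[label].update(words)  # count
    prior = {y: n / len(documents) for y, n in label_counts.items()}
    likelihood = {}
    for y, counts in word_counts.items():
        total = sum(counts.values())
        likelihood[y] = {w: c / total for w, c in counts.items()}  # normalize
    return prior, likelihood

# Toy training set (made up).
training = [
    (["cheap", "software"], "spam"),
    (["cheap", "cheap"], "spam"),
    (["meeting", "notes"], "ham"),
]
prior, likelihood = estimate_parameters(training)
```

With this toy data, "cheap" accounts for 3 of the 4 spam tokens, so its estimated probability under spam is 0.75. Any word never seen with a label gets no probability at all under that label.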

Smoothing (add alpha / Laplace)
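A minimal sketch of add-alpha smoothing for one class, assuming the usual estimate (count(w) + α) / (total count + α · |V|) over a fixed vocabulary V. With α = 1 this is Laplace smoothing. The function name and the toy counts are hypothetical.

```python
from collections import Counter

def smoothed_likelihood(word_counts, vocabulary, alpha=1.0):
    """Add-alpha estimate of P(word | class) over a fixed vocabulary."""
    total = sum(word_counts.get(w, 0) for w in vocabulary) + alpha * len(vocabulary)
    return {w: (word_counts.get(w, 0) + alpha) / total for w in vocabulary}

# Toy counts for one class (made up); "meeting" was never seen with this class.
spam_counts = Counter({"cheap": 3, "software": 1})
vocabulary = ["cheap", "software", "meeting"]
probs = smoothed_likelihood(spam_counts, vocabulary, alpha=1.0)
```

Here the denominator is 4 + 3 = 7, so the unseen word "meeting" gets probability 1/7 instead of zero, and the three probabilities still sum to 1.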

Naïve Bayes recap

Today

• Text classification problems
  • and their evaluation

• Linear classifiers
  • Features & Weights
  • Bag of words
  • Naïve Bayes