Text Classification 1
Prof. Sameer Singh
CS 295: STATISTICAL NLP
WINTER 2017
January 12, 2017
Based on slides from Nathan Schneider, Noah Smith, Dan Klein, and everyone else they copied from.
Introduction to Text Classification
Naive Bayes Classification
Course Projects
Sentiment Analysis
Filled with horrific dialogue, laughable characters, a laughable plot, and really no interesting stakes during this film, "Star Wars Episode I: The Phantom Menace" is not at all what I wanted from a film that is supposed to be the huge opening to the segue into the fantastic Original Trilogy. The positives include the score, the sound…
Other Examples
• Reviews of films, restaurants, products: positive vs. negative
• Amazon reviews data, IMDB reviews data
• Library-like subjects (e.g., the Dewey decimal system)
• News stories: politics vs. sports vs. business vs. technology ...
• 20 Newsgroups data
• Author attributes: identity, political stance, gender, age, ...
• Email: spam vs. not
• Gmail: important, promotion, updates, social media, …
• What is the reading level of a piece of text?
• Automatic graders?
• How influential will a scientific paper be?
• Advertisement recommendations…
• Will a piece of proposed legislation pass?
• Identify the presidential candidate from speeches
• Post recommendations / Fake news detection
• Can majorly influence the world!
Formal Setup
Classification
Supervised Learning
Training Algorithm
Evaluation: Contingency Table
Accuracy
Problem
• Class imbalance hurts.
• Getting one class right matters more than the other (retrieval)
Precision and Recall
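A minimal Python sketch of the contingency-table measures, assuming a binary task in which one class is treated as "positive". The function names and the toy spam/ham labels are illustrative assumptions, not taken from the slides. The two hypothetical predictors below reach the same accuracy on imbalanced data, yet differ in precision and recall.

```python
from collections import Counter

def contingency(gold, pred, positive):
    """Count tp, fp, fn, tn when `positive` is the class of interest."""
    counts = Counter()
    for g, p in zip(gold, pred):
        if p == positive:
            counts["tp" if g == positive else "fp"] += 1
        else:
            counts["fn" if g == positive else "tn"] += 1
    return counts

def accuracy(c):
    total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    return (c["tp"] + c["tn"]) / total if total else 0.0

def precision(c):
    # Of everything predicted positive, how much was right?
    denom = c["tp"] + c["fp"]
    return c["tp"] / denom if denom else 0.0

def recall(c):
    # Of everything truly positive, how much was found?
    denom = c["tp"] + c["fn"]
    return c["tp"] / denom if denom else 0.0

# Hypothetical imbalanced data: 2 spam messages out of 10.
gold = ["spam", "spam"] + ["ham"] * 8
always_ham = ["ham"] * 10                       # never predicts spam
some_spam = ["spam", "ham", "spam"] + ["ham"] * 7  # one hit, one miss, one false alarm

c_a = contingency(gold, always_ham, positive="spam")
c_b = contingency(gold, some_spam, positive="spam")
for name, c in [("always_ham", c_a), ("some_spam", c_b)]:
    print(name, accuracy(c), precision(c), recall(c))
```

Precision is defined here as 0 when nothing is predicted positive; that convention is a choice made for this sketch.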
>2 Classes?
Macro-averaged Measures
Micro-averaged Measures
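A short Python sketch contrasting the two averaging schemes, assuming single-label multi-class predictions. Macro-averaging computes the measure per class and then averages the classes equally; micro-averaging pools the counts over all classes before dividing. The function names and the toy news labels are assumptions made for illustration.

```python
from collections import Counter

def per_class_counts(gold, pred):
    """Per-class true positives, false positives, and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    return tp, fp, fn

def safe_div(num, den):
    return num / den if den else 0.0

def macro_micro(gold, pred):
    """Macro- and micro-averaged precision and recall (assumes non-empty input)."""
    tp, fp, fn = per_class_counts(gold, pred)
    labels = sorted(set(gold) | set(pred))
    total_tp = sum(tp.values())
    return {
        # Macro: average the per-class scores, every class weighted equally.
        "macro_p": sum(safe_div(tp[c], tp[c] + fp[c]) for c in labels) / len(labels),
        "macro_r": sum(safe_div(tp[c], tp[c] + fn[c]) for c in labels) / len(labels),
        # Micro: pool the counts first, so frequent classes dominate.
        "micro_p": safe_div(total_tp, total_tp + sum(fp.values())),
        "micro_r": safe_div(total_tp, total_tp + sum(fn.values())),
    }

# Hypothetical data: 8 sports stories, 2 politics stories, one politics story missed.
gold = ["sports"] * 8 + ["politics"] * 2
pred = ["sports"] * 8 + ["sports", "politics"]
scores = macro_micro(gold, pred)
print(scores)
```

With exactly one label per document, micro-averaged precision and recall both equal accuracy, so the macro scores are usually the more informative ones when rare classes matter.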
Statistical Significance
McNemar’s Test, Psychometrika (1947)
More tests in the Smith book, Appendix B
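A hedged Python sketch of McNemar's test for comparing two classifiers on the same test set. It assumes the chi-square form without continuity correction, (b - c)^2 / (b + c), where b and c count the examples that only one of the two systems gets right. The function name and the toy predictions are illustrative assumptions.

```python
import math

def mcnemar(gold, pred_a, pred_b):
    """Return (statistic, p_value) for the null hypothesis that
    systems A and B have the same error rate."""
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
    disagreements = only_a + only_b
    if disagreements == 0:
        return 0.0, 1.0  # the systems never disagree on correctness
    stat = (only_a - only_b) ** 2 / disagreements
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical outputs: A alone is right on 10 examples, B alone on 2.
gold = [1] * 12
pred_a = [1] * 10 + [0] * 2
pred_b = [0] * 10 + [1] * 2
stat, p_value = mcnemar(gold, pred_a, pred_b)
print(stat, p_value)
```

Examples that both systems get right, or both get wrong, do not enter the statistic. The chi-square approximation is unreliable when there are few disagreements; an exact binomial test on the same two counts is a common alternative in that case.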
Classification using Joint Probability
Naïve Bayes Classifier
Two assumptions
• Word ordering does not matter (Bag of Words)
• Words are independent given the category
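The two assumptions can be written out as a factorization of the joint probability. The notation is an assumption made here, since the slide's own equations did not survive extraction: a document is a sequence of words w_1, ..., w_n and y is a category.

```latex
P(w_1, \dots, w_n, y) = P(y) \prod_{i=1}^{n} P(w_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(w_i \mid y)
```

The product does not depend on the order of the words (bag of words), and each word contributes its own factor because words are taken to be independent given the category.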
Estimation of Parameters
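A minimal Python sketch of count-based parameter estimation for a multinomial Naive Bayes classifier. Several details are assumptions rather than content from the slides: documents are pre-tokenized lists of words, add-alpha smoothing (alpha = 1 by default) keeps unseen word/category pairs from getting zero probability, and words never seen in training are skipped at prediction time. The function names and toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log P(y) and log P(w | y) from counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # word_counts[y][w] = count of w in class y
    vocab = set()
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
        vocab.update(tokens)

    log_prior = {y: math.log(n / len(labels)) for y, n in class_counts.items()}
    log_lik = {}
    for y in class_counts:
        # Smoothed denominator: all word tokens in class y plus alpha per vocabulary word.
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_lik[y] = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(tokens, log_prior, log_lik):
    """Pick the category with the highest log joint probability."""
    scores = {
        y: log_prior[y] + sum(log_lik[y][w] for w in tokens if w in log_lik[y])
        for y in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy training set.
docs = [["great", "score"], ["horrific", "dialogue"],
        ["great", "fun"], ["laughable", "plot"]]
labels = ["pos", "neg", "pos", "neg"]
log_prior, log_lik = train_naive_bayes(docs, labels)
print(predict(["great", "plot", "great"], log_prior, log_lik))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.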
Problem with Naïve Bayes
Linear Models
Naïve Bayes as a Linear Model
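One way to see the connection, sketched under assumed notation (V is the vocabulary, count(w, x) is the number of times word w occurs in document x, and y is a category): taking the log of the Naive Bayes joint probability gives a score that is linear in the word counts.

```latex
\log P(x, y)
  = \log P(y) + \sum_{w \in V} \mathrm{count}(w, x) \, \log P(w \mid y)
  = b_y + \boldsymbol{\theta}_y^{\top} \mathbf{f}(x),
\quad \text{where } f_w(x) = \mathrm{count}(w, x), \;
\theta_{y,w} = \log P(w \mid y), \;
b_y = \log P(y)
```

Prediction is then the argmax over y of a linear function of the feature vector f(x), with the weights fixed by the estimated probabilities rather than learned directly.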
Group Projects
Groups for the Project
• Ideal team size is 3
• Absolute maximum of 4
• Fewer than 3 if I approve (ongoing work)
Submit Four Reports
• First two reports are very short (1 page)
• Final report matters the most
How do I know it’s NLP?
• Output is any phrase or sentence, definitely!
• Input is any phrase or sentence
• Output is a sequence or structure (yes!)
• Classification: only if over words or phrases
• Output is linguistic classes/structures (yes!)
Scope of Work
Novelty
• New Task/Data
• New Method/Models
• New Application of Existing Method to Existing Task
But not too much!
• You do not have much time!
• Aim to have the whole pipeline done soon
• Keep the “scale” of the data small, sub-sample if needed
• Better to have a complete finished report than grand ideas that did not work
Reuse
• You do not have to code everything
• Exploit existing code, datasets, libraries, web services
• Do not reinvent all the wheels!
Example 1: What’s the word..
What’s the word for someone using pretentious words? lexiphanic
definition of a word from the dictionary → Machine Learning (LSTM) → the word itself
This can be a cool Twitter bot!
Evaluation
• Accuracy of guessing the word, using definitions from a different dictionary?
• Baselines: Google, reversedictionary.org, …
Example 2: SQuAD
https://rajpurkar.github.io/SQuAD-explorer/
Tesla was the fourth of five children. He had an older brother named Dane and three sisters, Milka, Angelina and Marica. Dane was killed in a horse-riding accident when Nikola was five. In 1861, Tesla attended the "Lower" or "Primary" School in Smiljan where he studied German, arithmetic, and religion. In 1862, the Tesla family moved to Gospić, Austrian Empire, where Tesla's father worked as a pastor. Nikola completed "Lower" or "Primary" School, followed by the "Lower Real Gymnasium" or "Normal School."
How many siblings did Tesla have? four
What was Tesla’s brother’s name? Dane
What happened to Dane? killed in a horse-riding accident
Datasets and Papers
Data
• Search Kaggle, Quora, etc. for large text datasets
• See recent papers in NLP for released datasets
• Look for “shared tasks”, “challenges”, workshops
• Links to some existing datasets coming to the website soon
Papers
• NLP conferences: ACL, EMNLP, NAACL
• ML conferences: NIPS, ICML, ICLR, AAAI
• Data-focused venues: TREC/TAC, SemEval, CoNLL
• Workshops at these conferences: interesting directions
• More papers coming soon to the website
Writing the Pitch
Team
• Team name and members
• Single-sentence description for each member: (approximately) what they will do
• Single sentence on what makes your team diverse
Project
• Motivation and Problem Description
• Planned approach: tentative
• Evaluation: usually, most important
Appointment
• If your team has 1 or 2 members, meet me on or before January 17 (otherwise no need)
• Every group has to meet afterwards to discuss the project
Upcoming…
Homework
• Homework 1 is up!
• Next lectures will continue with more details
• Sign up for the Kaggle account (@uci.edu email)
• Due: January 26, 2017
Project
• Project pitch is due January 23, 2017!
• Start assembling teams now! (use Piazza)
• Start looking at papers, data, etc. for ideas