Stats 170A: Project in Data Science
Text Analysis and Classification
Padhraic Smyth, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine
(Acknowledgements to Prof Mark Steyvers and Prof Sameer Singh, UCI, for various slides)
Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018
Reading, Homework, Lectures
• Reference reading:
  – Python
    • http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
• Homework 7
  – Due by 2pm Wednesday next week
• Next Lectures
  – Today: text analysis and classification
  – Next week: discussion of projects
Room for improvement….
Predicting the Sentiment of Tweets
“Documents”                                                        Class Labels
wow :o this was included in the playlist #awesome                  positive
I could not be happier with my life right now                      positive
HAPPY BIRTHDAY TO THE PANDERIFIC PANDAIK. @TheOrionSound           positive
I'm tired as hell I never get a off day during the week anymore    negative
The movie was really dull and stupid                               negative
@washingtonpost @silvajanes How awful!!!!!!                        negative
Predicting the Scores of Movie Reviews
“Documents”                                                        Ratings
I liked this movie. Well done. Great acting and direction          4
Terrific movie, saw it 3 times. Loved the scenes in Mexico.        5
Boring as hell. Who gets paid to create this stuff?                1
Not one of the director’s best efforts. Tom Cruise was terrible.   2
Entity Recognition
“Documents”                                                        Ratings
I liked this movie. Well done. Great acting and direction          4
Terrific movie, saw it 3 times. Loved the scenes in Mexico.        5
Boring as hell. Who gets paid to create this stuff?                1
Not one of the director’s best efforts. Tom Cruise was terrible.   2
Examples of User Location Fields in Twitter
User location: Deutschland
User location: Mountain View, CA
User location: Florida, USA
User location: United States
User location: West Virginia, Appalachia
User location: Columbia Mo
User location: South Florida
User location: Germany/Berlin
User location: St Louis, MO
User location: St. Louis
User location: Central Virginia
User location: United States
User location: Sea Holme (of Norse legend) Oz
User location: California, USA
User location: USA
User location: Cambodia
User location: San Antonio, TX
User location: Between my ears
User location: Chicago, IL
Basic Concepts in Text Representation
Representing Text Documents Numerically
• How do we represent text (strings) for input to machine learning algorithms? (which generally require numeric representations)
• Define a vocabulary = set of words or terms
• Simple method: Bag of Words
  – Document represented as a vector of counts of words in the vocabulary
• More complex: Real-Valued Embeddings
  – Document (or word) represented as a real-valued vector in “embedding space”
• Even more complex: Sequential Representations
  – Sequential state-machine model for sequences of words
Example of Bag-of-Words Matrix
      database   SQL   index   calculus   derivative   function   Label
d1          24    21       9          0            0          3       1
d2          32    10       5          0            3          0       1
d3          12    16       5          0            0          0       1
d4           6     7       2          0            0          0       1
d5          43    31      20          0            3          0       1
d6           2     0       0         18            7         16       2
d7           0     0       1         32           12          0       2
d8           3     0       0         22            4          2       2
d9           1     0       0         34           27         25       2
d10          6     0       0         17            4         23       2
Concepts and Terminology
• Document:
  – Collection of words
    • A book, a news article, a report, a Web page, an email, a tweet, etc
  – May contain both text and metadata.
  – Examples of metadata: author name(s), date, where published, etc
Note that the definition of a document is flexible, e.g., a book could be a single document, or each section of a book could be considered a “document”
• Corpus: a collection of documents
  – e.g., all news articles from the Los Angeles Times since 1990
  – e.g., all Wikipedia Web pages
  – e.g., all Yelp reviews for restaurants in Chicago
  – e.g., a random sample of Tweets from Dec 2017
Concepts and Terminology
• Tokens:
  – Groupings of characters in the raw text
  – individual words (or “word types”) + possibly numbers, punctuation, etc
• Words or Terms
  – Unique tokens (or combinations of tokens) that are meaningful
  – e.g., words such as “cat”, “dog”, terms such as “U.S” or “92697”
  – Can also include bigrams (“San Diego”), trigrams (“New York City”), etc
• Vocabulary
  – The specific set of unique terms used by an algorithm or application
  – The English language has on the order of 1 million unique words
  – In a particular application we might use a vocabulary of only 10k to 50k terms
    • e.g., relevant/common words (unigrams)
    • Bigrams, trigrams, …, n-grams can also be part of the vocabulary
Typical Pipeline for Building a Text Classifier
Document (string) → Tokenization → Stop Word Removal → Vocabulary Definition → Create Bag of Words → Build Text Classifier
Note that steps in the pipeline, and how these are implemented (e.g., how tokenization is done), will vary from application to application depending on the problem
Example of a Text Analysis Pipeline (for Information Extraction)
From http://www.nltk.org/book/ch07.html
Example
Document = ‘Chapter 1: The Beginning. In the beginning, life was tough!..........’
Tokens = {‘chapter’, ‘1’, ‘the’, ‘beginning’, ‘in’, ‘the’, ‘beginning’, ‘life’, ‘was’, …} (Punctuation and whitespaces are usually ignored)
Vocabulary = {‘chapter’, ‘1’, ‘the’, ‘beginning’, ‘in’, ‘life’, ‘was’, ‘tough’, …}
Bag of Words = {[‘chapter’, 1], [‘1’, 1], [‘the’, 2], [‘beginning’, 2], …}
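The three steps in this example (tokens, vocabulary, bag of words) can be sketched in a few lines of Python. This is only a minimal illustration: the tokenization rule (lowercase, keep runs of letters or digits) and the helper name `simple_tokenize` are assumptions made here, not functions from NLTK or scikit-learn.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text and keep unbroken runs of letters or digits (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

document = "Chapter 1: The Beginning. In the beginning, life was tough!"

tokens = simple_tokenize(document)   # ordered list, duplicates kept
vocabulary = sorted(set(tokens))     # unique terms
bag_of_words = Counter(tokens)       # term -> count

print(tokens)
print(vocabulary)
print(bag_of_words.most_common(3))
```

Note that the token list keeps word order and duplicates, while the bag of words keeps only the counts.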
Tokenization
• Split up text into individual tokens (words/terms)
• Simplest approach is to ignore all punctuation and whitespace and use only unbroken strings of alphabetic characters as tokens
If you had a magic potion I’d love to have it.
→ If you had a magic potion I’d love to have it
Issues in Tokenization
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
From https://web.stanford.edu/~jurafsky/slp3/ Speech and Language Processing, 3rd ed, Jurafsky and Martin
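To see why these cases are tricky, here is a small sketch of the "simplest approach" (keep only unbroken strings of alphabetic characters) applied to a few of the examples. The function name `alpha_tokenize` is hypothetical, and straight apostrophes are used in the test strings.

```python
import re

def alpha_tokenize(text):
    """Simplest approach: keep only unbroken strings of alphabetic characters."""
    return re.findall(r"[A-Za-z]+", text)

examples = ["Finland's capital", "isn't", "Hewlett-Packard", "state-of-the-art", "m.p.h."]
for text in examples:
    print(repr(text), "->", alpha_tokenize(text))
```

Contractions and abbreviations are broken into fragments such as "s", "t", or single letters, which is one reason to prefer an existing tokenizer over a hand-written rule.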
Tokenization Software
• Instead of writing your own tokenizer with a complex set of rules, use existing software
  – e.g., tokenizer function in NLTK
  – e.g., tokenizer from Stanford’s natural language group
• Practical tip:
  – It's useful to keep different representations of the data to use later on, e.g.,
    • Data with original sequence and formatting, tokenized list, bag of words, etc
  – Sequential order of words is needed for detecting n-grams.
  – Punctuation can contain useful information:
    If you had a magic potion I’d love to have it . If that makes sense
    If we decide to extract n-grams later on, we know that “it” and “if” should not be combined. So it's useful to retain this information.
Sentence Detection: Example using a Decision Tree
From https://web.stanford.edu/~jurafsky/slp3/ Speech and Language Processing, 3rd ed, Jurafsky and Martin
Example: Tokenization of Tweets
There are special-purpose tokenizers for different types of text, e.g., for Tweets
    from nltk.tokenize import TweetTokenizer
    from nltk.tokenize import word_tokenize

    tknzr = TweetTokenizer()

    # tweets is assumed to be a table (e.g., a pandas DataFrame) with a 'text' column
    for i in range(10):
        print("NLTK Twitter Tokenizer: ", tknzr.tokenize(tweets.text[i]))
        print("NLTK Standard Tokenizer:", word_tokenize(tweets.text[i]))
Sample Output from the 2 Tokenizers
NLTK Twitter Tokenizer: ['RT', '@ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who', '…']
NLTK Standard Tokenizer: ['RT', '@', 'ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who…']
NLTK Twitter Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https://t.co/olA35uCt89']
NLTK Standard Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https', ':', '//t.co/olA35uCt89']
NLTK Twitter Tokenizer: ['RT', '@IndivisibleTeam', ':', "Here's", 'what', 'you', 'need', 'to', 'know', 'about', '#ReleaseTheMemo', '�', '1', '⃣', 'It', '’', 's', 'not', 'really', 'a', 'memo', '.', "They're", 'talking', 'points', 'by', 'Devin', 'Nunes', '…']
NLTK Standard Tokenizer: ['RT', '@', 'IndivisibleTeam', ':', 'Here', "'s", 'what', 'you', 'need', 'to', 'know', 'about', '#', 'ReleaseTheMemo�', '1⃣', 'It’s', 'not', 'really', 'a', 'memo', '.', 'They', "'re", 'talking', 'points', 'by', 'Devin', 'Nunes…']
Example: Defining a Vocabulary
Raw text (a string in Python): raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"
There are 14 word tokens in the string raw1 (if we ignore punctuation and spaces): The, dog, chased, the, cat, and, the, mouse, Why, did, the, dog, do, this
The vocabulary (the unique tokens, normalizing to lowercase) is: the, dog, chased, cat, and, mouse, why, did, do, this. The vocabulary size is 10.
The counts for a bag-of-words representation are: the (4), dog (2), chased (1), cat (1), and (1), mouse (1), why (1), did (1), do (1), this (1)
If we remove stopwords we decrease our vocabulary size to 4: dog (2), chased (1), cat (1), mouse (1)
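The counts in this example can be checked with a short Python sketch. The regular-expression tokenizer and the small stopword set below are assumptions chosen for this one sentence (NLTK's stopword list is much longer).

```python
import re
from collections import Counter

raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

# Tokenize: ignore punctuation and spaces, normalize to lowercase
tokens = re.findall(r"[a-z]+", raw1.lower())
counts = Counter(tokens)

# Small illustrative stopword set (an assumption made for this sentence)
stopwords = {"the", "and", "why", "did", "do", "this"}
content_counts = {w: c for w, c in counts.items() if w not in stopwords}

print(len(tokens))      # number of word tokens
print(len(counts))      # vocabulary size
print(counts)
print(content_counts)   # bag of words after stopword removal
```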
Defining the Vocabulary (after Tokenization)
• Vocabulary
  – Set of terms (words) used to construct the document-term matrix
• Basic approach: use single words (unigrams) as terms
• Remove very common terms (e.g., stopwords)
• Remove very rare terms: e.g., remove all terms that occur in fewer than K documents in the corpus (e.g., K = 10)
  – Gets rid of misspellings, unusual names, etc
• Can extend term list with n-grams
  – Frequent word combinations (2-grams, 3-grams, …)
    “feel good” / “New York City”
[Figure: frequency of word in English usage (vertical axis) versus rank of word by frequency (horizontal axis), showing a very long “tail” of rare words. Graph from www.prismnet.com/~dierdorf/wordfrequency.png]
Stopword Removal
• Remove words that are likely to be irrelevant to our problem
• Keep content words (typically verbs, adverbs, nouns, adjectives)
Example:
If you had a magic potion I’d love to have it . If that makes sense
But what about this?
[Prince Hamlet] To be or not to be ...
NLTK Stop Word List
Note: in many applications there may be additional domain-specific “stopwords” that are very common and that we may want to remove since they contain little information, e.g., the term “restaurant” in reviews
N-grams
• Useful n-grams are groups of n words that commonly appear in sequence
  – “Computer Science”
  – “Los Angeles”
  – “New York City”
  – Note that we can have both terms like “computer” and “computer science”, e.g., “I am studying computer science at UCI and I bought a new computer last week.”
  – Adding an n-gram as a term in the vocabulary may be useful in a model
  – e.g., for a restaurant review classification problem, “wait time” may be more informative than “time”
Automatically Finding N-grams
- e.g., Bigrams
- Keep track of all unique pairs that occur sequentially (exclude stopwords)
  - “I visited Microsoft Research for a computer science job interview last week.”
  - “* visited Microsoft Research * * computer science job interview last week.”
- Rank by frequency of occurrence (or number of docs a bigram appears in)
- Can also rank by other metrics
- Keep the top K in the vocabulary
  - How large should K be? Depends on the application, amount of data, etc
  - Might need to search over different values of K (e.g., using cross-validation)
- Same idea for tri-grams, 4-grams, etc.
- See also http://www.nltk.org/howto/collocations.html in NLTK
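A minimal sketch of the bigram procedure described above, in plain Python. The stopword set, the second toy document, and the function name `count_bigrams` are all assumptions for illustration; NLTK's collocations module offers fuller functionality, including other ranking metrics.

```python
from collections import Counter

# Small illustrative stopword set (an assumption, not NLTK's list)
STOPWORDS = {"i", "for", "a", "is", "the"}

def count_bigrams(token_lists, stopwords=STOPWORDS):
    """Count adjacent word pairs, skipping any pair that contains a stopword."""
    counts = Counter()
    for tokens in token_lists:
        for first, second in zip(tokens, tokens[1:]):
            if first in stopwords or second in stopwords:
                continue
            counts[(first, second)] += 1
    return counts

docs = [
    "i visited microsoft research for a computer science job interview last week".split(),
    "the computer science department is hiring".split(),  # hypothetical second document
]

bigram_counts = count_bigrams(docs)
K = 3
top_k = bigram_counts.most_common(K)   # rank by frequency, keep the top K
print(top_k)
```

Because stopwords stay in place while pairs are formed, words on either side of a stopword (such as "research" and "computer") are never joined into a bigram.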
Example of Bag-of-Words Matrix
      database   SQL   index   calculus   derivative   function   Label
d1          24    21       9          0            0          3       1
d2          32    10       5          0            3          0       1
d3          12    16       5          0            0          0       1
d4           6     7       2          0            0          0       1
d5          43    31      20          0            3          0       1
d6           2     0       0         18            7         16       2
d7           0     0       1         32           12          0       2
d8           3     0       0         22            4          2       2
d9           1     0       0         34           27         25       2
d10          6     0       0         17            4         23       2
Use labeled training data (supervised learning) to learn a classifier – Homework Assignment 7
Overview of Text Classification
Text Classification
• Text classification has many applications
  – Spam email detection
  – Classifying news articles, e.g., Google News
  – Classifying Web pages into categories
• Data Representation
  – “Bag of words/terms” most commonly used: either counts or binary
  – Can also use other weighting and additional features (e.g., metadata)
• Classification Methods
  – Naïve Bayes: widely used baseline
    • Fast and reasonably accurate
  – Logistic Regression
    • Widely used in industry, accurate, excellent baseline
  – Neural networks and deep learning
    • Can be very accurate
    • Can require very large amounts of labeled training data
    • More complex than other methods
Example of Document by Term Matrix
      predict   finance   stocks   goal   score   team   Class Label
d1          3         1        4      0       0      1             1
d2          2         2        1      0       1      0             1
d3          1         1        2      0       0      0             1
d4          1         2        0      0       0      0             1
d5          3         1        1      0       1      0             1
d6          1         0        0      1       3      1             2
d7          0         0        1      4       1      0             2
d8          2         0        0      1       2      1             2
d9          1         0        0      2       1      2             2
d10         1         0        0      1       3      1             2
Possible Weights for a Linear Classifier
         predict   finance   stocks   goal   score   team   Class Label
Weight       0.1       2.5      1.0   -2.0    -0.2   -1.5
d1             3         1        4      0       0      1             1
d2             2         2        1      0       1      0             1
d3             1         1        2      0       0      0             1
d4             1         2        0      0       0      0             1
d5             3         1        1      0       1      0             1
d6             1         0        0      1       3      1             2
d7             0         0        1      4       1      0             2
d8             2         0        0      1       2      1             2
d9             1         0        0      2       1      2             2
d10            1         0        0      1       3      1             2
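To see how these weights separate the two classes, the sketch below computes the weighted sum of term counts for each document. The decision rule (score above zero means class 1, otherwise class 2, with no bias term) is an assumption made here for illustration; the slide only lists the weights.

```python
terms = ["predict", "finance", "stocks", "goal", "score", "team"]
weights = [0.1, 2.5, 1.0, -2.0, -0.2, -1.5]

# (document, term counts, class label) rows from the document-by-term matrix
docs = [
    ("d1", [3, 1, 4, 0, 0, 1], 1),
    ("d2", [2, 2, 1, 0, 1, 0], 1),
    ("d3", [1, 1, 2, 0, 0, 0], 1),
    ("d4", [1, 2, 0, 0, 0, 0], 1),
    ("d5", [3, 1, 1, 0, 1, 0], 1),
    ("d6", [1, 0, 0, 1, 3, 1], 2),
    ("d7", [0, 0, 1, 4, 1, 0], 2),
    ("d8", [2, 0, 0, 1, 2, 1], 2),
    ("d9", [1, 0, 0, 2, 1, 2], 2),
    ("d10", [1, 0, 0, 1, 3, 1], 2),
]

def score(counts):
    """Linear score: sum over terms of weight * count."""
    return sum(w * x for w, x in zip(weights, counts))

def predict(counts):
    """Assumed rule: positive score -> class 1, otherwise class 2."""
    return 1 if score(counts) > 0 else 2

for name, counts, label in docs:
    print(name, round(score(counts), 2), predict(counts), label)
```

With these weights every class 1 document gets a positive score and every class 2 document a negative one, driven mainly by "finance" and "stocks" versus "goal" and "team".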
Real Example from Yelp Data
Yelp Dataset
Number of Reviews: 706,693
Number of Reviews w/o Neutral Rating: 595,468
Number of Tokens: 85,392,376
Vocabulary Size w/o Stopwords: 176,114
Array Dimensions: (595468, 176114)
Number of cells in the Array: 104,870,251,352
Non-zero entries: 28,357,001
Density: 0.027%
Example of a Pipeline for Document Classification
Training: Training Documents (corpus) → Tokenization → Lists of Tokens → Stopword and rare word removal, Frequency Counts → Vocabulary → Bag of Words → Machine Learning Algorithm → Document Classifier
Prediction: New Document → Tokenization → Lists of Tokens → Bag of Words → Document Classifier → Label Prediction
Key Steps in Document Analysis Pipelines (for Bag of Words)
• Tokenization
  – Various options (e.g., with punctuation, non-alphanumeric symbols, etc)
• Vocabulary definition
  – N-grams, stopword removal, rare word removal, stemming
• Feature definition
  – Binary (term present or not?)
  – Counts
  – Weighted counts, e.g., TF-IDF (see later in the slides)
• Classifier selection
  – Naïve Bayes, logistic, SVMs, neural networks, etc
TF-IDF Weighting of Features
In practice the inputs can be weighted
  – It can be helpful to use “TF-IDF weights” instead of counts
TF(t,d) = term frequency = count = number of times term t occurs in doc d
IDF(t) = inverse document frequency = log(N / number of docs with term t)
(where N = total number of docs in the corpus)
TF-IDF(t,d) = TF(t,d) * IDF(t)
The IDF term has the effect of upweighting terms that occur in few docs
TF-IDF Example
N = 1000 in a corpus of news articles
Term 1: t = “city”, appears in 500 documents
IDF(t) = log(1000/500) = log(2) = 1 (log is base 2; the choice of base is not important)
Term 2: t = “freeway”, appears in 10 documents
IDF(t) = log(1000/10) = log(100) = 6.64
So occurrences of “freeway” will get upweighted by a factor of 6.64 compared to occurrences of “city”
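The arithmetic in this example can be reproduced with a few lines of Python. This is a sketch of the plain definition with a base-2 logarithm; the function names `idf` and `tf_idf` are hypothetical.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency with a base-2 logarithm."""
    return math.log2(n_docs / doc_freq)

def tf_idf(term_count, n_docs, doc_freq):
    """TF-IDF weight: raw count in the document times the term's IDF."""
    return term_count * idf(n_docs, doc_freq)

N = 1000
idf_city = idf(N, 500)      # "city" appears in 500 documents
idf_freeway = idf(N, 10)    # "freeway" appears in 10 documents

print(round(idf_city, 2), round(idf_freeway, 2))
print(round(tf_idf(3, N, 10), 2))   # a document that mentions "freeway" 3 times
```

Library implementations often use a smoothed variant with a natural logarithm (scikit-learn's `TfidfTransformer` does so by default), so values from a library will generally differ from this hand calculation.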
Classification Methods for Text
• Logistic regression – widely used for bag-of-words
  – Input dimensionality (vocabulary size) is high (e.g., 5k to 500k)
  – regularization is useful
• Recurrent neural networks are widely used for sequential models (see later)
  – More complex, but can pick up sequential information
From: http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html
From: https://github.com/TeamHG-Memex/eli5
Sentiment Analysis
Very useful tutorial, many practical tips and ideas
Online at http://sentiment.christopherpotts.net/
Example: Predicting if Review Text is Positive or Negative
• Two basic approaches
  – Use lexicons of positive and negative words
  – Use labeled data to learn a classification model
  (can also combine both approaches)
• Challenges, e.g., how to handle negation?
  – Positive reviews:
    • “This movie is not depressing the way I thought it would be. I don’t think I have anything negative to say about it.”
    • “I wasn’t disappointed at all with the acting – in fact the acting was excellent.”
  – Negative reviews:
    • “Sitting through this movie was not something I enjoyed doing”
    • “This wasn’t a very good script. The director is usually excellent (some really great movies over the years that I really enjoyed), but this leaves me cold”
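A minimal sketch of the lexicon approach shows why negation is a challenge. The two word lists below are tiny hypothetical lexicons invented for this illustration (real lexicons contain thousands of entries), and `lexicon_score` is a hypothetical helper.

```python
import re

# Tiny hypothetical lexicons, for illustration only
POSITIVE = {"excellent", "great", "enjoyed", "good", "loved"}
NEGATIVE = {"dull", "stupid", "boring", "terrible", "depressing", "negative", "disappointed"}

def lexicon_score(text):
    """Number of positive words minus number of negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

reviews = [
    "The movie was really dull and stupid",
    "Sitting through this movie was not something I enjoyed doing",
    "I wasn't disappointed at all with the acting - in fact the acting was excellent.",
]
for review in reviews:
    print(lexicon_score(review), review)
```

The second review is negative but gets a positive score because "not" is ignored, and the third (positive) review nets out to zero because "disappointed" is counted as negative despite the "wasn't".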
Negation Marking (during Tokenization)
• Idea: append a _NEG suffix to every word appearing between a negation and a clause-level punctuation mark
From http://sentiment.christopherpotts.net/lingstruc.html
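The idea can be sketched as a single pass over an already tokenized sentence. The negation cues and the set of clause-level punctuation marks below are simplified assumptions, not the exact lists from the tutorial, and `mark_negation` here is a hand-written helper for illustration.

```python
import re

# Simplified, assumed definitions of negation cues and clause-level punctuation
NEGATION = re.compile(r"^(?:not|no|never|nothing|nowhere|none|cannot)$|n't$")
CLAUSE_PUNCT = re.compile(r"^[.:;!?]$")

def mark_negation(tokens):
    """Append _NEG to tokens between a negation cue and the next clause-level punctuation."""
    marked, in_scope = [], False
    for tok in tokens:
        if CLAUSE_PUNCT.match(tok):
            in_scope = False              # punctuation ends the negated span
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
            if NEGATION.search(tok.lower()):
                in_scope = True           # start marking after the cue
    return marked

tokens = ["This", "wasn't", "a", "very", "good", "script", ".",
          "The", "director", "is", "usually", "excellent"]
print(mark_negation(tokens))
```

After marking, "good_NEG" becomes a separate vocabulary term from "good", so a classifier can learn different weights for the two.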
Examples of Negation Marking
From http://sentiment.christopherpotts.net/lingstruc.html
Improvements in Classification Accuracy
From http://sentiment.christopherpotts.net/lingstruc.html
Sentiment Lexicons (Dictionaries)
• Counts, percentages, weights of words in various categories,
  – e.g., degree to which a word tends to be positive or negative
• Example Lexicons
  – General Inquirer (http://www.wjh.harvard.edu/~inquirer)
    • Words categorized according to Positive/Negative, Strong vs Weak, Active vs Passive, etc
  – SentiWordNet (http://sentiwordnet.isti.cnr.it/)
    • Synsets in WordNet 3.0 annotated for degrees of positivity, negativity, and neutrality/objectiveness
  – Linguistic Inquiry and Word Count (http://www.liwc.net/)
  – NRC Word-Emotion Association Lexicon (next slide)
Yelp Challenge Data Set
Jupyter Notebook Example with Yelp Dataset
Vector Representations for Text
Vector Representations allow Generalization
• Example:
  Cat:    [1.5, 1.5, -0.2, 0.0, 0.0, 0.0]
  Kitten: [1.4, 1.5, -0.2, 0.0, 0.0, 0.0]
  Pet:    [2.0, 1.6, -0.3, 0.0, 0.0, 0.0]
• If the word “cat” occurs in a document then we know that other words like “kitten” and “pet” are similar
• Why is this useful in learning?
  – Consider a classifier based on weights (e.g., logistic regression, neural network)
  – Traditional approach: each word has its own separate weight
  – If we represent words by their vectors, and learn weights on the vectors, then words that are close together will produce similar outputs
  – This can help with generalization
Another Example of Word Embedding
Publicly-Available Pre-Trained Word Embeddings
• Word2Vec: https://code.google.com/archive/p/word2vec/
  – Google News dataset, 3M vocab (words and phrases), from 100B tokens
  – Entity vectors: 1.4M vectors for entities, trained on 100B words from news articles, entity names from Freebase
• GloVe: http://nlp.stanford.edu/projects/glove/
  – Wikipedia: 400k vocab, 50d to 300d vectors, based on 6B tokens
  – Common Crawl: 2.2M vocab, 300d vectors, from 840B tokens
  – Twitter: 1.2M vocab, 25d to 200d vectors, based on 2B tweets
• Note: you may want to consider using these in your projects
  – Easier and faster than training your own embedding model
Cosine Distance between two Vectors
[Figure: vectors x, y, and z plotted against axes Component 1 and Component 2]
In this example, y and z are more similar under cosine similarity than y and x because the angle between y and z is much smaller
Cosine distance = 1 - cosine similarity
Cosine similarity: measures the cosine of the angle between 2 vectors, where cos(0°) = 1 and cos(90°) = 0
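Cosine similarity is simple to compute directly. The sketch below uses the toy “cat” and “kitten” vectors from the earlier generalization example; the third vector, `unrelated`, is a hypothetical vector added here for contrast.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)

cat = [1.5, 1.5, -0.2, 0.0, 0.0, 0.0]
kitten = [1.4, 1.5, -0.2, 0.0, 0.0, 0.0]
unrelated = [0.0, 0.0, 0.0, 1.0, 2.0, 0.5]   # hypothetical vector, orthogonal to "cat"

print(cosine_similarity(cat, kitten))     # close to 1: small angle
print(cosine_similarity(cat, unrelated))  # 0: vectors at 90 degrees
print(cosine_distance(cat, kitten))
```

Because the measure depends only on the angle, scaling a vector (for example, doubling all its components) does not change its cosine similarity to other vectors.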
Examples of Similarity between Words
Examples from https://code.google.com/archive/p/word2vec/
[Figures: lists of words ranked by cosine similarity to “France” and to “San Francisco”]
Most Similar Vectors to ‘cat’ from word2vec 300d embeddings
Rank   Word               Cosine Similarity
1      cat                1.00
2      cats               0.81
3      dog                0.76
4      kitten             0.75
5      feline             0.73
6      beagle             0.72
7      puppy              0.71
8      pup                0.69
9      pet                0.69
10     felines            0.68
11     chihuahua          0.67
12     pooch              0.67
13     kitties            0.67
14     dachshund          0.67
15     poodle             0.66
16     stray_cat          0.66
17     Shih_Tzu           0.66
18     tabby              0.66
19     basset_hound       0.65
20     golden_retriever   0.65
Most Similar Vectors to ‘oxycontin’ from word2vec 300d embeddings
Rank   Word                 Cosine Similarity
1      oxycontin            1.00
2      Oxycontin            0.79
3      Oxycodone            0.71
4      OxyContin            0.69
5      morphine_pills       0.67
6      hydrocodone          0.67
7      Lortab               0.67
8      oxycodone            0.65
9      OxyContin_pills      0.65
10     Hydrocodone          0.65
11     Alprazolam           0.64
12     Oxycodone_pills      0.63
13     Xanax                0.62
14     Lorcet               0.62
15     Percocet_pills       0.61
16     Valium               0.61
17     Diazepam             0.60
18     methadone_pills      0.60
19     hydrocodone_pills    0.60
20     prescription_pills   0.59
Most Similar Vectors to ‘iran’ from word2vec 300d embeddings
Rank   Word          Cosine Similarity
1      iran          1.00
2      pakistan      0.66
3      israel        0.64
4      lebanon       0.62
5      russia        0.62
6      iraq          0.61
7      egypt         0.60
8      Irans         0.59
9      afghanistan   0.59
10     Iran          0.59
11     arabs         0.58
12     israeli       0.57
13     obama         0.56
14     arab          0.56
15     iraqi         0.56
16     japan         0.56
17     sri_lanka     0.56
18     america       0.55
19     korea         0.55
20     cuba          0.55
Most Similar Vectors to ‘Iran’ from word2vec 300d embeddings
Rank   Word                 Cosine Similarity
1      Iran                 1.00
2      Tehran               0.83
3      Iranian              0.82
4      Islamic_republic     0.81
5      Islamic_Republic     0.81
6      Iranians             0.78
7      Teheran              0.76
8      Syria                0.75
9      Ahmadinejad          0.69
10     Irans                0.67
11     Larijani             0.66
12     North_Korea          0.64
13     Mottaki              0.63
14     clerical_regime      0.63
15     nuclear_ambitions    0.63
16     Khamenei             0.63
17     Ayatollah_Khamenei   0.63
18     Velayati             0.63
19     Ahmedinejad          0.63
20     Rafsanjani           0.62
Recurrent Neural Networks for Sequential Data
Network Model for Predicting the Next Word in a Sentence
Sentence: The dog saw the cat on the wall.
[Figure: network diagram. Binary input with “one-hot” encoding over the vocabulary (the, dog, saw, cat, on, wall, .), e.g., 1000000 and 0100000; hidden unit activations h1, h2 (real-valued); binary target output with “one-hot” encoding over the same vocabulary]
Standard Recurrent Neural Network
Figures from http://cs231n.stanford.edu/slides/2017/
Different to a “feedforward” network: the hidden unit at position t now also has input from the hidden vector at time t-1
State Computation in a Neural Network
Figures from http://cs231n.stanford.edu/slides/2017/
Key points:
- fw in each position is the RNN state
- This state is updated recursively from the previous state and the current input
- The x->f and f->h arrows are weight matrices
- The same weight matrices are (usually) used across the sequence
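The recursive state update can be sketched in plain Python. This assumes the common formulation h_t = tanh(W_xh x_t + W_hh h_{t-1}) with no bias term; the tanh nonlinearity, the random weights, and all names below are assumptions for illustration, since the slides do not spell out the exact update function.

```python
import math
import random

def matvec(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_step(W_xh, W_hh, h_prev, x):
    """One state update: h_t = tanh(W_xh x_t + W_hh h_{t-1}) (assumed form)."""
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h_prev))]
    return [math.tanh(p) for p in pre]

vocab = ["the", "dog", "saw", "cat", "on", "wall", "."]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

random.seed(0)
hidden_size = 2   # two hidden units, as in the earlier next-word figure
W_xh = [[random.uniform(-1, 1) for _ in vocab] for _ in range(hidden_size)]
W_hh = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]

h = [0.0] * hidden_size
states = []
for word in ["the", "dog", "saw", "the", "cat"]:
    h = rnn_step(W_xh, W_hh, h, one_hot(word))   # same weights at every position
    states.append(h)
    print(word, [round(v, 3) for v in h])
```

The word "the" appears twice but produces different states, because each state also depends on the previous state and therefore on the words that came before.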
Learning the Weights in a Recurrent Network
Figures from http://cs231n.stanford.edu/slides/2017/
Learning:
- loss function compares predictions and targets
- compute gradient (based on error) and update weights
Example: Prediction of Sequences of Characters
Figures from http://cs231n.stanford.edu/slides/2017/
Simulating Sequences from an RNN
Figures from http://cs231n.stanford.edu/slides/2017/
Output from an RNN Model Trained on Shakespeare
KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.
Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.
DUKE VINCENTIO: Well, your wit is in the care of side and that.
Examples from “The Unreasonable Effectiveness of Recurrent Neural Networks”, Andrej Karpathy, blog, http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Output from an RNN Model Trained on Cooking Recipes
From https://gist.github.com/nylki/1efbaa36635956d35bcc
Different Recurrent Network Models
[Figure panels: Image classification; Image captioning; Sentiment analysis; Machine translation; Synced sequence (video classification)]
Part-of-Speech Tagging
Part of Speech (POS) Tagging
• Common POS categories (or tags) in English:
  – Noun, verb, article, preposition, pronoun, adverb, conjunction, interjection
• However there are many more specialized categories
  – E.g., proper nouns: e.g., ‘Toronto’, ‘Smith’, ….
  – E.g., comparative adjective: e.g., ‘bigger’, ‘smaller’, …
  – E.g., symbol: ‘3.12’, ‘$’, …
• Assigning POS categories to words in text is known as tagging
Universal Tagset (as used in NLTK)
12 universal tags:
VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation
See "A Universal Part-of-Speech Tagset" by Slav Petrov, Dipanjan Das and Ryan McDonald for more details: http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/
Universal Tagset (used in NLTK)
Tag    Meaning               English Examples
ADJ    adjective             new, good, high, special, big, local
ADP    adposition            on, of, at, with, by, into, under
ADV    adverb                really, already, still, early, now
CONJ   conjunction           and, or, but, if, while, although
DET    determiner, article   the, a, some, most, every, no, which
NOUN   noun                  year, home, costs, time, Africa
NUM    numeral               twenty-four, fourth, 1991, 14:24
PRT    particle              at, on, out, over per, that, up, with
PRON   pronoun               he, their, her, its, my, I, us
VERB   verb                  is, say, told, given, playing, would
.      punctuation marks     . , ; !
X      other                 ersatz, esprit, dunno, gr8, univeristy
(from Section 2.3 in Chapter 5 of NLTK Book)
POS Tagging Algorithms
• Tagging Algorithms: “POS Taggers”
  – Tagging is often done automatically with algorithms
  – These algorithms often use sequential (Markov) models
    • The tag for a particular token can depend on words/tags before and after it
  – These models are trained using machine learning
    • using various word features and dictionaries as input
  – Trained on manually labeled documents
    • Tagging performance is best on material that the tagger was originally trained on (often news documents)
• Tags can be helpful “downstream” for various applications
  – E.g., for document classification we might want to only use nouns, adjectives, and verbs and ignore everything else
  – E.g., for information extraction we might focus only on nouns
Challenges in POS Tagging
• Tagging words in text with their correct POS tags is not simply assigning words to tags using a lookup table
• Semantic context
  – The negotiator was able to bridge the gap between the 2 sides
  – Here ‘bridge’ is used as a verb even though we ordinarily think of it as a noun
  – The other words in the sentence and the grammatical structure allow us to interpret ‘bridge’ here as a verb
• Ambiguity, e.g.,
  – The president was entertaining last night
    • Both the adjective and verb tag for “entertaining” work here, i.e., there is ambiguity
• Tokenization issues
  – The algorithm must be able to deal with tokens such as I’ld or ‘pre-specified’
Software for Part of Speech Tagging
• Many software packages and online tools available
• NLTK POS Tagger
• Stanford natural language group provides excellent POS taggers in several languages (English, German, Chinese, French, etc)
  • http://nlp.stanford.edu/software/tagger.shtml
  • Uses the Penn Treebank tagset
• Online demo
  – http://demo.ark.cs.cmu.edu/parse
Example: Removing Stop Words with a POS Tagger
• Filter out any term that does not belong to a (user-defined) target set of POS classes
SPEAKER   WORD     POS TAG
PATIENT   if       IN    Preposition
PATIENT   you      PRP   Personal Pronoun
PATIENT   had      VBD   Verb, Past Tense
PATIENT   a        DT    Determiner
PATIENT   magic    JJ    Adjective
PATIENT   potion   NN    Noun
PATIENT   i        PRP   Personal Pronoun
PATIENT   ‘d       MD    Modal
PATIENT   love     VB    Verb, base form
PATIENT   to       TO
PATIENT   have     VB    Verb, base form
PATIENT   it       PRP   Personal Pronoun
Example rule: extract adjectives and nouns only
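The example rule can be sketched as a simple filter over (word, tag) pairs. The pairs below are copied from the table above; in practice they would come from a tagger (for example NLTK's `pos_tag`), and the helper name `filter_by_pos` is hypothetical.

```python
# (word, Penn Treebank tag) pairs copied from the table above
tagged = [
    ("if", "IN"), ("you", "PRP"), ("had", "VBD"), ("a", "DT"),
    ("magic", "JJ"), ("potion", "NN"), ("i", "PRP"), ("'d", "MD"),
    ("love", "VB"), ("to", "TO"), ("have", "VB"), ("it", "PRP"),
]

def filter_by_pos(tagged_tokens, keep_prefixes=("JJ", "NN")):
    """Keep words whose tag starts with one of the target prefixes.

    Matching on prefixes also keeps related tags such as NNS (plural noun)
    or JJR (comparative adjective).
    """
    return [word for word, tag in tagged_tokens if tag.startswith(keep_prefixes)]

kept = filter_by_pos(tagged)
print(kept)
```

Changing `keep_prefixes` to include "VB" would also keep the verbs, which corresponds to the document classification case mentioned earlier (nouns, adjectives, and verbs).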