Stats 170A: Project in Data Science Text Analysis and ...
Stats 170A: Project in Data Science
Text Analysis and Classification
Padhraic Smyth
Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine
(Acknowledgements to Prof Mark Steyvers and Prof Sameer Singh, UCI, for various slides)
Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018
Reading, Homework, Lectures
• Reference reading:
  – Python
    • http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
• Homework 7
  – Due by 2pm Wednesday next week
• Next Lectures
  – Today: text analysis and classification
  – Next week: discussion of projects
Room for improvement…
Predicting the Sentiment of Tweets
“Documents”                                                       Class Labels
wow :o this was included in the playlist #awesome                 positive
I could not be happier with my life right now                     positive
HAPPY BIRTHDAY TO THE PANDERIFIC PANDAIK. @TheOrionSound          positive
I'm tired as hell I never get a off day during the week anymore   negative
The movie was really dull and stupid                              negative
@washingtonpost @silvajanes How awful!!!!!!                       negative
Predicting the Scores of Movie Reviews
“Documents”                                                        Ratings
I liked this movie. Well done. Great acting and direction          4
Terrific movie, saw it 3 times. Loved the scenes in Mexico.        5
Boring as hell. Who gets paid to create this stuff?                1
Not one of the director’s best efforts. Tom Cruise was terrible.   2
Entity Recognition
“Documents”                                                        Ratings
I liked this movie. Well done. Great acting and direction          4
Terrific movie, saw it 3 times. Loved the scenes in Mexico.        5
Boring as hell. Who gets paid to create this stuff?                1
Not one of the director’s best efforts. Tom Cruise was terrible.   2
Examples of User Location Fields in Twitter
User location: Deutschland
User location: Mountain View, CA
User location: Florida, USA
User location: United States
User location: West Virginia, Appalachia
User location: Columbia Mo
User location: South Florida
User location: Germany/Berlin
User location: St Louis, MO
User location: St. Louis
User location: Central Virginia
User location: United States
User location: Sea Holme (of Norse legend) Oz
User location: California, USA
User location: USA
User location: Cambodia
User location: San Antonio, TX
User location: Between my ears
User location: Chicago, IL
Basic Concepts in Text Representation
Representing Text Documents Numerically
• How do we represent text (strings) for input to machine learning algorithms? (which generally require numeric representations)
• Define a vocabulary = set of words or terms
• Simple method: Bag of Words
  – Document represented as a vector of counts of words in the vocabulary
• More complex: Real-Valued Embeddings
  – Document (or word) represented as a real-valued vector in “embedding space”
• Even more complex: Sequential Representations
  – Sequential state-machine model for sequences of words
Example of Bag-of-Words Matrix
      database  SQL  index  calculus  derivative  function  Label
d1          24   21      9         0           0         3      1
d2          32   10      5         0           3         0      1
d3          12   16      5         0           0         0      1
d4           6    7      2         0           0         0      1
d5          43   31     20         0           3         0      1
d6           2    0      0        18           7        16      2
d7           0    0      1        32          12         0      2
d8           3    0      0        22           4         2      2
d9           1    0      0        34          27        25      2
d10          6    0      0        17           4        23      2
Concepts and Terminology
• Document:
  – Collection of words
    • A book, a news article, a report, a Web page, an email, a tweet, etc
  – May contain both text and metadata.
  – Examples of metadata: author name(s), date, where published, etc
Note that the definition of a document is flexible, e.g., a book could be a single document, or each section of a book could be considered a “document”
• Corpus: a collection of documents
  – e.g., all news articles from the Los Angeles Times since 1990
  – e.g., all Wikipedia Web pages
  – e.g., all Yelp reviews for restaurants in Chicago
  – e.g., a random sample of Tweets from Dec 2017
Concepts and Terminology
• Tokens:
  – Groupings of characters in the raw text
  – individual words (or “word types”) + possibly numbers, punctuation, etc
• Words or Terms
  – Unique tokens (or combinations of tokens) that are meaningful
  – e.g., words such as “cat”, “dog”, terms such as “U.S” or “92697”
  – Can also include bigrams (“San Diego”), trigrams (“New York City”), etc
• Vocabulary
  – The specific set of unique terms used by an algorithm or application
  – The English language has on the order of 1 million unique words
  – In a particular application we might use a vocabulary of only 10k to 50k terms
    • e.g., relevant/common words (unigrams)
    • Bigrams, trigrams, …, n-grams can also be part of the vocabulary
Typical Pipeline for Building a Text Classifier
Document (string) → Tokenization → Stop Word Removal → Vocabulary Definition → Create Bag of Words → Build Text Classifier
Note that steps in the pipeline, and how these are implemented (e.g., how tokenization is done), will vary from application to application depending on the problem
Example of a Text Analysis Pipeline (for Information Extraction)
From http://www.nltk.org/book/ch07.html
Example
Document = ‘Chapter 1: The Beginning. In the beginning, life was tough! ..........’
Tokens = {‘chapter’, ‘1’, ‘the’, ‘beginning’, ‘in’, ‘the’, ‘beginning’, ‘life’, ‘was’, …} (Punctuation and whitespace are usually ignored)
Vocabulary = {‘chapter’, ‘1’, ‘the’, ‘beginning’, ‘in’, ‘life’, ‘was’, ‘tough’, …}
Bag of Words = {[‘chapter’, 1], [‘1’, 1], [‘the’, 2], [‘beginning’, 2], …}
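The three steps in this example can be sketched in a few lines of Python. This is only an illustration: the regular expression is an assumed, very simple tokenizer (lowercase, keep runs of letters or digits), and the document string is cut off at "tough!" since the slide elides the rest.

```python
import re
from collections import Counter

document = "Chapter 1: The Beginning. In the beginning, life was tough!"

# Lowercase, then keep unbroken runs of letters or digits as tokens;
# punctuation and whitespace are dropped.
tokens = re.findall(r"[a-z0-9]+", document.lower())

# Vocabulary: the unique tokens, in order of first appearance.
vocabulary = list(dict.fromkeys(tokens))

# Bag of words: how often each vocabulary term occurs in the document.
bag_of_words = Counter(tokens)

print(tokens)
print(vocabulary)
print([[term, bag_of_words[term]] for term in vocabulary])
```

Note that the token list keeps duplicates and order, while the bag of words keeps only the counts.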
Tokenization
• Split up text into individual tokens (words/terms)
• Simplest approach is to ignore all punctuation and whitespace and use only unbroken strings of alphabetic characters as tokens
If you had a magic potion I’d love to have it.
If | you | had | a | magic | potion | I’d | love | to | have | it
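A minimal sketch of this simplest approach, using Python's `re` module. One assumption beyond the slide: the pattern allows a single internal apostrophe so that "I’d" stays one token, as in the example; a strictly alphabetic pattern would split it into "I" and "d". The function name `simple_tokenize` is made up for illustration.

```python
import re

def simple_tokenize(text):
    """Return runs of letters, allowing one internal apostrophe (as in I'd)."""
    return re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)?", text)

tokens = simple_tokenize("If you had a magic potion I’d love to have it.")
print(tokens)
```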
Issues in Tokenization
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
From https://web.stanford.edu/~jurafsky/slp3/ Speech and Language Processing, 3rd ed, Jurafsky and Martin
Tokenization Software
• Instead of writing your own tokenizer with a complex set of rules, use existing software
  – e.g., tokenizer function in NLTK
  – e.g., tokenizer from Stanford’s natural language group
• Practical tip:
  – It's useful to keep different representations of the data to use later on, e.g.,
    • Data with original sequence and formatting, tokenized list, bag of words, etc
  – Sequential order of words is needed for detecting n-grams.
  – Punctuation can contain useful information:
If you had a magic potion I’d love to have it . If that makes sense
If we decide to extract n-grams later on, we know that “it” and “if” should not be combined. So it's useful to retain this information.
Sentence Detection: Example using a Decision Tree
From https://web.stanford.edu/~jurafsky/slp3/ Speech and Language Processing, 3rd ed, Jurafsky and Martin
Example: Tokenization of Tweets
There are special-purpose tokenizers for different types of text, e.g., for Tweets

    from nltk.tokenize import TweetTokenizer
    from nltk.tokenize import word_tokenize

    # `tweets` is assumed to be a pandas DataFrame with a `text` column
    tknzr = TweetTokenizer()

    for i in range(10):
        print("NLTK Twitter Tokenizer: ", tknzr.tokenize(tweets.text[i]))
        print("NLTK Standard Tokenizer:", word_tokenize(tweets.text[i]))
Sample Output from the 2 Tokenizers
NLTK Twitter Tokenizer: ['RT', '@ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who', '…']
NLTK Standard Tokenizer: ['RT', '@', 'ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who…']

NLTK Twitter Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https://t.co/olA35uCt89']
NLTK Standard Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https', ':', '//t.co/olA35uCt89']

NLTK Twitter Tokenizer: ['RT', '@IndivisibleTeam', ':', "Here's", 'what', 'you', 'need', 'to', 'know', 'about', '#ReleaseTheMemo', '�', '1', '⃣', 'It', '’', 's', 'not', 'really', 'a', 'memo', '.', "They're", 'talking', 'points', 'by', 'Devin', 'Nunes', '…']
NLTK Standard Tokenizer: ['RT', '@', 'IndivisibleTeam', ':', 'Here', "'s", 'what', 'you', 'need', 'to', 'know', 'about', '#', 'ReleaseTheMemo�', '1⃣', 'It’s', 'not', 'really', 'a', 'memo', '.', 'They', "'re", 'talking', 'points', 'by', 'Devin', 'Nunes…']
Example: Defining a Vocabulary
Raw text (a string in Python):
raw1 = “The dog chased the cat and the mouse. Why did the dog do this?”
There are 14 word tokens in the string raw1 (if we ignore punctuation and spaces):
The, dog, chased, the, cat, and, the, mouse, Why, did, the, dog, do, this
The vocabulary (the unique tokens, normalizing to lowercase) is:
the, dog, chased, cat, and, mouse, why, did, do, this
The vocabulary size is 10.
The counts for a bag of words representation are:
the (4), dog (2), chased (1), cat (1), and (1), mouse (1), why (1), did (1), do (1), this (1)
If we remove stopwords we decrease our vocabulary size to 4:
dog (2), chased (1), cat (1), mouse (1)
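The numbers in this example can be checked with a short script. The stop word set below is an assumption, chosen by hand so that the four content words from the example remain; a real application would use a standard list such as NLTK's.

```python
import re
from collections import Counter

raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

# Word tokens: unbroken runs of letters (punctuation and spaces ignored).
tokens = re.findall(r"[A-Za-z]+", raw1)

# Bag-of-words counts after normalizing to lowercase.
counts = Counter(t.lower() for t in tokens)

# Hypothetical stop word list for this one sentence.
stop_words = {"the", "and", "why", "did", "do", "this"}
content_counts = {t: c for t, c in counts.items() if t not in stop_words}

print(len(tokens))      # number of word tokens
print(len(counts))      # vocabulary size
print(counts)
print(content_counts)   # vocabulary after stop word removal
```

The counts always sum to the number of tokens, which is a quick sanity check on any bag-of-words representation.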
Defining the Vocabulary (after Tokenization)
• Vocabulary
  – Set of terms (words) used to construct the document-term matrix
• Basic approach: use single words (unigrams) as terms
• Remove very common terms (e.g., stopwords)
• Remove very rare terms: e.g., remove all terms that occur in fewer than K documents in the corpus (e.g., K = 10)
  – Gets rid of misspellings, unusual names, etc
• Can extend term list with n-grams
  – Frequent word combinations (2-grams, 3-grams, …), e.g., “feel good” / “New York City”
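As a rough sketch of the stop word and rare-term filters, the function below keeps a term only if it is not a stop word and appears in at least `min_df` documents. The function name, the toy corpus and the stop word set are all made up for illustration; `min_df` plays the role of K.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, stop_words, min_df):
    """Keep terms that are not stop words and occur in at least min_df documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(t for t, df in doc_freq.items()
                  if df >= min_df and t not in stop_words)

docs = [
    ["the", "food", "was", "great"],
    ["the", "service", "was", "slow"],
    ["great", "food", "and", "great", "service"],
    ["the", "fooood", "was", "ok"],  # a misspelling that occurs in one document only
]
vocab = build_vocabulary(docs, stop_words={"the", "was", "and"}, min_df=2)
print(vocab)
```

With `min_df=2` the misspelling "fooood" and the one-off terms "slow" and "ok" drop out of the vocabulary.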
[Figure: frequency of word in English usage versus rank of word by frequency, showing a very long “tail” of rare words. Graph from www.prismnet.com/~dierdorf/wordfrequency.png]
Stopword Removal
• Remove words that are likely to be irrelevant to our problem
• Keep content words (typically verbs, adverbs, nouns, adjectives)
Example:
If you had a magic potion I’d love to have it . If that makes sense
But what about this?
[Prince Hamlet] To be or not to be ...
NLTK Stop Word List
Note: in many applications there may be additional domain-specific “stopwords” that are very common and that we may want to remove since they contain little information, e.g., the term “restaurant” in reviews
N-grams
• Useful n-grams are groups of n words that commonly appear in sequence
  – “Computer Science”
  – “Los Angeles”
  – “New York City”
  – Note that we can have both terms like “computer” and “computer science”, e.g., “I am studying computer science at UCI and I bought a new computer last week.”
• Adding an n-gram as a term in the vocabulary may be useful in a model
  – e.g., for a restaurant review classification problem, “wait time” may be more informative than “time”
Automatically Finding N-grams
- e.g., Bigrams
  - Keep track of all unique pairs that occur sequentially (exclude stopwords)
    “I visited Microsoft Research for a computer science job interview last week.”
    “* visited Microsoft Research * * computer science job interview last week.”
  - Rank by frequency of occurrence (or number of docs a bigram appears in)
  - Can also rank by other metrics
  - Keep the top K in the vocabulary
  - How large should K be? Depends on the application, amount of data, etc
  - Might need to search over different values of K (e.g., using cross-validation)
- Same idea for tri-grams, 4-grams, etc.
- See also http://www.nltk.org/howto/collocations.html in NLTK
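The bigram procedure above can be sketched with a `Counter`. Pairs that contain a stop word are skipped, so words on either side of a stop word are never joined (the role of the `*` placeholders in the example). The stop word set and the second sentence are invented for illustration, and ranking is by raw frequency only.

```python
from collections import Counter

def count_bigrams(tokenized_docs, stop_words):
    """Count adjacent word pairs, skipping any pair that contains a stop word."""
    counts = Counter()
    for tokens in tokenized_docs:
        for first, second in zip(tokens, tokens[1:]):
            if first.lower() in stop_words or second.lower() in stop_words:
                continue
            counts[(first.lower(), second.lower())] += 1
    return counts

stop_words = {"i", "for", "a", "she"}  # hypothetical stop word list
docs = [
    "I visited Microsoft Research for a computer science job interview last week".split(),
    "She studies computer science".split(),
]
bigram_counts = count_bigrams(docs, stop_words)

K = 3  # how many bigrams to keep; in practice tuned, e.g., by cross-validation
top_k = bigram_counts.most_common(K)
print(top_k)
```

NLTK's collocations module offers ranking by other association measures as well, which is usually preferable to raw counts on real corpora.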
Example of Bag-of-Words Matrix
      database  SQL  index  calculus  derivative  function  Label
d1          24   21      9         0           0         3      1
d2          32   10      5         0           3         0      1
d3          12   16      5         0           0         0      1
d4           6    7      2         0           0         0      1
d5          43   31     20         0           3         0      1
d6           2    0      0        18           7        16      2
d7           0    0      1        32          12         0      2
d8           3    0      0        22           4         2      2
d9           1    0      0        34          27        25      2
d10          6    0      0        17           4        23      2
Use labeled training data (supervised learning) to learn a classifier
  – Homework Assignment 7
Overview of Text Classification
Text Classification
• Text classification has many applications
  – Spam email detection
  – Classifying news articles, e.g., Google News
  – Classifying Web pages into categories
• Data Representation
  – “Bag of words/terms” most commonly used: either counts or binary
  – Can also use other weighting and additional features (e.g., metadata)
• Classification Methods
  – Naïve Bayes: widely used baseline
    • Fast and reasonably accurate
  – Logistic Regression
    • Widely used in industry, accurate, excellent baseline
  – Neural networks and deep learning
    • Can be very accurate
    • Can require very large amounts of labeled training data
    • More complex than other methods
Example of Document by Term Matrix
      predict  finance  stocks  goal  score  team  Class Label
d1          3        1       4     0      0     1            1
d2          2        2       1     0      1     0            1
d3          1        1       2     0      0     0            1
d4          1        2       0     0      0     0            1
d5          3        1       1     0      1     0            1
d6          1        0       0     1      3     1            2
d7          0        0       1     4      1     0            2
d8          2        0       0     1      2     1            2
d9          1        0       0     2      1     2            2
d10         1        0       0     1      3     1            2
Possible Weights for a Linear Classifier
        predict  finance  stocks  goal  score  team  Class Label
Weight      0.1      2.5     1.0  -2.0   -0.2  -1.5
d1            3        1       4     0      0     1            1
d2            2        2       1     0      1     0            1
d3            1        1       2     0      0     0            1
d4            1        2       0     0      0     0            1
d5            3        1       1     0      1     0            1
d6            1        0       0     1      3     1            2
d7            0        0       1     4      1     0            2
d8            2        0       0     1      2     1            2
d9            1        0       0     2      1     2            2
d10           1        0       0     1      3     1            2
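To see what these weights do, the sketch below computes the weighted sum of term counts for each document. Two assumptions are not on the slide: there is no intercept term, and a positive score is read as class 1 and a negative score as class 2.

```python
terms = ["predict", "finance", "stocks", "goal", "score", "team"]
weights = [0.1, 2.5, 1.0, -2.0, -0.2, -1.5]

# Term counts and class labels, copied from the matrix above.
docs = {
    "d1":  ([3, 1, 4, 0, 0, 1], 1),
    "d2":  ([2, 2, 1, 0, 1, 0], 1),
    "d3":  ([1, 1, 2, 0, 0, 0], 1),
    "d4":  ([1, 2, 0, 0, 0, 0], 1),
    "d5":  ([3, 1, 1, 0, 1, 0], 1),
    "d6":  ([1, 0, 0, 1, 3, 1], 2),
    "d7":  ([0, 0, 1, 4, 1, 0], 2),
    "d8":  ([2, 0, 0, 1, 2, 1], 2),
    "d9":  ([1, 0, 0, 2, 1, 2], 2),
    "d10": ([1, 0, 0, 1, 3, 1], 2),
}

def score(counts):
    """Weighted sum of term counts (no intercept, by assumption)."""
    return sum(w * x for w, x in zip(weights, counts))

predictions = {}
for name, (counts, label) in docs.items():
    s = score(counts)
    predictions[name] = 1 if s > 0 else 2  # assumed decision rule
    print(name, round(s, 2), "predicted:", predictions[name], "actual:", label)
```

Under these assumptions every document gets a score on the correct side of zero: the finance-related terms push the score up and the sports-related terms push it down.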
Real Example from Yelp Data
Yelp Dataset
Number of Reviews: 706,693
Number of Reviews w/o Neutral Rating: 595,468
Number of Tokens: 85,392,376
Vocabulary Size w/o Stopwords: 176,114
Array Dimensions: (595468, 176114)
Number of cells in the Array: 104,870,251,352
Non-zero entries: 28,357,001
Density: 0.027%
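The array size and density follow directly from the counts above, as this short calculation shows. Dividing the non-zero entries by the number of cells gives a density of roughly 0.027%. The 8-bytes-per-cell figure used for the dense memory estimate is an assumption (a 64-bit number per cell).

```python
n_docs, n_terms = 595_468, 176_114
nonzero = 28_357_001

cells = n_docs * n_terms
density = nonzero / cells

print(f"cells:   {cells:,}")
print(f"density: {100 * density:.3f}%")

# Memory needed to store every cell densely, assuming 8 bytes per cell.
dense_gb = cells * 8 / 1e9
print(f"dense storage: about {dense_gb:.0f} GB")
```

This is why bag-of-words matrices are stored in sparse formats that record only the non-zero entries.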
Example of a Pipeline for Document Classification
Training: Training Documents (corpus) → Tokenization → Lists of Tokens → Stopword and rare word removal → Vocabulary → Frequency Counts → Bag of Words → Machine Learning Algorithm → Document Classifier
Prediction: New Document → Tokenization → Lists of Tokens → Bag of Words → Document Classifier → Label Prediction
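The sketch below walks through both halves of such a pipeline on a toy corpus: the vocabulary is built from the training documents only, and a new document is mapped onto that same vocabulary before prediction. Everything here is illustrative: the stop word list, the four training sentences, and the choice of a small multinomial Naive Bayes classifier with add-one smoothing are assumptions, and rare word removal is omitted because the corpus is tiny.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "was", "and", "is", "of"}  # hypothetical list

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def fit_vocabulary(token_lists):
    """Vocabulary = sorted unique training tokens that are not stop words."""
    return sorted({t for tokens in token_lists for t in tokens if t not in STOP_WORDS})

def bag_of_words(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]  # out-of-vocabulary tokens are ignored

def train_naive_bayes(rows, labels):
    """Multinomial Naive Bayes with add-one smoothing."""
    n_terms = len(rows[0])
    model = {}
    for label in set(labels):
        class_rows = [r for r, y in zip(rows, labels) if y == label]
        totals = [sum(col) for col in zip(*class_rows)]
        denom = sum(totals) + n_terms
        log_prior = math.log(len(class_rows) / len(rows))
        model[label] = (log_prior, [math.log((t + 1) / denom) for t in totals])
    return model

def predict(model, row):
    def log_score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(x * lp for x, lp in zip(row, log_probs))
    return max(model, key=log_score)

# Training path: documents -> tokens -> vocabulary -> bag of words -> classifier
train_docs = ["The movie was great", "Great acting and a great script",
              "The movie was dull", "Dull and boring script"]
train_labels = ["positive", "positive", "negative", "negative"]
token_lists = [tokenize(d) for d in train_docs]
vocabulary = fit_vocabulary(token_lists)
X = [bag_of_words(t, vocabulary) for t in token_lists]
classifier = train_naive_bayes(X, train_labels)

# Prediction path: new document -> tokens -> bag of words (same vocabulary) -> label
new_row = bag_of_words(tokenize("A boring movie"), vocabulary)
print(predict(classifier, new_row))
```

The key design point is that the new document reuses the training vocabulary; terms never seen in training simply contribute nothing to the prediction.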
Key Steps in Document Analysis Pipelines (for Bag of Words)
• Tokenization
  – Various options (e.g., with punctuation, non-alphanumeric symbols, etc)
• Vocabulary definition
  – N-grams, stopword removal, rare word removal, stemming
• Feature definition
  – Binary (term present or not?)
  – Counts
  – Weighted counts, e.g., TF-IDF (see later in the slides)
• Classifier selection
  – Naïve Bayes, logistic, SVMs, neural networks, etc
TF-IDF Weighting of Features
In practice the inputs can be weighted
  – It can be helpful to use “TF-IDF weights” instead of counts
TF(t,d) = term frequency = count = number of times term t occurs in doc d
IDF(t) = inverse document frequency = log(N / number of docs with term t)
(where N = total number of docs in the corpus)
TF-IDF(t,d) = TF(t,d) * IDF(t)
The IDF term has the effect of upweighting terms that occur in few docs
TF-IDF Example
N = 1000 in a corpus of news articles
Term 1: t = “city”, appears in 500 documents
IDF(t) = log(1000/500) = log(2) = 1 (log is base 2; the choice of base is not important)
Term 2: t = “freeway”, appears in 10 documents
IDF(t) = log(1000/10) = log(100) = 6.64
So occurrences of “freeway” will get upweighted by a factor of 6.64 compared to occurrences of “city”
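The two IDF values in this example can be reproduced with `math.log2`. The function names are made up for illustration. Note that library implementations often use a smoothed variant of the formula, so their numbers may differ slightly from this plain definition.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency with a base-2 log, as in the example."""
    return math.log2(n_docs / doc_freq)

def tf_idf(term_count, n_docs, doc_freq):
    """TF-IDF weight for a term that occurs term_count times in a document."""
    return term_count * idf(n_docs, doc_freq)

N = 1000
idf_city = idf(N, 500)     # "city" appears in 500 documents
idf_freeway = idf(N, 10)   # "freeway" appears in 10 documents

print(idf_city, round(idf_freeway, 2))
print(round(tf_idf(3, N, 10), 2))  # a document that mentions "freeway" 3 times
```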
Classification Methods for Text
• Logistic regression: widely used for bag-of-words
  – Input dimensionality (vocabulary size) is high (e.g., 5k to 500k)
  – Regularization is useful
• Recurrent neural networks are widely used for sequential models (see later)
  – More complex, but can pick up sequential information
[Figures from: http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html and https://github.com/TeamHG-Memex/eli5]
Sentiment Analysis
Very useful tutorial, many practical tips and ideas
Online at http://sentiment.christopherpotts.net/
Example: Predicting if Review Text is Positive or Negative
• Two basic approaches
  – Use lexicons of positive and negative words
  – Use labeled data to learn a classification model
  (can also combine both approaches)
• Challenges, e.g., how to handle negation?
  – Positive reviews:
    • “This movie is not depressing the way I thought it would be. I don’t think I have anything negative to say about it.”
    • “I wasn’t disappointed at all with the acting – in fact the acting was excellent.”
  – Negative reviews:
    • “Sitting through this movie was not something I enjoyed doing”
    • “This wasn’t a very good script. The director is usually excellent (some really great movies over the years that I really enjoyed), but this leaves me cold”
Negation Marking (during Tokenization)
• Idea: append a _NEG suffix to every word appearing between a negation and a clause-level punctuation mark
From http://sentiment.christopherpotts.net/lingstruc.html
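A simplified sketch of negation marking on an already tokenized sentence. The sets of negation words and clause-level punctuation below are small hand-picked assumptions, not the lists used in the cited tutorial, and the function names are made up for illustration.

```python
NEGATIONS = {"not", "no", "never", "cannot"}    # assumed, incomplete list
CLAUSE_PUNCT = {".", ",", ":", ";", "!", "?"}   # assumed clause-level marks

def is_negation(token):
    t = token.lower()
    return t in NEGATIONS or t.endswith("n't") or t.endswith("n’t")

def mark_negation(tokens):
    """Append _NEG to every token between a negation and the next clause punctuation."""
    marked, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_PUNCT:
            in_scope = False          # punctuation ends the negation scope
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
            if is_negation(tok):
                in_scope = True       # scope starts after the negation word
    return marked

tokens = "I didn't enjoy this movie at all , but the acting was fine .".split()
marked = mark_negation(tokens)
print(marked)
```

After marking, "enjoy" and "enjoy_NEG" are separate vocabulary terms, so a classifier can learn different weights for them.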
Examples of Negation Marking
From http://sentiment.christopherpotts.net/lingstruc.html
Improvements in Classification Accuracy
From http://sentiment.christopherpotts.net/lingstruc.html
Sentiment Lexicons (Dictionaries)
• Counts, percentages, weights of words in various categories
  – e.g., degree to which a word tends to be positive or negative
• Example Lexicons
  – General Inquirer (http://www.wjh.harvard.edu/~inquirer)
    • Words categorized according to Positive/Negative, Strong vs Weak, Active vs Passive, etc
  – SentiWordNet (http://sentiwordnet.isti.cnr.it/)
    • Synsets in WordNet 3.0 annotated for degrees of positivity, negativity, and neutrality/objectiveness
  – Linguistic Inquiry and Word Count (http://www.liwc.net/)
  – NRC Word-Emotion Association Lexicon (next slide)
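A lexicon can be used directly as a very simple classifier by counting positive and negative words. The two word sets below are tiny made-up stand-ins, not taken from any of the lexicons listed above, and the function name is hypothetical.

```python
POSITIVE = {"great", "excellent", "loved", "terrific"}   # toy lexicon (assumption)
NEGATIVE = {"boring", "dull", "terrible", "awful"}       # toy lexicon (assumption)

def lexicon_score(tokens):
    """Number of positive words minus number of negative words."""
    pos = sum(t.lower() in POSITIVE for t in tokens)
    neg = sum(t.lower() in NEGATIVE for t in tokens)
    return pos - neg

print(lexicon_score("Terrific movie loved the scenes in Mexico".split()))
print(lexicon_score("Tom Cruise was terrible".split()))
print(lexicon_score("This was not great".split()))  # negation is ignored here
```

The last call shows the weakness of pure word counting: "not great" still scores as positive, which is the negation problem discussed earlier.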
Yelp Challenge Data Set
Jupyter Notebook Example with Yelp Dataset
Vector Representations for Text
Vector Representations allow Generalization
• Example:
  Cat:    [1.5, 1.5, -0.2, 0.0, 0.0, 0.0]
  Kitten: [1.4, 1.5, -0.2, 0.0, 0.0, 0.0]
  Pet:    [2.0, 1.6, -0.3, 0.0, 0.0, 0.0]
• If the word “cat” occurs in a document then we know that other words like “kitten” and “pet” are similar
• Why is this useful in learning?
  – Consider a classifier based on weights (e.g., logistic regression, neural network)
  – Traditional approach: each word has its own separate weight
  – If we represent words by their vectors, and learn weights on the vectors, then words that are close together will produce similar outputs
  – This can help with generalization
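A small numerical illustration of the last point, using the three example vectors. The weight vector is invented for illustration; the point is only that one set of weights applied to nearby vectors gives nearby outputs.

```python
vectors = {
    "cat":    [1.5, 1.5, -0.2, 0.0, 0.0, 0.0],
    "kitten": [1.4, 1.5, -0.2, 0.0, 0.0, 0.0],
    "pet":    [2.0, 1.6, -0.3, 0.0, 0.0, 0.0],
}

# Hypothetical weights, one per embedding dimension (not one per word).
weights = [0.8, -0.5, 1.0, 0.3, -0.7, 0.2]

def linear_output(vector):
    return sum(w * v for w, v in zip(weights, vector))

outputs = {word: linear_output(vec) for word, vec in vectors.items()}
for word, value in outputs.items():
    print(word, round(value, 2))
```

A classifier trained on documents containing "cat" would therefore respond in nearly the same way to "kitten", even if "kitten" never appeared in the training data.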
![Page 63: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/63.jpg)
PadhraicSmyth,UCIrvine:Stats170AB,Winter 2018: 63
![Page 64: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/64.jpg)
![Page 65: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/65.jpg)
Another Example of Word Embedding
![Page 66: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/66.jpg)
Publicly-Available Pre-Trained Word Embeddings
• Word2Vec: https://code.google.com/archive/p/word2vec/
  – Google News dataset, 3M vocab (words and phrases), from 100B tokens
  – Entity vectors: 1.4M vectors for entities, trained on 100B words from news articles, entity names from Freebase
• GloVe: http://nlp.stanford.edu/projects/glove/
  – Wikipedia: 400k vocab, 50d to 300d vectors, based on 6B tokens
  – Common Crawl: 2.2M vocab, 300d vectors, from 840B tokens
  – Twitter: 1.2M vocab, 25d to 200d vectors, based on 2B tweets
• Note: you may want to consider using these in your projects
  – Easier and faster than training your own embedding model
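As a rough sketch of how such vectors might be read into Python, the following parser assumes the plain-text layout used by the GloVe downloads: one word per line followed by its space-separated vector components. The sample text and its 3-dimensional vectors are made up for illustration; the function name is hypothetical.

```python
import io


def load_text_embeddings(lines):
    """Parse lines of the form 'word v1 v2 ... vd' into a dict of float lists."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings


# Made-up 3-dimensional sample standing in for a downloaded vector file.
SAMPLE = "cat 0.1 0.2 0.3\nkitten 0.1 0.25 0.3\n"
embeddings = load_text_embeddings(io.StringIO(SAMPLE))
print(embeddings["cat"])
```

For a real file, the same function can be passed an open file handle (opened with `encoding="utf-8"`) instead of the in-memory sample. Large files such as the 840B-token set may not fit comfortably in memory as Python lists.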
![Page 67: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/67.jpg)
Cosine Distance between two Vectors
[Figure: vectors x, y, and z plotted against axes Component 1 and Component 2]
In this example, y and z are more similar under cosine similarity than y and x, because the angle between y and z is much smaller
Cosine distance = 1 - cosine similarity
Cosine similarity: measures the cosine of the angle between 2 vectors, where cos(0°) = 1 and cos(90°) = 0
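The definition above can be sketched in a few lines of Python. The 2-d coordinates for x, y, and z are hypothetical values chosen only to mimic the picture (y and z point in nearly the same direction, x does not); they are not taken from the slide.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical coordinates (Component 1, Component 2).
x = [3.0, 0.5]
y = [1.0, 2.0]
z = [2.0, 3.0]

print("sim(y, z) =", round(cosine_similarity(y, z), 3))
print("sim(y, x) =", round(cosine_similarity(y, x), 3))
print("dist(y, z) =", round(cosine_distance(y, z), 3))
```

Note that cosine similarity ignores vector length: scaling a vector by a positive constant leaves its similarity to every other vector unchanged.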
![Page 68: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/68.jpg)
Examples of Similarity between Words
Examples from https://code.google.com/archive/p/word2vec/
Cosine similarity to “France”
Cosine similarity to “San Francisco”
Cosine similarity: measures the cosine of the angle between 2 vectors, where cos(0°) = 1 and cos(90°) = 0
![Page 69: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/69.jpg)
Most Similar Vectors to ‘cat’ from word2vec 300d embeddings
Rank  Word      Cosine Similarity
1     cat       1.00
2     cats      0.81
3     dog       0.76
4     kitten    0.75
5     feline    0.73
6     beagle    0.72
7     puppy     0.71
8     pup       0.69
9     pet       0.69
10    felines   0.68
![Page 70: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/70.jpg)
Most Similar Vectors to ‘cat’ from word2vec 300d embeddings
Rank  Word              Cosine Similarity
1     cat               1.00
2     cats              0.81
3     dog               0.76
4     kitten            0.75
5     feline            0.73
6     beagle            0.72
7     puppy             0.71
8     pup               0.69
9     pet               0.69
10    felines           0.68
11    chihuahua         0.67
12    pooch             0.67
13    kitties           0.67
14    dachshund         0.67
15    poodle            0.66
16    stray_cat         0.66
17    Shih_Tzu          0.66
18    tabby             0.66
19    basset_hound      0.65
20    golden_retriever  0.65
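A ranking like the one above can be produced by scoring every vocabulary word against the query and sorting. The sketch below uses made-up 3-dimensional vectors rather than the real 300d word2vec embeddings, and `most_similar` is a hypothetical helper name, so the numbers it prints are not those in the table.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(query, embeddings, top_n=3):
    """Return the top_n (word, similarity) pairs, most similar first."""
    q = embeddings[query]
    scored = [(word, cosine_similarity(q, vec)) for word, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up toy vectors; real embeddings would have hundreds of dimensions.
TOY = {
    "cat":    [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.1],
    "dog":    [0.8, 0.5, 0.3],
    "car":    [0.1, 0.2, 0.9],
}

ranking = most_similar("cat", TOY, top_n=4)
for rank, (word, sim) in enumerate(ranking, start=1):
    print(f"{rank}: {word} {sim:.2f}")
```

As in the table, the query word itself comes out at rank 1 because its similarity to itself is 1.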
![Page 71: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/71.jpg)
Most Similar Vectors to ‘oxycontin’ from word2vec 300d embeddings
Rank  Word                Cosine Similarity
1     oxycontin           1.00
2     Oxycontin           0.79
3     Oxycodone           0.71
4     OxyContin           0.69
5     morphine_pills      0.67
6     hydrocodone         0.67
7     Lortab              0.67
8     oxycodone           0.65
9     OxyContin_pills     0.65
10    Hydrocodone         0.65
11    Alprazolam          0.64
12    Oxycodone_pills     0.63
13    Xanax               0.62
14    Lorcet              0.62
15    Percocet_pills      0.61
16    Valium              0.61
17    Diazepam            0.60
18    methadone_pills     0.60
19    hydrocodone_pills   0.60
20    prescription_pills  0.59
![Page 72: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/72.jpg)
Most Similar Vectors to ‘iran’ from word2vec 300d embeddings
Rank  Word         Cosine Similarity
1     iran         1.00
2     pakistan     0.66
3     israel       0.64
4     lebanon      0.62
5     russia       0.62
6     iraq         0.61
7     egypt        0.60
8     Irans        0.59
9     afghanistan  0.59
10    Iran         0.59
11    arabs        0.58
12    israeli      0.57
13    obama        0.56
14    arab         0.56
15    iraqi        0.56
16    japan        0.56
17    sri_lanka    0.56
18    america      0.55
19    korea        0.55
20    cuba         0.55
![Page 73: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/73.jpg)
Most Similar Vectors to ‘Iran’ from word2vec 300d embeddings
Rank  Word                Cosine Similarity
1     Iran                1.00
2     Tehran              0.83
3     Iranian             0.82
4     Islamic_republic    0.81
5     Islamic_Republic    0.81
6     Iranians            0.78
7     Teheran             0.76
8     Syria               0.75
9     Ahmadinejad         0.69
10    Irans               0.67
11    Larijani            0.66
12    North_Korea         0.64
13    Mottaki             0.63
14    clerical_regime     0.63
15    nuclear_ambitions   0.63
16    Khamenei            0.63
17    Ayatollah_Khamenei  0.63
18    Velayati            0.63
19    Ahmedinejad         0.63
20    Rafsanjani          0.62
![Page 74: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/74.jpg)
Recurrent Neural Networks for Sequential Data
![Page 75: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/75.jpg)
![Page 76: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/76.jpg)
Network Model for Predicting the Next Word in a Sentence
Sentence: The dog saw the cat on the wall.
[Figure: network diagram. Input layer: binary “one-hot” encoding over the vocabulary (the, dog, saw, cat, on, wall, .), e.g., 1000000 for “the” and 0100000 for “dog”. Hidden layer: hidden unit activations h1, h2 (real-valued). Output layer: binary target output, “one-hot” encoding over the same vocabulary.]
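The one-hot encoding in the figure can be sketched as follows. The vocabulary order matches the figure (so “the” is 1000000 and “dog” is 0100000); lowercasing “The” and treating the final period as its own token are assumptions made here for illustration.

```python
# Vocabulary in the order shown in the figure; the period is its own token.
VOCAB = ["the", "dog", "saw", "cat", "on", "wall", "."]
INDEX = {word: i for i, word in enumerate(VOCAB)}


def one_hot(word):
    """Binary vector with a single 1 at the word's vocabulary position."""
    vec = [0] * len(VOCAB)
    vec[INDEX[word]] = 1
    return vec


# "The dog saw the cat on the wall." after lowercasing and tokenizing.
tokens = ["the", "dog", "saw", "the", "cat", "on", "the", "wall", "."]

# Training pairs for next-word prediction: (current word, next word).
pairs = [(one_hot(cur), one_hot(nxt)) for cur, nxt in zip(tokens, tokens[1:])]

print(one_hot("the"))
print(one_hot("dog"))
print(len(pairs), "training pairs")
```

Each pair supplies one input vector and one target vector of the same length as the vocabulary.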
![Page 77: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/77.jpg)
Standard Recurrent Neural Network
Figures from http://cs231n.stanford.edu/slides/2017/
Different from a “feedforward” network: the hidden unit at position t now also has input from the hidden vector at time t-1
![Page 78: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/78.jpg)
State Computation in a Neural Network
Figures from http://cs231n.stanford.edu/slides/2017/
![Page 79: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/79.jpg)
State Computation in a Neural Network
Figures from http://cs231n.stanford.edu/slides/2017/
Key points:
- f_W in each position is the RNN state
- This state is updated recursively from the previous state and the current input
![Page 80: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/80.jpg)
State Computation in a Neural Network
Figures from http://cs231n.stanford.edu/slides/2017/
Key points:
- The x->f and f->h arrows are weight matrices
- The same weight matrices are (usually) used across the sequence
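A minimal sketch of this recursive state update, assuming the common "vanilla" RNN form in which the new state is tanh of a weighted sum of the previous state and the current input. The names `W_hh` and `W_xh`, the tanh nonlinearity, the hidden size of 2, and the random weight values are all assumptions for illustration; the slides do not specify them.

```python
import math
import random


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_step(h_prev, x, W_hh, W_xh):
    """New state computed from the previous state and the current input."""
    a = matvec(W_hh, h_prev)
    b = matvec(W_xh, x)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


def run_rnn(inputs, W_hh, W_xh, hidden_size):
    """Apply the same weight matrices at every position in the sequence."""
    h = [0.0] * hidden_size  # initial state
    states = []
    for x in inputs:
        h = rnn_step(h, x, W_hh, W_xh)
        states.append(h)
    return states


random.seed(0)
HIDDEN, INPUT = 2, 7  # 2 hidden units, 7-word one-hot vocabulary
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(INPUT)] for _ in range(HIDDEN)]

# One-hot inputs for three consecutive words.
inputs = [
    [1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
]
states = run_rnn(inputs, W_hh, W_xh, HIDDEN)
for t, h in enumerate(states, start=1):
    print(f"h{t} =", [round(v, 3) for v in h])
```

Because each state depends on the previous one, the state after the second word differs depending on which word came first, which is what lets the network carry context along the sequence.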
![Page 81: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/81.jpg)
Learning the Weights in a Recurrent Network
Figures from http://cs231n.stanford.edu/slides/2017/
Learning:
- loss function compares predictions and targets
- compute gradient (based on error) and update weights
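The two learning steps can be illustrated on the output layer alone. The sketch below assumes a softmax output with cross-entropy loss, for which the gradient with respect to each output score is simply (predicted probability - target). It updates only a hypothetical hidden-to-output matrix `W_hy` for one fixed hidden state; full training would also propagate gradients back through the recurrent weights, which is not shown here.

```python
import math


def softmax(z):
    """Convert scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Loss comparing the predicted distribution with a one-hot target."""
    return -math.log(probs[target_index])


def predict(W_hy, h):
    """Probabilities over the vocabulary given hidden state h."""
    return softmax([sum(w * x for w, x in zip(row, h)) for row in W_hy])


def sgd_step(W_hy, h, target_index, lr=0.5):
    """One gradient step on W_hy; returns the loss before the update."""
    probs = predict(W_hy, h)
    for k, row in enumerate(W_hy):
        err = probs[k] - (1.0 if k == target_index else 0.0)  # dLoss/dscore_k
        for j in range(len(row)):
            row[j] -= lr * err * h[j]
    return cross_entropy(probs, target_index)


h = [0.5, -0.3]                        # hypothetical hidden state
W_hy = [[0.0, 0.0] for _ in range(7)]  # 7 output words, 2 hidden units
target = 1                             # index of the correct next word

losses = [sgd_step(W_hy, h, target) for _ in range(20)]
print("first loss:", round(losses[0], 3), "last loss:", round(losses[-1], 3))
```

With all-zero initial weights the model predicts a uniform distribution, so the first loss equals log(7); repeated updates push probability toward the target word and the loss falls.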
![Page 82: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/82.jpg)
Example: Prediction of Sequences of Characters
Figures from http://cs231n.stanford.edu/slides/2017/
![Page 83: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/83.jpg)
Simulating Sequences from an RNN
Figures from http://cs231n.stanford.edu/slides/2017/
![Page 84: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/84.jpg)
Simulating Sequences from an RNN
Figures from http://cs231n.stanford.edu/slides/2017/
![Page 85: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/85.jpg)
Simulating Sequences from an RNN
Figures from http://cs231n.stanford.edu/slides/2017/
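Simulation can be sketched as a loop: compute the state, turn the output scores into probabilities, sample one character, and feed it back in as the next input. The four-character vocabulary, the layer sizes, the tanh and softmax choices, and the random (untrained) weights below are all assumptions for illustration, so the generated string is meaningless; a trained model would use learned weights instead.

```python
import math
import random

VOCAB = ["h", "e", "l", "o"]  # assumed toy character vocabulary


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Convert scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(W_xh, W_hh, W_hy, first_index, length, rng):
    """Generate characters one at a time, feeding each sample back as input."""
    n = len(VOCAB)
    h = [0.0] * len(W_hh)
    idx = first_index
    out = [VOCAB[idx]]
    for _ in range(length - 1):
        x = [1.0 if j == idx else 0.0 for j in range(n)]  # one-hot input
        h = [math.tanh(a + b) for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        probs = softmax(matvec(W_hy, h))
        # Sample from the predicted distribution rather than taking the maximum.
        idx = rng.choices(range(n), weights=probs, k=1)[0]
        out.append(VOCAB[idx])
    return "".join(out)


rng = random.Random(0)
H = 3  # hidden units


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


W_xh = rand_matrix(H, len(VOCAB))
W_hh = rand_matrix(H, H)
W_hy = rand_matrix(len(VOCAB), H)

text = sample_sequence(W_xh, W_hh, W_hy, first_index=0, length=10, rng=rng)
print(text)
```

Sampling rather than always picking the most probable character is what makes repeated simulations produce different sequences.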
![Page 86: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/86.jpg)
Output from an RNN Model Trained on Shakespeare
KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.
Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.
DUKE VINCENTIO: Well, your wit is in the care of side and that.
Examples from “The Unreasonable Effectiveness of Recurrent Neural Networks”, Andrej Karpathy, blog, http://karpathy.github.io/2015/05/21/rnn-effectiveness/
![Page 87: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/87.jpg)
Output from an RNN Model Trained on Cooking Recipes
From https://gist.github.com/nylki/1efbaa36635956d35bcc
![Page 88: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/88.jpg)
Different Recurrent Network Models
[Figure: recurrent network input/output configurations, labeled with example applications: image classification; image captioning; sentiment analysis; machine translation; synced sequence (video classification)]
![Page 89: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/89.jpg)
Part-of-Speech Tagging
![Page 90: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/90.jpg)
Part of Speech (POS) Tagging
• Common POS categories (or tags) in English:
  – Noun, verb, article, preposition, pronoun, adverb, conjunction, interjection
• However, there are many more specialized categories
  – E.g., proper nouns: ‘Toronto’, ‘Smith’, …
  – E.g., comparative adjectives: ‘bigger’, ‘smaller’, …
  – E.g., symbols: ‘3.12’, ‘$’, …
• Assigning POS categories to words in text is known as tagging
![Page 91: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/91.jpg)
Universal Tagset (as used in NLTK)
12 universal tags:
VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation
See "A Universal Part-of-Speech Tagset" by Slav Petrov, Dipanjan Das and Ryan McDonald for more details: http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/
![Page 92: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/92.jpg)
Universal Tagset (used in NLTK)
Tag    Meaning              English Examples
ADJ    adjective            new, good, high, special, big, local
ADP    adposition           on, of, at, with, by, into, under
ADV    adverb               really, already, still, early, now
CONJ   conjunction          and, or, but, if, while, although
DET    determiner, article  the, a, some, most, every, no, which
NOUN   noun                 year, home, costs, time, Africa
NUM    numeral              twenty-four, fourth, 1991, 14:24
PRT    particle             at, on, out, over per, that, up, with
PRON   pronoun              he, their, her, its, my, I, us
VERB   verb                 is, say, told, given, playing, would
.      punctuation marks    . , ; !
X      other                ersatz, esprit, dunno, gr8, univeristy
(from Section 2.3 in Chapter 5 of NLTK Book)
![Page 93: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/93.jpg)
POS Tagging Algorithms
• Tagging Algorithms: “POS Taggers”
  – Tagging is often done automatically with algorithms
  – These algorithms often use sequential (Markov) models
    • The tag for a particular token can depend on words/tags before and after it
  – These models are trained using machine learning
    • using various word features and dictionaries as input
  – Trained on manually labeled documents
    • Tagging performance is best on material that the tagger was originally trained on (often news documents)
• Tags can be helpful “downstream” for various applications
  – E.g., for document classification we might want to only use nouns, adjectives, and verbs and ignore everything else
  – E.g., for information extraction we might focus only on nouns
![Page 94: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/94.jpg)
Challenges in POS Tagging
• Tagging words in text with their correct POS tags is not simply assigning words to tags using a lookup table
• Semantic context
  – The negotiator was able to bridge the gap between the 2 sides
  – Here ‘bridge’ is used as a verb even though we ordinarily think of it as a noun
  – The other words in the sentence and the grammatical structure allow us to interpret ‘bridge’ here as a verb
• Ambiguity, e.g.,
  – The president was entertaining last night
    • Both the adjective and verb tag for “entertaining” work here, i.e., there is ambiguity
• Tokenization issues
  – The algorithm must be able to deal with tokens such as I’ld or ‘pre-specified’
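The lookup-table problem is easy to reproduce. The sketch below uses a hypothetical hand-made table that maps each word to its single most usual universal tag; it is not how NLTK or any real tagger works, and the table entries are assumptions for illustration.

```python
# Hypothetical lookup table: each word mapped to one "usual" universal tag.
LOOKUP = {
    "the": "DET", "negotiator": "NOUN", "was": "VERB", "able": "ADJ",
    "to": "PRT", "bridge": "NOUN", "gap": "NOUN", "between": "ADP",
    "2": "NUM", "sides": "NOUN",
}


def lookup_tag(tokens):
    """Tag each token by table lookup alone, ignoring its context."""
    return [(tok, LOOKUP.get(tok.lower(), "X")) for tok in tokens]


sentence = "The negotiator was able to bridge the gap between the 2 sides".split()
tagged = lookup_tag(sentence)
bridge_tag = dict(tagged)["bridge"]

for token, tag in tagged:
    print(f"{token:<12}{tag}")
print("Lookup tag for 'bridge':", bridge_tag, "(it is used as a verb here)")
```

The table has no way to notice that "bridge" follows "able to", which is the kind of context a sequential tagger uses to choose the verb tag instead.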
![Page 95: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/95.jpg)
Software for Part of Speech Tagging
• Many software packages and online tools available
• NLTK POS Tagger
• Stanford natural language group provides excellent POS taggers in several languages (English, German, Chinese, French, etc.)
  • http://nlp.stanford.edu/software/tagger.shtml
  • Uses the Penn Treebank tagset
• Online demo
  – http://demo.ark.cs.cmu.edu/parse
![Page 96: Stats 170A: Project in Data Science Text Analysis and ...](https://reader034.fdocuments.in/reader034/viewer/2022042516/6261927940ea0b3e2518fcd5/html5/thumbnails/96.jpg)
Example: Removing Stop Words with a POS Tagger
• Filter out any term that does not belong to a (user-defined) target set of POS classes
SPEAKER  WORD    POS TAG
PATIENT  if      IN   Preposition
PATIENT  you     PRP  Personal Pronoun
PATIENT  had     VBD  Verb, Past Tense
PATIENT  a       DT   Determiner
PATIENT  magic   JJ   Adjective
PATIENT  potion  NN   Noun
PATIENT  i       PRP  Personal Pronoun
PATIENT  ‘d      MD   Modal
PATIENT  love    VB   Verb, base form
PATIENT  to      TO
PATIENT  have    VB   Verb, base form
PATIENT  it      PRP  Personal Pronoun
Example rule: extract adjectives and nouns only
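The example rule can be sketched directly on the tagged tokens from the table. The (word, tag) pairs below are copied from the table rather than produced by a tagger, and matching on the tag prefixes "JJ" and "NN" (so that variants such as plural nouns would also be kept) is an assumption about how the target set might be defined.

```python
# (word, Penn Treebank tag) pairs copied from the table above.
TAGGED = [
    ("if", "IN"), ("you", "PRP"), ("had", "VBD"), ("a", "DT"),
    ("magic", "JJ"), ("potion", "NN"), ("i", "PRP"), ("'d", "MD"),
    ("love", "VB"), ("to", "TO"), ("have", "VB"), ("it", "PRP"),
]


def keep_by_tag_prefix(tagged, prefixes=("JJ", "NN")):
    """Keep only words whose tag starts with one of the target prefixes."""
    return [word for word, tag in tagged if tag.startswith(tuple(prefixes))]


kept = keep_by_tag_prefix(TAGGED)
print(kept)  # ['magic', 'potion']
```

Changing `prefixes` to include "VB" would also keep the verbs, which corresponds to the nouns, adjectives, and verbs filter suggested for document classification.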