DataCamp Natural Language Processing Fundamentals in Python

Introduction to regular expressions

NATURAL LANGUAGE PROCESSING FUNDAMENTALS IN PYTHON

Katharine Jarmul
Founder, kjamistan


What is Natural Language Processing?

- Field of study focused on making sense of language
  - Using statistics and computers
- You will learn the basics of NLP
  - Topic identification
  - Text classification
- NLP applications include:
  - Chatbots
  - Translation
  - Sentiment analysis
  - ... and many more!

What exactly are regular expressions?

- Strings with a special syntax
- Allow us to match patterns in other strings
- Applications of regular expressions:
  - Find all web links in a document
  - Parse email addresses, remove/replace unwanted characters

In [1]: import re

In [2]: re.match('abc', 'abcdef')
Out[2]: <_sre.SRE_Match object; span=(0, 3), match='abc'>

In [3]: word_regex = r'\w+'

In [4]: re.match(word_regex, 'hi there!')
Out[4]: <_sre.SRE_Match object; span=(0, 2), match='hi'>
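A match object carries more than its printed form suggests. The sketch below reuses the `word_regex` pattern from the session above, written as a raw string, and shows one way the matched text and its span might be read back. The variable names `sentence`, `matched_text`, `matched_span`, and `no_match` are illustrative and do not come from the slides.

```python
import re

# Raw string so the backslash reaches the regex engine unchanged.
word_regex = r'\w+'

# `sentence` is an illustrative name, not part of the original slides.
sentence = 'hi there!'

m = re.match(word_regex, sentence)
matched_text = m.group()   # the text that matched
matched_span = m.span()    # (start, end) offsets into the string

print(matched_text, matched_span)

# re.match returns None when the pattern does not match at the start,
# so the result should be checked before calling .group() on it.
no_match = re.match(word_regex, '!hi')
print(no_match)
```

Checking for `None` before calling `.group()` avoids an `AttributeError` when the pattern does not match.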

Common regex patterns

pattern    matches           example
\w+        word              'Magic'
\d         digit             9
\s         space             ' '
.*         wildcard          'username74'
+ or *     greedy match      'aaaaaa'
\S         not space         'no_spaces'
[a-z]      lowercase group   'abcdefg'
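The patterns in the table can be tried out with `re.findall` and `re.match`. The following is a minimal sketch; the sample string `text` is made up from the table's example values, and the variable names are illustrative.

```python
import re

# Illustrative sample built from the example column of the table.
text = 'Magic 9 username74 no_spaces'

words = re.findall(r'\w+', text)       # runs of word characters (letters, digits, _)
digits = re.findall(r'\d', text)       # single digits
spaces = re.findall(r'\s', text)       # single whitespace characters
non_space = re.findall(r'\S+', text)   # runs of non-whitespace characters
lower = re.findall(r'[a-z]+', text)    # runs of lowercase ASCII letters

# + and * are greedy: they consume as many characters as they can.
greedy = re.match(r'a+', 'aaaaaa').group()

# .* matches any run of characters (except newlines by default).
wildcard = re.match(r'.*', 'username74').group()

print(words, digits, lower, greedy, wildcard)
```

Note that `[a-z]+` skips the capital `M` in `Magic` and stops at the underscore in `no_spaces`, while `\w+` accepts both digits and underscores.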

Python's re Module

- re module
  - split: split a string on regex
  - findall: find all patterns in a string
  - search: search for a pattern
  - match: match a pattern against a string, starting at the beginning of the string
- Pattern first, and the string second
- May return a list, an iterator, a string, or a match object

In [5]: re.split(r'\s+', 'Split on spaces.')
Out[5]: ['Split', 'on', 'spaces.']
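To compare the four functions side by side, the sketch below runs each of them on one sample string. The string and the variable names are illustrative assumptions, not taken from the slides; in every call the pattern comes first and the string second.

```python
import re

# Illustrative sample string.
text = 'Split on spaces, then find 2 numbers: 10 and 20.'

parts = re.split(r'\s+', text)           # list of substrings between matches
numbers = re.findall(r'\d+', text)       # list of every non-overlapping match
first_number = re.search(r'\d+', text)   # match object for the first hit, or None
at_start = re.match(r'\d+', text)        # None here: the text does not start with a digit

print(parts[:3])
print(numbers)
print(first_number.group())
print(at_start)
```

`split` and `findall` return lists, while `search` and `match` return either a match object or `None`.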

Let's practice!

Introduction to tokenization

What is tokenization?

- Turning a string or document into tokens (smaller chunks)
- One step in preparing a text for NLP
- Many different theories and rules
- You can create your own rules using regular expressions
- Some examples:
  - Breaking out words or sentences
  - Separating punctuation
  - Separating all hashtags in a tweet
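As a rough illustration of writing your own tokenization rules with regular expressions, the sketch below separates punctuation from words and pulls the hashtags out of a tweet. The tweet text and both patterns are assumptions made for this example; they are one possible rule set, not a standard one.

```python
import re

# Made-up tweet for illustration.
tweet = 'Loving #NLP and #Python, thanks @DataCamp!'

# Rule 1: words and punctuation marks become separate tokens.
# \w+ grabs a run of word characters; [^\w\s] grabs one character
# that is neither a word character nor whitespace.
tokens = re.findall(r'\w+|[^\w\s]', tweet)

# Rule 2: keep only the hashtags, with the leading '#'.
hashtags = re.findall(r'#\w+', tweet)

print(tokens)
print(hashtags)
```

Under the first rule the `#` and `@` signs are split off as separate tokens, which is why a dedicated hashtag rule can be useful for tweets.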

nltk library

- nltk: natural language toolkit

In [1]: from nltk.tokenize import word_tokenize

In [2]: word_tokenize("Hi there!")
Out[2]: ['Hi', 'there', '!']

Why tokenize?

- Easier to map part of speech
- Matching common words
- Removing unwanted tokens
- "I don't like Sam's shoes."
- "I", "do", "n't", "like", "Sam", "'s", "shoes", "."
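Once a sentence is a list of tokens, removing unwanted tokens and matching common words become simple list operations. The sketch below starts from the token list shown on the slide; the `stop_words` set is a small hypothetical list invented for this example, not one shipped by any library.

```python
import string

# Tokens as shown on the slide for "I don't like Sam's shoes."
tokens = ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']

# Removing unwanted tokens: drop tokens made up only of punctuation.
no_punct = [t for t in tokens
            if not all(ch in string.punctuation for ch in t)]

# Matching common words: a tiny, hypothetical stop-word set.
stop_words = {'i', 'do', "n't", "'s"}
content_words = [t for t in no_punct if t.lower() not in stop_words]

print(no_punct)
print(content_words)
```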

Other nltk tokenizers

- sent_tokenize: tokenize a document into sentences
- regexp_tokenize: tokenize a string or document based on a regular expression pattern
- TweetTokenizer: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!!
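To give a feel for what these tokenizers do, the sketch below imitates two of them with the standard `re` module only. It is a naive approximation under stated assumptions: the patterns are invented for this example, the sample texts are made up, and the real nltk tokenizers handle many more cases (abbreviations, emoticons, URLs) than these rules do.

```python
import re

# Rough stand-in for sentence tokenization: split on whitespace that
# follows sentence-ending punctuation. Fails on abbreviations like "Dr.".
document = 'Tokenizers are useful. Do you agree? I do!'
sentences = re.split(r'(?<=[.!?])\s+', document)

# Rough stand-in for tweet tokenization: keep hashtags and mentions
# whole, and keep runs of exclamation points together.
tweet = 'Thanks @example_user for the #NLP tools!!!'
tweet_tokens = re.findall(r'[@#]\w+|\w+|!+|[^\w\s]', tweet)

print(sentences)
print(tweet_tokens)
```

The order of the alternatives matters: `[@#]\w+` is tried first so that `#NLP` is not split into `#` and `NLP`.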

More regex practice

- Difference between re.search() and re.match()

In [1]: import re

In [2]: re.match('abc', 'abcde')
Out[2]: <_sre.SRE_Match object; span=(0, 3), match='abc'>

In [3]: re.search('abc', 'abcde')
Out[3]: <_sre.SRE_Match object; span=(0, 3), match='abc'>

In [4]: re.match('cd', 'abcde')

In [5]: re.search('cd', 'abcde')
Out[5]: <_sre.SRE_Match object; span=(2, 4), match='cd'>
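The session above can be condensed into a small script that makes the difference explicit: `re.match` only tries the pattern at the start of the string, while `re.search` scans forward for the first position where it matches. The variable names are illustrative.

```python
import re

text = 'abcde'

match_start = re.match('abc', text)     # matches: 'abc' is at position 0
search_start = re.search('abc', text)   # same result as match here

match_middle = re.match('cd', text)     # None: 'cd' is not at position 0
search_middle = re.search('cd', text)   # found at positions 2..4

# Anchoring a search at the start of the string behaves like match.
anchored = re.search(r'\Acd', text)     # None, just like re.match('cd', text)

print(match_start.span(), search_start.span())
print(match_middle, search_middle.span(), anchored)
```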

Let's practice!

Advanced tokenization with regex

Regex groups using or "|"

- OR is represented using |
- You can define a group using ()
- You can define explicit character ranges using []

In [1]: import re

In [2]: match_digits_and_words = r'(\d+|\w+)'

In [3]: re.findall(match_digits_and_words, 'He has 11 cats.')
Out[3]: ['He', 'has', '11', 'cats']
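Building on the session above, the sketch below reuses the same grouped pattern and then labels each token as digits or a word. The labelling step with `re.fullmatch` and the names `tokens` and `kinds` are additions for illustration only.

```python
import re

# Same pattern as in the session above, written as a raw string.
match_digits_and_words = r'(\d+|\w+)'

# With a single group in the pattern, findall returns the group's text.
tokens = re.findall(match_digits_and_words, 'He has 11 cats.')

# Illustrative extra step: label each token by which alternative fits.
kinds = [(tok, 'digits' if re.fullmatch(r'\d+', tok) else 'word')
         for tok in tokens]

print(tokens)
print(kinds)
```

The trailing period is dropped because neither `\d+` nor `\w+` matches punctuation.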

Regex ranges and groups

pattern          matches                                          example
[A-Za-z]+        upper and lowercase English alphabet             'ABCDEFghijk'
[0-9]            numbers from 0 to 9                              9
[A-Za-z\-\.]+    upper and lowercase English alphabet, - and .    'My-Website.com'
(a-z)            a, - and z                                       'a-z'
(\s+|,)          spaces or a comma                                ','
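The difference between a range in square brackets and a group in parentheses is easy to miss, so the sketch below tries each row of the table on a short sample. The sample strings and variable names are assumptions chosen for this example.

```python
import re

# [] defines a character range: any letter, upper or lowercase.
letters = re.findall(r'[A-Za-z]+', 'ABCDEFghijk 123')

# A single digit from 0 to 9.
digits = re.findall(r'[0-9]', 'route 66')

# Letters plus a literal hyphen and a literal dot.
domain = re.match(r'[A-Za-z\-\.]+', 'My-Website.com').group()

# () defines a group, not a range: (a-z) matches the literal text 'a-z'.
literal = re.findall(r'(a-z)', 'a-z is not abc')

# A group with OR: a run of whitespace, or a comma.
separators = re.findall(r'(\s+|,)', 'one, two  three')

print(letters, digits, domain, literal, separators)
```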

Character range with re.match()

In [1]: import re

In [2]: my_str = 'match lowercase spaces nums like 12, but no commas'

In [3]: re.match('[a-z0-9 ]+', my_str)
Out[3]: <_sre.SRE_Match object; span=(0, 35), match='match lowercase spaces nums like 12'>
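The match stops at the first character that is outside the character range. The sketch below, which assumes the sample string contains ordinary spaces and that the range includes a space, checks where the match ends and what happens when a comma is added to the range. The names `m`, `next_char`, and `with_comma` are illustrative.

```python
import re

my_str = 'match lowercase spaces nums like 12, but no commas'

# Range of lowercase letters, digits, and the space character.
m = re.match(r'[a-z0-9 ]+', my_str)
matched = m.group()
end = m.end()

# The character right after the match is the one that stopped it.
next_char = my_str[end]

# Adding the comma to the range lets the match run to the end.
with_comma = re.match(r'[a-z0-9 ,]+', my_str).group()

print(repr(matched), end, repr(next_char))
print(with_comma == my_str)
```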

Let's practice!

Charting word length with nltk

Getting started with matplotlib

- Charting library used by many open source Python projects
- Straightforward functionality with lots of options
  - Histograms
  - Bar charts
  - Line charts
  - Scatter plots
- ... and also advanced functionality like 3D graphs and animations!

Plotting a histogram with matplotlib

In [1]: from matplotlib import pyplot as plt

In [2]: plt.hist([1, 5, 5, 7, 7, 7, 9])
Out[2]:
(array([1., 0., 0., 0., 0., 2., 0., 3., 0., 1.]),
 array([1. , 1.8, 2.6, 3.4, 4.2, 5. , 5.8, 6.6, 7.4, 8.2, 9. ]),
 <a list of 10 Patch objects>)

In [3]: plt.show()
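The two arrays in the output are the count in each bin and the bin edges. The pure-Python sketch below reproduces those numbers by hand to show what they mean. It is not matplotlib's implementation; it assumes 10 equal-width bins between the smallest and largest value, with the last bin including its right edge, which is consistent with the output shown above.

```python
values = [1, 5, 5, 7, 7, 7, 9]
bins = 10  # assumed default number of bins

lo, hi = min(values), max(values)
width = (hi - lo) / bins

# Eleven edges delimit ten bins.
edges = [lo + i * width for i in range(bins + 1)]

counts = [0] * bins
for v in values:
    # The largest value would land at index `bins`, so clamp it
    # into the last bin.
    index = min(int((v - lo) / width), bins - 1)
    counts[index] += 1

print(counts)
print([round(e, 1) for e in edges])
```

Most bins are empty because seven values are spread over ten narrow bins, which is why the counts array is mostly zeros.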

Generated Histogram

[Figure: generated histogram]

Combining NLP data extraction with plotting

In [1]: from matplotlib import pyplot as plt

In [2]: from nltk.tokenize import word_tokenize

In [3]: words = word_tokenize("This is a pretty cool tool!")

In [4]: word_lengths = [len(w) for w in words]

In [5]: plt.hist(word_lengths)
Out[5]:
(array([2., 0., 1., 0., 0., 0., 3., 0., 0., 1.]),
 array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ]),
 <a list of 10 Patch objects>)

In [6]: plt.show()
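The same word-length counts can be checked without plotting. The sketch below uses only the standard library: `re.findall` stands in for `word_tokenize` (an assumption that holds for this simple sentence but not in general), and `collections.Counter` tallies the lengths before they are printed as a text histogram.

```python
import re
from collections import Counter

sentence = 'This is a pretty cool tool!'

# Stand-in for word_tokenize: words, plus each punctuation mark
# as its own token. This is an approximation for illustration.
words = re.findall(r'\w+|[^\w\s]', sentence)

word_lengths = [len(w) for w in words]
length_counts = Counter(word_lengths)

# Text histogram: one '#' per token of that length.
for length in sorted(length_counts):
    print(f'{length}: ' + '#' * length_counts[length])
```

The tallies agree with the non-zero counts in the `plt.hist` output above: two tokens of length 1, one of length 2, three of length 4, and one of length 6.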

Word length histogram

[Figure: word length histogram]

Let's practice!