Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3....

40
Python and Web Data Extraction: Introduction Alvin Zuyin Zheng [email protected] http://community.mis.temple.edu/zuyinzheng/

Transcript of Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3....

Page 1: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

PythonandWebDataExtraction:Introduction

Alvin Zuyin [email protected]

http://community.mis.temple.edu/zuyinzheng/

Page 2: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

About

• AboutMe:Alvin Zuyin Zheng ,MIS

• Following up with the workshop in May

• Organizers:Sudipta Basu, Lalitha Naveen, JingGong

• Thisworkshop issupportedbytheOfficeofResearchandDoctoralProgramsatFox.ThanksPaul,Lindsayandeveryone intheOfficeofResearch!

Page 3: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

About

• StudentAssistants:– ShawnJNiederriter– Xue Guo– Zhe Deng

• Website:http://community.mis.temple.edu/zuyinzheng/pythonworkshop/

• Maintenance from 12:00 to 2:00pm!

Page 4: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Topics

1. PythonBasics2. WebScraping3. IntroductiontoNaturalLanguageProcessing

• Nopriorprogrammingexperienceneeded

Page 5: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Schedule9:50am WelcomeandSetUp

10:00am Session1–Pythonbasics11:00am CoffeeBreak11:20am Session2–WebScraping(Part1)12:20pm LunchBreak1:20pm Session3–WebScraping(Part2)2:20pm CoffeeBreak2:40pm Session4– Basic IntrotoNatural

LanguageProcessing3:45pm ClosingRemarksandQuestions

Page 6: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

PythonandWebDataExtraction:PythonBasics

[email protected]

http://community.mis.temple.edu/gong

Page 7: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Prerequisites• Beforetheworkshop,yourcomputerneedsthefollowingtoolsinstalledandworkingtoparticipate.– Acommand-line interface tointeractwithyourcomputer

– Atexteditortoworkwithplaintextfiles– Python2.7– Thepip packagemanagerforPython– A browser that can view web source code like Chrome

(Pleasefollowthesetupguidepostedhere)

Page 8: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Outline

• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript

Page 9: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

WhyPython?• Simple• Easytolearn• Freeandopensource• Portableacrossplatforms• Withextensivelibraries

• Python2versus3:– Verydifferent– WewillusethelatestversionofPython2(LatestversionisPython2.7.12)

Page 10: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

PythonIDLE(InteractiveShell)• ThePython IDLE providesaninteractive environment to

playwiththelanguage

• OpenPython IDLE– OnWindows:taptheWindowskeyonyourkeyboardandtype“idle”

toopenthe“IDLE(PythonGUI)”

– OnMac:useCmd+Space andtype“idle”toselectthe“IDLE.”

Page 11: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

PythonIDLE(InteractiveShell)

>>> 1+34>>> workshop = "Python">>> workshop'Python'>>> print "hello"hello

•Youcantypecommands directlyintotheinteractiveshell

•Resultsofexpressionsareprintedonthescreen

Variablesareassignedusingthe“=“sign

Page 12: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Indentation• Pythonuses indentation(usuallyfourspaces) tostructureablockofcodes– nocurlybraces {} tomarkwherethefunctioncodestartsandstops

>>> x = 10

>>> if x>10:

... print "x is larger than 10"

... else:

... print "x is less than or equal to 10"

x is less than or equal to 10

Returns:

Page 13: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Comments

>>> print "hello" #this is a comment

Hello

>>> #this is a comment

>>>

• Comments: Textsmainlyusefulasnotesforthereaderofthescript.

• Aretotherightofthe#symbol

Page 14: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Outline

• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript

Page 15: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

BasicDataTypes• Numbers– Integers

– Floats

–Calculations

>>> income = 100.2

>>> month = 3

>>> 100/month33>>> income/month33.4

Page 16: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

BasicDataTypes• Strings– Specifystringsusingeithersinglequotesordoublequotes

– Usetriplequotesforstringsacrossmultiplelines

– Concatenate stringsusing“+”

>>> gender = "Male">>> name = 'Jack'

>>> paragraph = """This is a long paragraph with multiple lines."""

>>> name + gender'JackMale'

Page 17: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

List[]

• List:Anorderedcollectionofdata– Youcanhaveanything inalist:

– Uselen()togetthelengthofalist

>>> [0]>>> [2.3, 4.5]>>> [5, "Hello", "there", 9.8]>>> range(4)[0, 1, 2, 3]

>>> names=["Ben","Jack","Lee","Nick"] >>> len(names)4

Page 18: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

AccessingandUpdatingValuesinLists• Use[]toaccessvaluesinthelist

• Updatevalues

>>> names=["Ben","Jack","Lee","Nick"] >>> names[0]'Ben'>>> names[1:3]['Jack', 'Lee']>>> names[1:]['Jack', 'Lee', 'Nick']>>> [names[i] for i in [1,3]] ['Jack', 'Nick']

>>> names[1]="Ann">>> names['Ben', 'Ann', 'Lee', 'Nick']

[0]:the1st value[1]:the2nd value…

TolearnmoreaboutPythonLists:Visithere

Page 19: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

OtherTypes• Dictionaries{}– consistofkey-valuepairs.

• Tuples()– Similartolist,butcannotbeupdated.

>>> dict = {'name': 'Ann', 'age': 10}

>>> tup = ('Ann', 10)>>> tup[1]=12Traceback (most recent call last):

File "<pyshell#25>", line 1, in <module>tup[1]=12

TypeError: 'tuple' object does not support item assignment

Page 20: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Outline

• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript

Page 21: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

If• Theif statementchecksacondition

• Andcanbecombinedwith…elif …else

>>> age = 10>>> if age>10:... print "age is greater than 10"... elif age<10:... print "age is less than 10"... else:... print "age is equal to 10"

age is equal to 10

Page 22: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

For

• Thefor statementloopsiterateovereachvalueinalist

>>> for i in range(4):... print "i: ",i

i: 0i: 1i: 2i: 3

Page 23: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Break

>>> for i in range(4):... print "i: ",i... if i>1:... print "Exiting the for loop"... break

i: 0i: 1i: 2Exiting the for loop

• break canbeusedtostoptheforloop

Page 24: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Outline

• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript

Page 25: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Packages(Modules)

• Pythonitselfonlyhasalimitedsetoffunctionalities.

• Packagesprovideadditionalfunctionalities.

• Use“import”toloadapackage>>> import math>>> import os

Page 26: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

InstallPackagesUsingpip• Forstandardpackages• Youdonotneedtobeinstalledmanually

• Forthird-partypackages• Usepip inyourcommand lineinterface toinstall

pip install SomePackage

• Forexample,toinstallthebeautifulsouppackage:pip install beautifulsoup4

Seethese instructionsforhowtoopenthecommandline interface.• OnWindowsitiscalled“CommandPrompt.”• OnMacitiscalled“Terminal.”

Page 27: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Functions

• Afunction isanamedsequenceofstatementsthatperformsadesiredoperation

• Defineafunction:>>> def f(x):... return x*2

>>> f(1)2

>>> f(0.5)1

Page 28: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Outline

• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript

Page 29: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

• WorkingDirectory:ThinkofitasthefolderyourPythonisoperatinginsideatthemoment

• Getthecurrentworkingdirectory:

• Changethecurrentworkingdirectory:

WorkingDirectory

>>> import os>>> os.getcwd()'C:\\Python27'

>>> os.chdir("C:/users/jing")>>> os.getcwd()'C:\\users\\jing'

Useforwardslash (/)ordoublebackslash(\\)whenspecifyingdirectories.

Page 30: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

• Touseafile,youhavetoopenitusingtheopen function

(Note:Thefollowingcodesassumesthatyourfilesareinyourcurrentworkingdirectory)

• Whendone,youhavetocloseit usingtheclose function

FileOpen/Close

>>> input_file = open("in.txt", "r")

>>> output_file = open("out.txt", "w")

>>> for line in input_file:

output_file.write(line) "r" meansreading"w" meanswriting"a" meansappending

>>> input_file.close()

>>> output_file.close()

Page 31: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Outline

• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript

Page 32: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

• RegularExpressionsareapowerfultextmanipulationtoolfor– searching, replacing,andparsingtextpatterns

• Youneedtoloadthe“re”package

RegularExpressions(RE)

>>> import re

Page 33: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

• Supposewehaveatextstring,andwanttoknowifthestringhastheword“cat”init…

• re.search(pattern, string[, flags])– findthefirst locationwherethepattern producesamatch

Exactmatch:Anexample

>>> import re

>>> teststring = """ A fat cat doesn't eat oat but a rat eats bats. """

>>> match = re.search("cat",teststring)

>>> print match.group()

cat

" A fat cat doesn't eat oat but a rat eats bats. "

String:

Page 34: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

ExtractingTexts

• Howtoextract everythingbetween“cat”and“rat”?

• Wecandefineapattern "cat(.*?)rat"– (.*?) representseverything inbetween“cat”and“rat”

>>> match = re.search("cat(.*?)rat",teststring)>>> print match.group(0)cat doesn't eat oat but a rat>>> print match.group(1)doesn't eat oat but a

"A fat cat doesn't eat oat but a rat eats bats."

String:

Page 35: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

FindAllMatchedSubtrings

• re.findall(pattern, string[, flags])– Findallmatchedsubstringswithagivenpattern

• ".at" isapatternthatmatchesanyof'fat','cat','eat','oat', 'rat','eat'(or'1at','2at',‘aat',‘bat‘…)

>>> import re

>>> teststring = """ A fat cat doesn't eat oat but a rat eats bats. """ >>> match = re.findall(".at",teststring)

>>> print match

['fat', 'cat', 'eat', 'oat', 'rat', 'eat']

Page 36: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

EscapeCharacter• Specialcharactersoftencauseproblemsbecausetheyare

usedtodefinepatterns

• Ifyouwantaspecialcharacter tojustbehavenormally(mostofthetime)youprefixitwithbackslash(\ )

• Escape Meaning\\ \\ ' '\ " "\n newline\t tab\r return\. .

Page 37: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

MoreaboutRegularExpression• Python’sofficialRegularExpressionHOWTO:– https://docs.python.org/2/howto/regex.html#regex-howto

• Google’sPythonRegularExpressionTutorial:– https://developers.google.com/edu/python/regular-expressions

• Testyourregularexpression:– https://regex101.com/#python

Page 38: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Outline

• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript

Page 39: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

Tutorial1:RunningYourFirstPythonScript• DownloadtheFirstPythonScript.py filefromthewebsiteandtrytorunit

• Steps1. Locatethe.py fileyou’d liketoruninyourfolder2. Openthethe .py filewithIDLE3. Clickthe“Run”menuandchoose“RunModule”

Page 40: Python and Web Data Extraction: IntroductionJul 01, 2016  · Python Basics 2. Web Scraping 3. Introduction to Natural Language Processing ... 11:20 am Session 2–Web Scraping (Part

OnlineRecoursesforPython• Python'sBeginnersGuide listedmanyonlinebooksandtutorials:

https://wiki.python.org/moin/BeginnersGuide/Programmers

• Python'sOfficialTutorial: https://docs.python.org/2/tutorial/

• PythonBasicTutorialbyTutorialsPoint: http://www.tutorialspoint.com/python/

• learnpython.org: http://www.learnpython.org/

• Google'sPythonClass: https://developers.google.com/edu/python/