Python and Web Data Extraction: IntroductionJul 01, 2016 · Python Basics 2. Web Scraping 3....
Transcript of Python and Web Data Extraction: IntroductionJul 01, 2016 · Python Basics 2. Web Scraping 3....
PythonandWebDataExtraction:Introduction
Alvin Zuyin [email protected]
http://community.mis.temple.edu/zuyinzheng/
About
• AboutMe:Alvin Zuyin Zheng ,MIS
• Following up with the workshop in May
• Organizers:Sudipta Basu, Lalitha Naveen, JingGong
• Thisworkshop issupportedbytheOfficeofResearchandDoctoralProgramsatFox.ThanksPaul,Lindsayandeveryone intheOfficeofResearch!
About
• StudentAssistants:– ShawnJNiederriter– Xue Guo– Zhe Deng
• Website:http://community.mis.temple.edu/zuyinzheng/pythonworkshop/
• Maintenance from 12:00 to 2:00pm!
Topics
1. PythonBasics2. WebScraping3. IntroductiontoNaturalLanguageProcessing
• Nopriorprogrammingexperienceneeded
Schedule9:50am WelcomeandSetUp
10:00am Session1–Pythonbasics11:00am CoffeeBreak11:20am Session2–WebScraping(Part1)12:20pm LunchBreak1:20pm Session3–WebScraping(Part2)2:20pm CoffeeBreak2:40pm Session4– Basic IntrotoNatural
LanguageProcessing3:45pm ClosingRemarksandQuestions
Prerequisites• Beforetheworkshop,yourcomputerneedsthefollowingtoolsinstalledandworkingtoparticipate.– Acommand-line interface tointeractwithyourcomputer
– Atexteditortoworkwithplaintextfiles– Python2.7– Thepip packagemanagerforPython– A browser that can view web source code like Chrome
(Pleasefollowthesetupguidepostedhere)
Outline
• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript
WhyPython?• Simple• Easytolearn• Freeandopensource• Portableacrossplatforms• Withextensivelibraries
• Python2versus3:– Verydifferent– WewillusethelatestversionofPython2(LatestversionisPython2.7.12)
PythonIDLE(InteractiveShell)• ThePython IDLE providesaninteractive environment to
playwiththelanguage
• OpenPython IDLE– OnWindows:taptheWindowskeyonyourkeyboardandtype“idle”
toopenthe“IDLE(PythonGUI)”
– OnMac:useCmd+Space andtype“idle”toselectthe“IDLE.”
PythonIDLE(InteractiveShell)
>>> 1+34>>> workshop = "Python">>> workshop'Python'>>> print "hello"hello
•Youcantypecommands directlyintotheinteractiveshell
•Resultsofexpressionsareprintedonthescreen
Variablesareassignedusingthe“=“sign
Indentation• Pythonuses indentation(usuallyfourspaces) tostructureablockofcodes– nocurlybraces {} tomarkwherethefunctioncodestartsandstops
>>> x = 10
>>> if x>10:
... print "x is larger than 10"
... else:
... print "x is less than or equal to 10"
x is less than or equal to 10
Returns:
Comments
>>> print "hello" #this is a comment
Hello
>>> #this is a comment
>>>
• Comments: Textsmainlyusefulasnotesforthereaderofthescript.
• Aretotherightofthe#symbol
Outline
• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript
BasicDataTypes• Numbers– Integers
– Floats
–Calculations
>>> income = 100.2
>>> month = 3
>>> 100/month33>>> income/month33.4
BasicDataTypes• Strings– Specifystringsusingeithersinglequotesordoublequotes
– Usetriplequotesforstringsacrossmultiplelines
– Concatenate stringsusing“+”
>>> gender = "Male">>> name = 'Jack'
>>> paragraph = """This is a long paragraph with multiple lines."""
>>> name + gender'JackMale'
List[]
• List:Anorderedcollectionofdata– Youcanhaveanything inalist:
– Uselen()togetthelengthofalist
>>> [0]>>> [2.3, 4.5]>>> [5, "Hello", "there", 9.8]>>> range(4)[0, 1, 2, 3]
>>> names=["Ben","Jack","Lee","Nick"] >>> len(names)4
AccessingandUpdatingValuesinLists• Use[]toaccessvaluesinthelist
• Updatevalues
>>> names=["Ben","Jack","Lee","Nick"] >>> names[0]'Ben'>>> names[1:3]['Jack', 'Lee']>>> names[1:]['Jack', 'Lee', 'Nick']>>> [names[i] for i in [1,3]] ['Jack', 'Nick']
>>> names[1]="Ann">>> names['Ben', 'Ann', 'Lee', 'Nick']
[0]:the1st value[1]:the2nd value…
TolearnmoreaboutPythonLists:Visithere
OtherTypes• Dictionaries{}– consistofkey-valuepairs.
• Tuples()– Similartolist,butcannotbeupdated.
>>> dict = {'name': 'Ann', 'age': 10}
>>> tup = ('Ann', 10)>>> tup[1]=12Traceback (most recent call last):
File "<pyshell#25>", line 1, in <module>tup[1]=12
TypeError: 'tuple' object does not support item assignment
Outline
• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript
If• Theif statementchecksacondition
• Andcanbecombinedwith…elif …else
>>> age = 10>>> if age>10:... print "age is greater than 10"... elif age<10:... print "age is less than 10"... else:... print "age is equal to 10"
age is equal to 10
For
• Thefor statementloopsiterateovereachvalueinalist
>>> for i in range(4):... print "i: ",i
i: 0i: 1i: 2i: 3
Break
>>> for i in range(4):... print "i: ",i... if i>1:... print "Exiting the for loop"... break
i: 0i: 1i: 2Exiting the for loop
• break canbeusedtostoptheforloop
Outline
• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript
Packages(Modules)
• Pythonitselfonlyhasalimitedsetoffunctionalities.
• Packagesprovideadditionalfunctionalities.
• Use“import”toloadapackage>>> import math>>> import os
InstallPackagesUsingpip• Forstandardpackages• Youdonotneedtobeinstalledmanually
• Forthird-partypackages• Usepip inyourcommand lineinterface toinstall
pip install SomePackage
• Forexample,toinstallthebeautifulsouppackage:pip install beautifulsoup4
Seethese instructionsforhowtoopenthecommandline interface.• OnWindowsitiscalled“CommandPrompt.”• OnMacitiscalled“Terminal.”
Functions
• Afunction isanamedsequenceofstatementsthatperformsadesiredoperation
• Defineafunction:>>> def f(x):... return x*2
>>> f(1)2
>>> f(0.5)1
Outline
• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript
• WorkingDirectory:ThinkofitasthefolderyourPythonisoperatinginsideatthemoment
• Getthecurrentworkingdirectory:
• Changethecurrentworkingdirectory:
WorkingDirectory
>>> import os>>> os.getcwd()'C:\\Python27'
>>> os.chdir("C:/users/jing")>>> os.getcwd()'C:\\users\\jing'
Useforwardslash (/)ordoublebackslash(\\)whenspecifyingdirectories.
• Touseafile,youhavetoopenitusingtheopen function
(Note:Thefollowingcodesassumesthatyourfilesareinyourcurrentworkingdirectory)
• Whendone,youhavetocloseit usingtheclose function
FileOpen/Close
>>> input_file = open("in.txt", "r")
>>> output_file = open("out.txt", "w")
>>> for line in input_file:
output_file.write(line) "r" meansreading"w" meanswriting"a" meansappending
>>> input_file.close()
>>> output_file.close()
Outline
• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript
• RegularExpressionsareapowerfultextmanipulationtoolfor– searching, replacing,andparsingtextpatterns
• Youneedtoloadthe“re”package
RegularExpressions(RE)
>>> import re
• Supposewehaveatextstring,andwanttoknowifthestringhastheword“cat”init…
• re.search(pattern, string[, flags])– findthefirst locationwherethepattern producesamatch
Exactmatch:Anexample
>>> import re
>>> teststring = """ A fat cat doesn't eat oat but a rat eats bats. """
>>> match = re.search("cat",teststring)
>>> print match.group()
cat
" A fat cat doesn't eat oat but a rat eats bats. "
String:
ExtractingTexts
• Howtoextract everythingbetween“cat”and“rat”?
• Wecandefineapattern "cat(.*?)rat"– (.*?) representseverything inbetween“cat”and“rat”
>>> match = re.search("cat(.*?)rat",teststring)>>> print match.group(0)cat doesn't eat oat but a rat>>> print match.group(1)doesn't eat oat but a
"A fat cat doesn't eat oat but a rat eats bats."
String:
FindAllMatchedSubtrings
• re.findall(pattern, string[, flags])– Findallmatchedsubstringswithagivenpattern
• ".at" isapatternthatmatchesanyof'fat','cat','eat','oat', 'rat','eat'(or'1at','2at',‘aat',‘bat‘…)
>>> import re
>>> teststring = """ A fat cat doesn't eat oat but a rat eats bats. """ >>> match = re.findall(".at",teststring)
>>> print match
['fat', 'cat', 'eat', 'oat', 'rat', 'eat']
EscapeCharacter• Specialcharactersoftencauseproblemsbecausetheyare
usedtodefinepatterns
• Ifyouwantaspecialcharacter tojustbehavenormally(mostofthetime)youprefixitwithbackslash(\ )
• Escape Meaning\\ \\ ' '\ " "\n newline\t tab\r return\. .
MoreaboutRegularExpression• Python’sofficialRegularExpressionHOWTO:– https://docs.python.org/2/howto/regex.html#regex-howto
• Google’sPythonRegularExpressionTutorial:– https://developers.google.com/edu/python/regular-expressions
• Testyourregularexpression:– https://regex101.com/#python
Outline
• Overview• DataTypes• ControlFlow• PackagesandFunctions• FileInput/Output• RegularExpression• Tutorial1.FirstRunningtheFirstPythonScript
Tutorial1:RunningYourFirstPythonScript• DownloadtheFirstPythonScript.py filefromthewebsiteandtrytorunit
• Steps1. Locatethe.py fileyou’d liketoruninyourfolder2. Openthethe .py filewithIDLE3. Clickthe“Run”menuandchoose“RunModule”
OnlineRecoursesforPython• Python'sBeginnersGuide listedmanyonlinebooksandtutorials:
https://wiki.python.org/moin/BeginnersGuide/Programmers
• Python'sOfficialTutorial: https://docs.python.org/2/tutorial/
• PythonBasicTutorialbyTutorialsPoint: http://www.tutorialspoint.com/python/
• learnpython.org: http://www.learnpython.org/
• Google'sPythonClass: https://developers.google.com/edu/python/