Scraping web pages
8/18/2019 Scraping web pages
1/48
“Viewing” Web Pages in Python
Charles Severance - www.dr-chuck.com
-
What is Web Scraping?
• When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information, and then looks at more web pages.
http://en.wikipedia.org/wiki/Web_scraping
-
[Diagram: the program sends GET requests to a server and receives HTML in return]
-
Why Scrape?
• Pull data - particularly social data - who links to who?
• Get your own data back out of some system that has no “export capability”
• Monitor a site for new information
-
Scraping Web Pages
• There is some controversy about web page scraping and some sites are a bit snippy about it.
• Google: facebook scraping block
• Republishing copyrighted information is not allowed
• Violating terms of service is not allowed
-
http://www.facebook.com/terms.php
-
http://www.myspace.com/index.cfm?fuseaction=misc.terms
Looks like a loophole... So we will play a bit with MySpace - be respectful - look but never touch..
-
Web Protocols
The Request / Response Cycle
-
Web Standards
• HTML - HyperText Markup Language - a way to describe how pages are supposed to look and act in a web browser
• HTTP - HyperText Transfer Protocol - how your browser communicates with a web server to get more HTML pages and send data to the web server
-
What is so “Hyper” about the web?
• If you think of the whole web as a “space” like “outer space”
• When you click on a link at one point in space - you instantly “hyper-transport” to another place in the space
• It does not matter how far apart the two web pages are
-
HTML - View Source
• The hard way to learn HTML is to look at the source of many web pages.
• There are lots of less than < and greater than > signs
• Buying a good book is much easier
http://www.sitepoint.com/books/html1/
-
HTML - Brief tutorial
The parts of an anchor tag:
<a            Start a hyperlink
href="..."    Where to go
>...          What to show
</a>          End a hyperlink
-
HyperText Transfer Protocol
• HTTP describes how your browser talks to a web server to get the next page.
• That next page will use HTML
• The way the pages are retrieved is HTTP
-
Getting Data From The Server
• Each time the user clicks on an anchor tag with an href= value to switch to a new page, the browser makes a connection to the web server and issues a “GET” request - to GET the content of the page at the specified URL
• The server returns the HTML document to the browser, which formats and displays the document to the user.
-
HTTP Request / Response Cycle
http://www.oreilly.com/openbook/cgi/ch04_02.html
[Diagram: the Browser (Internet Explorer, Firefox, Safari, etc.) sends an HTTP Request to the Web Server, which returns an HTTP Response]
-
HTTP Request / Response Cycle
HTTP Request (Browser to Web Server):
    GET /index.html
    Accept: www/source
    Accept: text/html
    User-Agent: Lynx/2.4 libwww/2.14
HTTP Response (Web Server to Browser):
    ..
    Welcome to my application
    ....
http://www.oreilly.com/openbook/cgi/ch04_02.html
-
“Hacking” HTTP
Last login: Wed Oct 10 04:20:19 on ttyp2
si-csev-mbp:~ csev$ telnet www.umich.edu 80
Trying 141.211.144.188...
Connected to www.umich.edu.
Escape character is '^]'.
GET /
....
[Diagram: typing GET / plays the Browser's part and sends the HTTP Request; the Web Server's HTTP Response follows]
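The telnet session above can be approximated from Python with a plain TCP socket. This is a minimal sketch in Python 3 (the slides themselves use Python 2), assuming an unencrypted HTTP server on port 80. The names build_get_request and fetch_raw are hypothetical helpers, and the HTTP/1.0 token and Host header are additions that the bare GET / typed into telnet leaves out.

```python
import socket


def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.0 GET request (illustrative helper)."""
    lines = [
        "GET {} HTTP/1.0".format(path),
        "Host: {}".format(host),
        "",  # a blank line marks the end of the request headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, path="/", port=80, timeout=10):
    """Send the request over a TCP socket and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch_raw("www.umich.edu") needs network access. What comes back is the response headers followed by the HTML, all as one bytes value.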
-
HTML and HTTP in Python
-
Using urllib to retrieve web pages
-
http://docs.python.org/lib/module-urllib.html
-
• You get the entire web page when you do f.read() - lines are separated by a “newline” character “\n”
• We can split the contents into lines using the split() function
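As a rough illustration of the two bullets above, the sketch below retrieves a page and counts its characters and lines. It assumes Python 3, where the function lives at urllib.request.urlopen and read() returns bytes that must be decoded. The helper names fetch_text and count_chars_and_lines, and the UTF-8 default, are assumptions made for this example.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Retrieve a URL and return the whole body as one string."""
    with urlopen(url) as f:
        raw = f.read()  # the entire page, as bytes in Python 3
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(encoding, errors="replace")


def count_chars_and_lines(contents):
    """Return (number of characters, number of lines) for the text."""
    lines = contents.split("\n")  # a trailing "\n" yields a final empty entry
    return len(contents), len(lines)
```

For example, count_chars_and_lines(fetch_text("http://www.dr-chuck.com")) needs network access and reports figures for whatever the server returns at that moment.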
-
• Splitting the contents on the newline character gives us a nice list where each entry is a single line
• We can easily write a for loop to look through the lines
>>> print len(contents)
95328
>>> lines = contents.split("\n")
>>> print len(lines)
2244
>>> print lines[3]
>>>
for ln in lines:
    # Do something for each line
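The same steps can be tried without a network connection. In this Python 3 sketch, the short contents string is a made-up stand-in for what f.read() would return, and the loop body (collecting non-blank lines into a list named nonblank) is just one possible "something" to do for each line.

```python
# Made-up stand-in for the string returned by f.read(); a real page is far longer.
contents = "<html>\n<body>\n<p>Hello there</p>\n</body>\n</html>"

lines = contents.split("\n")   # one list entry per line
print(len(contents))           # number of characters
print(len(lines))              # number of lines
print(lines[2])                # a single line, chosen by index

# Do something for each line: here, remember the non-blank ones.
nonblank = []
for ln in lines:
    if ln.strip():
        nonblank.append(ln)
```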
-
Parsing HTML
• We could treat the HTML as XML - but most HTML is not well formed enough to be truly XML
• So we end up with ad hoc parsing
• For each line look for some trigger value
• If you find your trigger value - parse out the information you want using string manipulation
-
Looking for links
[Diagram: an anchor tag annotated with its four parts: start a hyperlink, where to go, what to show, end a hyperlink]
-
<a href="http://www.dr-chuck.com/">here</a>.
0123
http://docs.python.org/lib/string-methods.html
pos = ln.find('href="')
-
<a href="http://www.dr-chuck.com/">here</a>.
0123456789
pos = ln.find('href="')
etc = ln[pos+6:]
http://www.dr-chuck.com/">here</a>.
Six characters: href="
-
etc = ln[pos+6:]
http://www.dr-chuck.com/">here</a>.
0123456789012345678901234
endpos = etc.find('"')
linktext = etc[:endpos]
endpos = 24
http://www.dr-chuck.com/
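The find-and-slice steps on the last few slides can be run end to end. This is a small Python 3 sketch. The sample line is an assumption: the anchor tag is inferred from the positions the slides report (the link found at 3, the closing quote at 24), since the tags themselves did not survive in this transcript.

```python
# Assumed sample line; the tag is inferred from the positions on the slides.
ln = '<a href="http://www.dr-chuck.com/">here</a>.'

pos = ln.find('href="')      # index where href=" starts, or -1 if absent
etc = ln[pos + 6:]           # skip the six characters of href="
endpos = etc.find('"')       # the closing quote ends the URL
linktext = etc[:endpos]

print(pos, endpos, linktext)
```

Writing len('href="') in place of the literal 6 avoids miscounting the trigger; the MySpace code later in the deck does the same thing with len(friendurl).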
-
print "* Found link at", pos
etc = ln[pos+6:]
print "Chopped off front bit", etc
endpos = etc.find('"')
print "End of link at",endpos
linktext = etc[:endpos]
print "Link text", linktext
-
for ln in lines:
    print "Looking at", ln
    pos = ln.find('href="')
    if pos > -1 :
        linktext = None
        try:
            print "* Found link at", pos
            etc = ln[pos+6:]
            print "Chopped off front bit", etc
            endpos = etc.find('"')
            print "End of link at",endpos
            if endpos > 0:
                linktext = etc[:endpos]
        except:
            print "Could not parse link",ln
        print "Link text", linktext
The final bit with a bit of paranoia in the form of a try / except block in case something goes wrong.
No need to blow up with a traceback - just move to the next line and look for a link.
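For comparison, the loop above might be packaged as a reusable function. This is a hedged Python 3 sketch, not the author's code: the name extract_links is hypothetical, the progress prints are dropped, and the try / except is omitted on the assumption that every entry in lines is a string, in which case find() and slicing do not raise. Like the slide's version, it picks up only the first link on a line.

```python
def extract_links(lines):
    """Return the first href value found on each line, in order."""
    trigger = 'href="'
    links = []
    for ln in lines:
        pos = ln.find(trigger)
        if pos == -1:
            continue  # no link on this line, move to the next one
        etc = ln[pos + len(trigger):]
        endpos = etc.find('"')
        if endpos > 0:  # skip empty or unterminated values, as the slide does
            links.append(etc[:endpos])
    return links
```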
-
python links.py
Looking at
Looking at Hello there my name is Chuck.
Looking at
Looking at
Looking at Go ahead and click on
Looking at <a href="http://www.dr-chuck.com/">here</a>.
* Found link at 3
Chopped off front bit http://www.dr-chuck.com/">here</a>.
End of link at 24
Link text http://www.dr-chuck.com/
Looking at
-
My Space
-
http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=125104617
-
0 :
    # print line
    try:
        rest = line[pos+len(friendurl):]
        print "Rest of the line", rest
        endquote = rest.find('"')
        if endquote > 0 :
            newfriend = rest[:endquote]
            print newfriend
-
if newfriend in friends :
    print "Already in list", newfriend
else :
    print "Adding friend", newfriend
    friends.append(newfriend)

# Make an empty list
friends = list()
friends.append("125104617")
friends.append("51910594")
friends.append("230923259")
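The "already in list" check is what keeps the spider from revisiting the same profile. Below is a minimal Python 3 sketch of that bookkeeping. The helper name add_friend and the ID "999" are hypothetical; the three seed IDs are the ones on the slide.

```python
def add_friend(friends, newfriend):
    """Append newfriend unless it is already listed; return True if added."""
    if newfriend in friends:
        return False
    friends.append(newfriend)
    return True


# Seed list taken from the slide.
friends = ["125104617", "51910594", "230923259"]

# Hypothetical IDs scraped from a profile page; one repeats a seed.
for found in ["51910594", "999", "999"]:
    add_friend(friends, found)
```

A list keeps the order in which friends were discovered. For a large crawl, a set kept alongside the list would make the membership test faster, at the cost of a second structure to maintain.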
-
Demo
-
Assignment 10
• Build a simple Python program to prompt for a URL, retrieve data and then print the number of lines and characters
• Add a feature to the MySpace spider to find the average age of a set of friends.
-
charles-severances-macbook-air:assn-10 csev$ python returl.py
Enter a URL:http://www.dr-chuck.com
Retrieving: http://www.dr-chuck.com
Server Data Retrieved 95256 characters and 2243 lines
Enter a URL:http://www.umich.edu/
Retrieving: http://www.umich.edu/
Server Data Retrieved 26730 characters and 361 lines
Enter a URL:http://www.pythonlearn.com/
Retrieving: http://www.pythonlearn.com/
Server Data Retrieved 95397 characters and 2241 lines
Enter a URL:
charles-severances-macbook-air:assn-10 csev$
-
Summary
• Python can easily retrieve data from the web and use its powerful string parsing capabilities to sift through the information and make sense of it
• We can build a simple directed web-spider for our own purposes
• Make sure that we do not violate the terms and conditions of a website and make sure not to use copyrighted material improperly