Scraping web pages

download Scraping web pages

of 48

Transcript of Scraping web pages

  • 8/18/2019 Scraping web pages

    1/48

    “Viewing” Web Pages

    In PythonCharles Severance - www.dr-chuck.com

  • 8/18/2019 Scraping web pages

    2/48

    What is Web Scraping?

    • When a program or script pretends to be a browser and retrievesweb pages, looks at those web pages, extracts information and thenlooks at more web pages.

    http://en.wikipedia.org/wiki/Web_scraping

  • 8/18/2019 Scraping web pages

    3/48

    ServerGET

    HTML

    GET

    HTML

  • 8/18/2019 Scraping web pages

    4/48

    Why Scrape?

    • Pull data - particularly social data - who links to who?

    • Get your own data back out of some system that has no “exportcapability”

    • Monitor a site for new information

  • 8/18/2019 Scraping web pages

    5/48

    Scraping Web Pages

    • There is some controversy about web page scraping and some sitesare a bit snippy about it.

    • Google: facebook scraping block 

    • Republishing copyrighted information is not allowed

    • Violating terms of service is not allowed

  • 8/18/2019 Scraping web pages

    6/48

    http://www.facebook.com/terms.php

  • 8/18/2019 Scraping web pages

    7/48

    http://www.myspace.com/index.cfm?fuseaction=misc.terms

    Looks like a loophole... So we will play a bit with MySpace - be respectful - look but never touch..

  • 8/18/2019 Scraping web pages

    8/48

    Web ProtocolsThe Request / Response Cycle

  • 8/18/2019 Scraping web pages

    9/48

    Web Standards

    • HTML - HyperText Markup Language - a way to describe how pagesare supposed to look and act in a web browser

    • HTTP - HyperText Transport Protocol - how your Browsercommunicates with a web server to get more HTML pages and senddata to the web server

  • 8/18/2019 Scraping web pages

    10/48

    What is so “Hyper”about the web?

    • If you think of the whole web as a“space” like “outer space”

    • When you click on a link at one point inspace - you instantly “hyper-transport”to another place in the space

    • It does not matter how far apart thetwo web pages are

  • 8/18/2019 Scraping web pages

    11/48

    HTML - View Source

    •The hard way to learn HTML is tolook at the source to many web pages.

    • There are lots of less than < andgreater than > signs

    • Buying a good book is much easier

    http://www.sitepoint.com/books/html1/

  • 8/18/2019 Scraping web pages

    12/48

  • 8/18/2019 Scraping web pages

    13/48

  • 8/18/2019 Scraping web pages

    14/48

    HTML - Brief tutorial

    Start a hyperlink Where to go What to show End a hyperlink  

  • 8/18/2019 Scraping web pages

    15/48

  • 8/18/2019 Scraping web pages

    16/48

    HyperText Transport Protocol

    • HTTP describes how your browser talks to a web server to get thenext page.

    • That next page will use HTML

    • The way the pages are retrieved is HTTP

  • 8/18/2019 Scraping web pages

    17/48

  • 8/18/2019 Scraping web pages

    18/48

    Getting Data From The Server

    • Each time the user clicks on an anchor tag with an href= value toswitch to a new page, the browser makes a connection to the webserver and issues a “GET” request - to GET the content of the page atthe specified URL

    • The server returns the HTML document to the Browser whichformats and displays the document to the user.

  • 8/18/2019 Scraping web pages

    19/48

    HTTP Request / Response Cycle

    http://www.oreilly.com/openbook/cgi/ch04_02.html

    Browser

    Web Server

    HTTPRequest HTTPResponse

    Internet Explorer,FireFox, Safari, etc.

  • 8/18/2019 Scraping web pages

    20/48

    HTTP Request / Response Cycle

    GET /index.html

    Accept: www/sourceAccept: text/htmlUser-Agent: Lynx/2.4 libwww/2.14

    http://www.oreilly.com/openbook/cgi/ch04_02.html

    Browser

    Web Server

    HTTPRequest HTTPResponse

    ..

    Welcome to myapplication

     ....

  • 8/18/2019 Scraping web pages

    21/48

    “Hacking” HTTP

    Last login: Wed Oct 10 04:20:19 on ttyp2si-csev-mbp:~ csev$ telnet www.umich.edu 80Trying 141.211.144.188...Connected to www.umich.edu.Escape character is '^]'.GET /

     ....

    HTTPRequest

    HTTPResponse

    Browser

    Web Server

  • 8/18/2019 Scraping web pages

    22/48

  • 8/18/2019 Scraping web pages

    23/48

    HTML and HTTP in Python

  • 8/18/2019 Scraping web pages

    24/48

    Using urllib to retrieve web pages

  • 8/18/2019 Scraping web pages

    25/48

    http://docs.python.org/lib/module-urllib.html

  • 8/18/2019 Scraping web pages

    26/48

    • You get the entire web page when you do f.read() - lines are separatedby a “newline” character “\n”

  • 8/18/2019 Scraping web pages

    27/48

    • You get the entire web page when you do f.read() - lines are separatedby a “newline” character “\n”

    • We can split the contents into lines using the split() function

    \n\n

    \n\n

  • 8/18/2019 Scraping web pages

    28/48

    • Splitting the contents on thenewline character gives use anice list where each entry is asingle line

    • We can easily write a for loopto look through the lines

    >>> print len(contents)95328>>> lines = contents.split("\n")>>> print len(lines)2244>>> print lines[3]

    >>>

    for ln in lines:  # Do something for each line

  • 8/18/2019 Scraping web pages

    29/48

    Parsing HTML

    • We could treat the HTML as XML - but most HTML is not wellformed enough to be truly XML

    • So we end up with ad hoc parsing

    • For each line look for some trigger value

    •If you find your trigger value - parse out the information you wantusing string manipulation

  • 8/18/2019 Scraping web pages

    30/48

    Looking for links

    Start a hyperlink Where to go What to show End a hyperlink  

  • 8/18/2019 Scraping web pages

    31/48

  • 8/18/2019 Scraping web pages

    32/48

    here.

    0123

    http://docs.python.org/lib/string-methods.html

      pos = ln.find('href="')

  • 8/18/2019 Scraping web pages

    33/48

    0123

    here.  456789

      pos = ln.find('href="')

      etc = ln[pos+6:]

    http://www.dr-chuck.com/">here.

    Six characters

  • 8/18/2019 Scraping web pages

    34/48

    etc = ln[pos+6:]

    http://www.dr-chuck.com/">here.

    0123456789012345678901234

    endpos = etc.find(‘“‘)

    linktext = etc[:endpos]

    endpos = 24

    http://www.dr-chuck.com/

  • 8/18/2019 Scraping web pages

    35/48

      print "* Found link at", pos

      etc = ln[pos+6:]  print "Chopped off front bit", etc  endpos = etc.find('"')  print "End of link at",endpos

      linktext = etc[:endpos]  print "Link text", linktext

  • 8/18/2019 Scraping web pages

    36/48

    Looking at

  • 8/18/2019 Scraping web pages

    37/48

    for ln in lines:  print "Looking at", ln  pos = ln.find('href="')  if pos > -1 :

    linktext = None  try:  print "* Found link at", pos

      etc = ln[pos+6:]  print "Chopped off front bit", etc  endpos = etc.find('"')  print "End of link at",endpos  if endpos > 0:  linktext = etc[:endpos]

      except:  print "Could not parse link",ln  print "Link text", linktext

    The final bit with a bit ofparanoia in the form of a

    try / except block in casesomething goes wrong.

    No need to blow up with atraceback - just move to thenext line and look for a link.

  • 8/18/2019 Scraping web pages

    38/48

    python links.py

    Looking at

    Looking at Hello there my name is Chuck.Looking at

    Looking at

    Looking at Go ahead and click onLooking at here.

    * Found link at 3Chopped off front bit http://www.dr-chuck.com/">here.End of link at 24Link text http://www.dr-chuck.com/Looking at

  • 8/18/2019 Scraping web pages

    39/48

    My Space

  • 8/18/2019 Scraping web pages

    40/48

  • 8/18/2019 Scraping web pages

    41/48

    http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=125104617

  • 8/18/2019 Scraping web pages

    42/48

  • 8/18/2019 Scraping web pages

    43/48

       0 :  # print line  try:  rest = line[pos+len(friendurl):]  print "Rest of the line", rest  endquote = rest.find('"')

      if endquote > 0 :  newfriend = rest[:endquote]  print newfriend

  • 8/18/2019 Scraping web pages

    44/48

      if newfriend in friends :  print "Already in list", newfriend  else :  print "Adding friend", newfriend

      friends.append(newfriend)

    # Make an empty listfriends = list()friends.append("125104617")friends.append("51910594")friends.append("230923259")

  • 8/18/2019 Scraping web pages

    45/48

    Demo

  • 8/18/2019 Scraping web pages

    46/48

    Assignment 10

    • Build a simple Python program to prompt for a URL, retrieve data andthen print the number of lines and characters

    • Add a feature to the myspace spider to find the average age of a set offriends.

  • 8/18/2019 Scraping web pages

    47/48

    charles-severances-macbook-air:assn-10 csev$ python returl.pyEnter a URL:http://www.dr-chuck.comRetrieving: http://www.dr-chuck.comServer Data Retrieved 95256 characters and 2243 linesEnter a URL:http://www.umich.edu/Retrieving: http://www.umich.edu/Server Data Retrieved 26730 characters and 361 linesEnter a URL:http://www.pythonlearn.com/Retrieving: http://www.pythonlearn.com/Server Data Retrieved 95397 characters and 2241 linesEnter a URL:

    charles-severances-macbook-air:assn-10 csev$

  • 8/18/2019 Scraping web pages

    48/48

    Summary

    •Python can easily retrieve data from the web and use its powerful

    string parsing capabilities to sift through the information and makesense of the information

    • We can build a simple directed web-spider for our own purposes

    • Make sure that we do not violate the terms and conditions of a webseit and make sure not to use copyrighted material improperly