Web scraping in python

Web Scraping with Python Virendra Rajput, Hacker @Markitty

description

A getting-started guide to web scraping with Python, presented at Dev Fest, Google Developers Group Pune.

Transcript of Web scraping in python

Page 1: Web scraping in python

Web Scraping with Python

Virendra Rajput,

Hacker @Markitty

Page 2: Web scraping in python

Agenda

● What is scraping
● Why we scrape
● My experiments with web scraping
● How do we do it
● Tools to use
● Online demo
● Some more tools
● Ethics for scraping

Page 3: Web scraping in python

scraping: converting unstructured documents into structured information

Page 4: Web scraping in python

What is Web Scraping?

● Web scraping (web harvesting) is a software technique for extracting information from websites

● It focuses on transforming unstructured data on the web (typically HTML) into structured data that can be stored and analyzed

Page 5: Web scraping in python

RSS is metadata, not a replacement for HTML

Page 6: Web scraping in python

Why we scrape?

● Web pages contain a wealth of information (in text form), designed mostly for human consumption
● Static websites (legacy systems)
● Interfacing with 3rd parties that offer no API access
● Websites are more important than APIs
● The data is already available (in the form of web pages)
● No rate limiting
● Anonymous access

Page 7: Web scraping in python

How search engines use it

Page 8: Web scraping in python

My Experiments with Scraping

Page 10: Web scraping in python

Getting started!

Page 11: Web scraping in python

Fetching the data

● Involves finding the endpoint: the URL or URLs
● Sending HTTP requests to the server
● Using the requests library:

import requests

# Fetch the page; .content holds the raw response body as bytes
data = requests.get('http://google.com/')
html = data.content

Page 12: Web scraping in python

Processing (say no to Reg-ex)

● Avoid using reg-ex
● Reasons why not to use it:
  1. It's fragile
  2. Really hard to maintain
  3. Improper HTML & encoding handling
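To make the fragility point concrete, here is a minimal sketch that compares a regex against a real parser on two renderings of the same link. It uses the standard library's html.parser as a stand-in for any proper HTML parser; the sample markup, the pattern, and the names LinkCollector and links_with_parser are all assumptions made up for illustration.

```python
import re
from html.parser import HTMLParser

# Two renderings of the same link; only attribute order and quoting differ.
PAGE_V1 = '<a href="/about" class="nav">About</a>'
PAGE_V2 = "<a class='nav' href='/about'>About</a>"

# A regex written against the first rendering only (hypothetical pattern).
LINK_RE = re.compile(r'<a href="([^"]+)"')


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, whatever the attribute order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_with_parser(html):
    """Return all link targets found by the parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


regex_v1 = LINK_RE.findall(PAGE_V1)      # the regex matches the layout it was written for
regex_v2 = LINK_RE.findall(PAGE_V2)      # and silently finds nothing after a harmless markup change
parser_v1 = links_with_parser(PAGE_V1)   # the parser finds the link in both versions
parser_v2 = links_with_parser(PAGE_V2)
```

The regex does not raise an error on the second page; it just returns an empty list, which is what makes this kind of breakage hard to notice and maintain.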

Page 13: Web scraping in python

Use BeautifulSoup for parsing

● Provides simple methods to:
  ○ search
  ○ navigate
  ○ select
● Deals with broken web-pages really well
● Auto-detects encoding

Philosophy: “You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help.”
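The three kinds of methods listed above (search, navigate, select) might look like this in practice. This is a rough sketch assuming the bs4 package is installed; the HTML snippet, its id and class names, and the variable names are invented for the example, and the missing closing </li> tags are deliberate to imitate a sloppy page.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the <li> elements are never closed on purpose.
HTML = """
<html><body>
<h1>Talks</h1>
<ul id="talks">
  <li class="talk"><a href="/talks/1">Web scraping</a>
  <li class="talk"><a href="/talks/2">Testing scrapers</a>
</ul>
</body></html>
"""

# "html.parser" is the parser bundled with Python, so no extra parser is needed.
soup = BeautifulSoup(HTML, "html.parser")

# search: find the first match, or every match
first_link = soup.find("a")
all_links = soup.find_all("a")

# navigate: move from a tag to its parent and read an attribute
parent_class = first_link.parent["class"]

# select: use a CSS selector
titles = [a.get_text(strip=True) for a in soup.select("ul#talks li.talk a")]
hrefs = [a["href"] for a in all_links]
```

When the input comes from the network, passing the raw bytes (for example the .content of a requests response) lets Beautiful Soup attempt encoding detection itself.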

Page 14: Web scraping in python

Export the data

● Database (relational or non-relational)
● CSV
● JSON
● File (XML, YAML, etc.)
● API
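As an illustration of two of the export targets above, the following sketch serializes scraped records to CSV and JSON with the standard library only. The records, field names, and helper names (to_csv, to_json) are hypothetical; a real scraper would write to a file or a database instead of building strings in memory.

```python
import csv
import io
import json

# Hypothetical scraped records: one dict per item.
rows = [
    {"title": "Web scraping", "url": "/talks/1"},
    {"title": "Testing scrapers", "url": "/talks/2"},
]


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize a list of dicts to pretty-printed JSON text."""
    return json.dumps(records, indent=2, ensure_ascii=False)


csv_text = to_csv(rows, ["title", "url"])
json_text = to_json(rows)
```

To write to disk instead, the same csv.DictWriter can be pointed at a file opened with newline="" and an explicit encoding.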

Page 15: Web scraping in python

Live example demo

Page 16: Web scraping in python

Challenges

● External sites can change without warning
  ○ Figuring out the frequency is difficult (TEST, and test)
  ○ Changes can break scrapers easily
● Bad HTTP status codes
  ○ example: using 200 OK to signal an error
  ○ cannot always trust your HTTP library's default behaviour
● Messy HTML markup
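One way to guard against the "200 OK that is really an error" case is to inspect the body as well as the status code. The sketch below is a crude heuristic under stated assumptions: the marker phrases and the function name looks_like_error are invented for illustration and would need tuning for each site.

```python
# Hypothetical phrases a site might print on an error page served with 200 OK.
ERROR_MARKERS = ("page not found", "something went wrong", "access denied")


def looks_like_error(status_code, body, markers=ERROR_MARKERS):
    """Return True when a response should be treated as a failure.

    A non-2xx status is always a failure; a 2xx status still counts as a
    failure when the body is empty or contains a known error phrase.
    """
    if not 200 <= status_code < 300:
        return True
    text = body.lower()
    if not text.strip():
        return True
    return any(marker in text for marker in markers)
```

With the requests library, this could be called as looks_like_error(data.status_code, data.text). Note that requests does not raise on 4xx or 5xx responses by default; that only happens if raise_for_status() is called, which is one example of a library default worth checking. A legitimate page that happens to contain one of the marker phrases would be misclassified, so treat this as a starting point only.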

Page 17: Web scraping in python

Mechanize

● Stateful web-browsing with mechanize
  ○ Fill in forms
  ○ Follow links
  ○ Handle cookies
  ○ Browse history
● Modeled after Andy Lester’s WWW::Mechanize (Perl)

Page 18: Web scraping in python

Filling forms with Mechanize

Page 19: Web scraping in python

Scrapy - a framework for web scraping

● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
  ○ define a model to store items
  ○ create your spider to extract items
  ○ write a Pipeline to store them

Page 20: Web scraping in python

Conclusion

● Scrape wisely
● Do not steal
● Use the cloud
● Share your scrapers: scraperwiki.com

Page 21: Web scraping in python

The End!

Virendra Rajput

http://virendra.me/
http://twitter.com/bkvirendra