Post on 30-Aug-2014
Web Scraping with Python
Virendra Rajput,
Hacker @Markitty
Agenda
● What is scraping
● Why we scrape
● My experiments with web scraping
● How do we do it
● Tools to use
● Online demo
● Some more tools
● Ethics for scraping
scraping:
converting unstructured documents into structured information
What is Web Scraping?
● Web scraping (web harvesting) is a software technique of extracting information from websites
● It focuses on transforming unstructured data on the web (typically HTML) into structured data that can be stored and analyzed
RSS is metadata, not an HTML replacement
Why we scrape?
● Web pages contain a wealth of information (in text form), designed mostly for human consumption
● Static websites (legacy systems)
● Interfacing with 3rd parties with no API access
● Websites are more important than APIs
● The data is already available (in the form of web pages)
● No rate limiting
● Anonymous access
How search engines use it
My Experiments with Scraping
and more..!
IMDb API
Did you mean!
Facebook Bot for Brahma Kumaris
Getting started!
Fetching the data
● Involves finding the endpoint - a URL or URLs
● Sending HTTP requests to the server
● Using the requests library:
import requests

data = requests.get('http://google.com/')
html = data.content
Processing (say no to Reg-ex)
● Avoid using reg-ex
● Reasons why not to use it:
  1. It's fragile
  2. Really hard to maintain
  3. Improper HTML & encoding handling
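A tiny illustration of the fragility point, using only the standard library. The pattern below (a hypothetical one, made up for this sketch) matches one exact way of writing a link and silently misses the same link written slightly differently - the kind of markup change a real parser would shrug off:

```python
import re

# A regex that matches <a href="..."> in exactly one attribute style.
pattern = re.compile(r'<a href="([^"]+)">')

# Works for the markup it was written against:
assert pattern.findall('<a href="/about">About</a>') == ["/about"]

# Same link, but with an extra attribute and single quotes --
# the regex silently finds nothing.
assert pattern.findall("<a class='nav' href='/about'>About</a>") == []
```

Every such variation needs another special case in the pattern, which is why the maintenance cost grows so quickly.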
Use BeautifulSoup for parsing
● Provides simple methods to:
  ○ search
  ○ navigate
  ○ select
● Deals with broken web pages really well
● Auto-detects encoding
Philosophy: "You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help."
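A minimal sketch of that philosophy in action (assumes `beautifulsoup4` is installed via pip). The HTML below is deliberately broken - a missing `</li>` - and BeautifulSoup copes without complaint:

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: the first <li> is never closed.
html = """
<ul class="links">
  <li><a href="/a">First</a>
  <li><a href="/b">Second</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() searches the whole tree; indexing pulls attributes out.
links = [a["href"] for a in soup.find_all("a")]
print(links)  # ['/a', '/b']
```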
Export the data
● Database (relational or non-relational)
● CSV
● JSON
● File (XML, YAML, etc.)
● API
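The CSV and JSON options need nothing beyond the standard library. A hedged sketch - `rows` stands in for whatever your scraper actually extracted:

```python
import csv
import json

# Stand-in for scraped results.
rows = [
    {"title": "Example", "url": "http://example.com/"},
    {"title": "Another", "url": "http://example.org/"},
]

# JSON: one self-describing file.
with open("items.json", "w") as f:
    json.dump(rows, f, indent=2)

# CSV: flat and spreadsheet-friendly.
with open("items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```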
Live example demo
Challenges
● External sites can change without warning
  ○ Figuring out the frequency is difficult (test, and test again)
  ○ Changes can break scrapers easily
● Bad HTTP status codes
  ○ example: using 200 OK to signal an error
  ○ you cannot always trust your HTTP library's default behaviour
● Messy HTML markup
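One way to defend against the "200 OK that is really an error" problem is to validate the body as well as the status code. A sketch under stated assumptions - `FakeResponse` is a stand-in for whatever object your HTTP library returns; the check itself is what matters:

```python
class FakeResponse:
    """Stand-in for an HTTP library's response object."""
    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

def looks_valid(response, marker):
    """Reject bad status codes AND 200s whose body is really an error page."""
    if response.status_code != 200:
        return False
    # 'marker' is some string you know appears on a genuine page.
    return marker in response.text

good = FakeResponse(200, '<div id="price">42</div>')
bad = FakeResponse(200, "Oops, something went wrong")  # 200 used to signal an error

assert looks_valid(good, 'id="price"')
assert not looks_valid(bad, 'id="price"')
```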
Mechanize
● Stateful web browsing with mechanize
  ○ Fill in forms
  ○ Follow links
  ○ Handle cookies
  ○ Browse history
● Named after Andy Lester's WWW::Mechanize (Perl)
Filling forms with Mechanize
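A hedged sketch of the form-filling flow - the function and its parameter names are hypothetical, and it needs `pip install mechanize` plus network access to actually run, so the import is deferred until the function is called:

```python
def fill_form(url, form_name, field, value):
    """Open a page, fill one form field, and submit it (illustrative sketch).

    Requires the third-party mechanize package and network access,
    so the import is deferred until the function is actually used.
    """
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # think twice in real use; see the ethics slide
    br.open(url)                 # mechanize keeps cookies and history for you
    br.select_form(name=form_name)
    br[field] = value            # forms behave like dictionaries
    return br.submit()
```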
Scrapy - a framework for web scraping
● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
  ○ define a model to store items
  ○ create your spider to extract items
  ○ write a Pipeline to store them
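To give a feel for XPath-style selection without installing Scrapy: the standard library's ElementTree supports only a small XPath subset (Scrapy's selectors support far more), but it is enough to show the idea:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body>'
    '<div class="item"><a href="/a">First</a></div>'
    '<div class="item"><a href="/b">Second</a></div>'
    '</body></html>'
)

# XPath-subset query: every <a> under a <div class="item">, anywhere in the tree.
hrefs = [a.get("href") for a in doc.findall('.//div[@class="item"]/a')]
print(hrefs)  # ['/a', '/b']
```

Scrapy's real selectors accept full XPath expressions (and CSS selectors) and are run from its interactive shell, which makes trying queries against a live page quick.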
Conclusion
● Scrape wisely
● Do not steal
● Use the cloud
● Share your scrapers: scraperwiki.com
The End!
Virendra Rajput
http://virendra.me/
http://twitter.com/bkvirendra