Tutorial on Web Scraping in Python

Scraping Data from the Web using Scrapy & Beautiful Soup Nithish Raghunandanan [email protected] PyData Munich | 8th November 2017

Transcript of Tutorial on Web Scraping in Python

Page 1: Tutorial on Web Scraping in Python

Scraping Data from the Web using Scrapy & Beautiful Soup

Nithish Raghunandanan

[email protected]

PyData Munich | 8th November 2017

Page 2: Tutorial on Web Scraping in Python

About Me

● MSc Informatics student at the Technical University of Munich

○ Focus on Data Science & Software Engineering

● Student Employee at KI labs, part of KI Group

● Love to play with different technologies

● Connect

■ nithishr1

@nithishr

Page 3: Tutorial on Web Scraping in Python

What is Scraping?

● Extract data from web pages

● Store the data in structured formats

● Useful when the data is not available directly or via APIs
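
As a minimal sketch of the extract-then-store idea above, the snippet below parses a small inline HTML fragment with Beautiful Soup and writes the result as CSV. The markup, the class names (`business`, `name`), the example.com addresses and the helper names `extract_rows` and `to_csv` are all assumptions made for illustration; a real page would first have to be downloaded and would have its own structure.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical listing markup, standing in for a downloaded page.
HTML = """
<ul>
  <li class="business">
    <span class="name">Cafe Anna</span>
    <a href="mailto:anna@example.com">Mail</a>
  </li>
  <li class="business">
    <span class="name">Bike Shop</span>
    <a href="mailto:bikes@example.com">Mail</a>
  </li>
</ul>
"""


def extract_rows(html):
    """Pull one dict per listing out of the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("li.business"):
        name = item.select_one("span.name").get_text(strip=True)
        href = item.select_one('a[href^="mailto:"]')["href"]
        # Drop the "mailto:" scheme to keep only the address.
        rows.append({"name": name, "email": href[len("mailto:"):]})
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text (a structured format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "email"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(HTML)
csv_text = to_csv(rows)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same writer could target a file opened with `newline=""`.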

Page 4: Tutorial on Web Scraping in Python

Use Cases

Page 5: Tutorial on Web Scraping in Python

Tools for Scraping

● Scrapy

○ Python framework to extract data from web pages

● Beautiful Soup

○ Python library to parse HTML/XML documents

● Alternatives

○ Selenium

○ Requests

○ Octoparse

Page 6: Tutorial on Web Scraping in Python
Page 7: Tutorial on Web Scraping in Python

Scraping 101

● Spider

○ A bot that downloads web pages

● robots.txt

○ A file on the server that specifies access limits for bots
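
A bot can check these access limits before downloading anything. The sketch below uses Python's standard-library `urllib.robotparser` on an inline robots.txt; the file contents, the user agent name `my-bot` and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL.
allowed = parser.can_fetch("my-bot", "https://example.com/listings")
blocked = parser.can_fetch("my-bot", "https://example.com/private/data")

# Delay in seconds requested by the site, or None if not specified.
delay = parser.crawl_delay("my-bot")
```

Scrapy can do this check itself when the `ROBOTSTXT_OBEY` setting is enabled, so the manual version is mainly useful with plain Requests-style scripts.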

Page 8: Tutorial on Web Scraping in Python

Pitfalls in Crawling

● JavaScript-heavy websites

○ Splash plugin

○ Selenium

● Default settings are not very friendly to website owners

○ Built-in AutoThrottle extension

● Captchas
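
One way to make a Scrapy project gentler on website owners is to adjust its `settings.py`. The setting names below are real Scrapy settings, but the values and the user agent string are illustrative assumptions to be tuned per site, not recommendations from the talk.

```python
# settings.py (sketch): politeness-related Scrapy settings.
# All values are illustrative assumptions.

# Respect the access limits published in robots.txt.
ROBOTSTXT_OBEY = True

# Let the AutoThrottle extension adapt the delay to server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per server

# Fixed lower bound between requests and a cap on parallelism.
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Identify the bot so site owners can get in touch (placeholder value).
USER_AGENT = "example-bot (+https://example.com/contact)"
```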

Page 9: Tutorial on Web Scraping in Python

Why Yellow Pages?

Email Marketing for Customer Acquisition

Page 10: Tutorial on Web Scraping in Python

Email Marketing for Customer Acquisition

Initial Approach

● Buy Email Lists

● Send via third parties

● Poor Quality

○ Non-transparent

○ Generic emails

● Expensive

Crawling

● Scrapy + Beautiful Soup

● Over 500k Emails

● Quality Improvement

○ Categorized into segments

○ Targeted emails

● Cheap

Page 11: Tutorial on Web Scraping in Python

nithishr1

@nithishr

[email protected]

Connect

Nithish Raghunandanan

www.ki-labs.com

Page 12: Tutorial on Web Scraping in Python

Resources

● Scrapy Guide

○ https://doc.scrapy.org/en/latest/intro/tutorial.html

● Beautiful Soup Guide

○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/

● Crawling Etiquette

○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/

● Code

○ https://github.com/nithishr/meetup_scraping