Tutorial on Web Scraping in Python
Scraping Data from the Web using Scrapy & Beautiful Soup
Nithish Raghunandanan
PyData Munich | 8th November 2017
About Me
● MSc. Informatics Student at the Technical University of Munich
○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
■ nithishr1
@nithishr
What is Scraping?
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
Use Cases
Tools for Scraping
● Scrapy
○ Python framework to extract data from web pages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
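To make the Beautiful Soup bullet concrete, here is a minimal sketch of parsing HTML and storing the result in a structured format (CSV). The markup, class names, and the `extract_listings` and `to_csv` helpers are hypothetical and chosen only for illustration; a real page will need its own selectors.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical listing markup; real pages will use different tags and classes.
HTML = """
<ul id="listings">
  <li class="listing"><span class="name">Cafe Alpha</span>
      <a href="mailto:info@alpha.example">Email</a></li>
  <li class="listing"><span class="name">Bakery Beta</span>
      <a href="mailto:hello@beta.example">Email</a></li>
</ul>
"""


def extract_listings(html):
    """Return one dict per listing with its name and email address."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("li.listing"):
        name = item.select_one("span.name").get_text(strip=True)
        link = item.select_one('a[href^="mailto:"]')
        # Strip the "mailto:" scheme; leave the field empty if no link exists.
        email = link["href"][len("mailto:"):] if link else ""
        rows.append({"name": name, "email": email})
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "email"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_listings(HTML)
csv_text = to_csv(rows)
```

In a Scrapy project, the same parsing function could be called from a spider callback on the response body, with Scrapy handling the downloading and scheduling.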
Scraping 101
● Spider
○ A bot that downloads web pages
● robots.txt
○ File present on the server specifying access limits to bots
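As a rough illustration of how a bot can honor robots.txt, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `example-bot` user agent, and the URLs are made up for the example; a real crawler would fetch the file from the target server instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """Return True if the parsed rules permit this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


public_ok = allowed("https://www.example.com/listings")
private_ok = allowed("https://www.example.com/private/data")
delay = parser.crawl_delay("example-bot")
```

Scrapy can perform this check itself when its `ROBOTSTXT_OBEY` setting is enabled, so a manual check like this is mainly relevant for hand-written crawlers.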
Pitfalls in Crawling
● JavaScript-heavy websites
○ Splash plugin
○ Selenium
● Default settings not too friendly to website owners
○ Built-in AutoThrottle extension
● Captchas
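The throttling point above might translate into a Scrapy `settings.py` fragment along these lines. The setting names are Scrapy's own; the values and the user agent string are illustrative assumptions, not recommendations from the talk, and should be tuned per site.

```python
# settings.py fragment for a Scrapy project (values are illustrative assumptions)

# Respect the access rules published in the site's robots.txt.
ROBOTSTXT_OBEY = True

# Let the AutoThrottle extension adapt the delay to server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per server

# Baseline politeness limits; AutoThrottle never goes below DOWNLOAD_DELAY.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Identify the bot so site owners can get in touch (hypothetical contact URL).
USER_AGENT = "example-bot (+https://www.example.com/contact)"
```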
Why Yellow Pages? Email Marketing for Customer Acquisition
Initial Approach
● Buy Email Lists
● Send via 3rd Parties
● Poor Quality
○ Non-transparent
○ Generic emails
● Expensive
Crawling
● Scrapy + Beautiful Soup
● Over 500k Emails
● Quality Improvement
○ Categorized into segments
○ Targeted emails
● Cheap
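The segmentation step could look roughly like the sketch below, which pulls email addresses out of scraped text and groups them by business category. This is not the speaker's code (that is in the repository listed under Resources); the regular expression is a deliberately simple approximation, and the `find_emails` and `group_by_segment` helpers and the sample records are hypothetical.

```python
import re
from collections import defaultdict

# Simple approximation of an email pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return unique, lower-cased addresses in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.append(email)
    return seen


def group_by_segment(records):
    """Map each category to the de-duplicated emails found in its texts."""
    segments = defaultdict(list)
    for category, text in records:
        for email in find_emails(text):
            if email not in segments[category]:
                segments[category].append(email)
    return dict(segments)


# Hypothetical (category, scraped text) pairs standing in for crawled listings.
records = [
    ("restaurants", "Contact: Info@alpha.example or info@alpha.example"),
    ("bakeries", "Write to hello@beta.example."),
    ("restaurants", "Reservations: book@gamma.example"),
]
segments = group_by_segment(records)
```

Keeping the category next to each address is what allows targeted rather than generic emails later on.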
Connect
Nithish Raghunandanan
nithishr1
@nithishr
www.ki-labs.com
Resources
● Scrapy Guide
○ https://doc.scrapy.org/en/latest/intro/tutorial.html
● Beautiful Soup Guide
○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Crawling Etiquette
○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
● Code
○ https://github.com/nithishr/meetup_scraping