Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web...

34
Web scraping with Scrapy

Transcript of Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web...

Page 1: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Web scrapingwith Scrapy

Page 2: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Web crawlera program that systematically browses the web

Page 3: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Web crawlerstarts with list of URLs to visit (seeds)

Page 4: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Web crawleridentifies links, adds them to list of URLs to visit

Page 5: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Uses

Scraping

Archiving

Indexing

Validation

Spambots

Page 6: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Crawling policy and robots.txt

Selection policy

Revisit policy

Politeness policy

Parallelization policy

Page 7: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Examples

Page 8: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Scrapyweb scraping framework for Python

Page 9: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

pip install scrapy

Page 10: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

scrapy startproject tutorial

tutorial├── scrapy.cfg└── tutorial ├── __init__.py ├── items.py ├── pipelines.py ├── settings.py └── spiders └── __init__.py

Creating a project

Page 11: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

cd tutorial/tutorial/spiderstouch bwog_spider.pyopen bwog_spider.py

Creating a spider

Page 12: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

import scrapy

class BwogSpider(scrapy.Spider):name = 'bwog'allowed_domains = ['bwog.com']start_urls = ['http://bwog.com/']

def parse(self, response):print 'Testing ' + response.url

Creating a spider

Page 13: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

cd ../..scrapy crawl bwog

Running the spider

Page 14: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Structure of a document

Page 15: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

XPathLanguage for selecting nodes in XML documents

Page 16: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

CSS SelectorsSelects nodes based on their CSS style

Each stylesheet rule has a selector pattern that matches a set of HTML elements

Page 17: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Obtaining the XPathChrome Developer Tools

InspectCopy > Copy XPath

Firebug HTML Panel

Inspect Element with FirebugCopy XPath

Page 18: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Using XPath selectors

//div[@class="blog-section"]

Page 19: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

import scrapy

class BwogSpider(scrapy.Spider):name = 'bwog'allowed_domains = ['bwog.com']start_urls = ['http://bwog.com/']

def parse(self, response):for section in response.xpath('//div[@class="blog-section"]'):

title = section.xpath('.//a/text()').extract()[0]print title

Using XPath selectors

Page 20: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

scrapy crawl bwog

Junior Phi Beta Kappa Inductees AnnouncedUntil Next Year, ColumbiaOverseen: Historical Revisionism On The Columbia Facebook PageBwog’s Resolutions For The New YearBwog In Bed: New Level EditionActual Wisdom: Philip ProtterBwog PoetryBunsen Bwog: If This Is The Apocalypse, At Least We’ll Find LoveBwog In Bed: Is This The Real Life? EditionField Notes: Barely Hanging On EditionActual Wisdom: J.C. SalyerShit Our Notes Say“Baking” With Bwog: The Many Uses of OreganoBwog In Bed: Be Your Own Hero EditionDark Night Of The (Final) Soul

Running the spider

Page 21: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

import scrapy

class BwogSpider(scrapy.Spider):name = 'bwog'allowed_domains = ['bwog.com']start_urls = ['http://bwog.com/']

def parse(self, response):for section in response.xpath('//div[@class="blog-section"]'):

link = section.xpath('.//a/@href').extract()[0]yield scrapy.Request(link, callback=self.parse_entry)

def parse_entry(self, response):print response.url

Scrapy Requests

Page 22: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

scrapy crawl bwog

http://bwog.com/2016/01/06/junior-phi-beta-kappa-inductees-announced/http://bwog.com/2015/12/23/until-next-year-columbia/http://bwog.com/2015/12/23/overseen-historical-revisionism-on-the-columbia-facebook-page/http://bwog.com/2015/12/23/bwogs-resolutions-for-the-new-year/http://bwog.com/2015/12/23/bwog-in-bed-new-level-edition/http://bwog.com/2015/12/22/actual-wisdom-philip-protter/http://bwog.com/2015/12/22/bwog-poetry/http://bwog.com/2015/12/22/bunsen-bwog-if-this-is-the-apocalypse-at-least-well-find-love/http://bwog.com/2015/12/22/bwog-in-bed-is-this-the-real-life-edition/http://bwog.com/2015/12/21/field-notes-barely-hanging-on-edition/http://bwog.com/2015/12/21/actual-wisdom-j-c-salyer/http://bwog.com/2015/12/21/shit-our-notes-say/http://bwog.com/2015/12/21/baking-with-bwog-the-many-uses-of-oregano/http://bwog.com/2015/12/21/bwog-in-bed-be-your-own-hero-edition/http://bwog.com/2015/12/20/dark-night-of-the-final-soul/

Running the spider

Page 23: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Extracting comments

//div[@class="reg-comment-body "]

Page 24: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

import scrapy

class BwogSpider(scrapy.Spider):name = 'bwog'allowed_domains = ['bwog.com']start_urls = ['http://bwog.com/']

def parse(self, response):for section in response.xpath('//div[@class="blog-section"]'):

link = section.xpath('.//a/@href').extract()[0]yield scrapy.Request(link, callback=self.parse_entry)

def parse_entry(self, response):for comment in response.xpath('//div[@class="reg-comment-body "]'):

paragraphs = comment.xpath('.//text()').extract()text = '\n'.join(paragraphs).strip()print text

Extracting comments

Page 25: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

open tutorial/items.py

Scrapy Items

Page 26: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

import scrapy

class TutorialItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() content = scrapy.Field()

Scrapy Items

Page 27: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

import scrapyfrom tutorial.items import TutorialItem

class BwogSpider(scrapy.Spider):name = 'bwog'allowed_domains = ['bwog.com']start_urls = ['http://bwog.com/']

def parse(self, response):for section in response.xpath('//div[@class="blog-section"]'):

link = section.xpath('.//a/@href').extract()[0]yield scrapy.Request(link, callback=self.parse_entry)

def parse_entry(self, response):for comment in response.xpath('//div[@class="reg-comment-body "]'):

paragraphs = comment.xpath('.//text()').extract()text = '\n'.join(paragraphs).strip()item = TutorialItem()item['content'] = textyield item

Extracting comments

Page 28: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Storing commentsscrapy crawl bwog -o comments.json

Page 29: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Following links

//div[@class="comnt-btn"]//@href

Page 30: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

import scrapyfrom tutorial.items import TutorialItem

class BwogSpider(scrapy.Spider):name = 'bwog'allowed_domains = ['bwog.com']start_urls = ['http://bwog.com/']

def parse(self, response):for section in response.xpath('//div[@class="blog-section"]'):

link = section.xpath('.//a/@href').extract()[0]yield scrapy.Request(link, callback=self.parse_entry)

next = response.xpath('//div[@class="comnt-btn"]//@href').extract()[0]yield scrapy.Request(next, callback=self.parse)

def parse_entry(self, response):for comment in response.xpath('//div[@class="reg-comment-body "]'):

paragraphs = comment.xpath('.//text()').extract()text = '\n'.join(paragraphs).strip()item = TutorialItem()item['content'] = textyield item

Following links

Page 31: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Wait for it...scrapy crawl bwog -o comments.json

Page 32: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

...and on it goes!

Page 33: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

Next steps

Scrapy Shell (interactive shell console)

Feed Exports (JSON, CSV, XML)

Item Pipelines (writing to MongoDB, duplicates filter)

Page 34: Web scrapingcarlosgmartin.com/scrapingslides.pdf · 2020-04-04 · Web scraping with Scrapy. Web crawler a program that systematically browses the web. ... Scrapy web scraping framework

doc.scrapy.orgfor more information