Prezdev parsing & crawling libs


Description: Parsing & crawling libs we don't use at Pricing Assistant

Transcript of Prezdev parsing & crawling libs

Page 1: Prezdev parsing & crawling libs

Parsing & Crawling libs

WE DON'T USE


Page 2: Prezdev parsing & crawling libs

Beautiful Soup
- built on top of lxml and html5lib
- higher-level commands
- handles encoding itself

Example:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
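A side note on the "handles encoding itself" point above: the sniffing is done by UnicodeDammit, which Beautiful Soup also exposes directly. This extra snippet is ours, not from the deck:

from bs4 import UnicodeDammit

# raw bytes with no charset declaration; UnicodeDammit guesses the encoding
dammit = UnicodeDammit(b"Sacr\xe9 bleu!")
dammit.unicode_markup
# u'Sacré bleu!'
dammit.original_encoding
# a latin-1-family guess, e.g. 'windows-1252'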

Page 3: Prezdev parsing & crawling libs

Beautiful Soup - Y U no use me?
- yet another kind of soup
- gotta go a step lower (see the lxml sketch below)
- crappy acronym
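A minimal sketch of what "going a step lower" could look like: the same lookups as the Beautiful Soup example on the previous slide, done with lxml directly. The trimmed-down html_doc here is ours, not from the deck:

import lxml.html

# a cut-down version of the Dormouse document from the previous slide
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</body></html>"""

doc = lxml.html.fromstring(html_doc)

doc.findtext(".//title")
# "The Dormouse's story"

doc.xpath("//a[@id='link3']/text()")
# ['Tillie']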


Page 4: Prezdev parsing & crawling libs

html5lib
- implements the WHATWG HTML5 specification
- will inject tbody and such
- is actually usable directly in lxml, we could use it (sketch below)
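A minimal sketch of the "usable directly in lxml" point, assuming the stock html5lib API (treebuilder="lxml" asks it to build an lxml tree; note that tags land in the XHTML namespace by default):

import html5lib
from lxml import etree

# tag soup with no <tbody>; html5lib applies the browser recovery rules
broken = "<table><tr><td>19,99</td></tr></table>"

tree = html5lib.parse(broken, treebuilder="lxml")

# the implied <tbody> has been injected around the <tr>
print(etree.tostring(tree, pretty_print=True).decode())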


Page 5: Prezdev parsing & crawling libs

html5lib - Y U no use me?
- Y would I? uuhh


Page 6: Prezdev parsing & crawling libs

Scrapy
- write rules
- built-in handling of compression, cache, cookies, authentication, user-agent spoofing, robots.txt handling, statistics, crawl depth restriction, etc.
- extendable: middlewares, extensions, and pipelines
- Web management console for monitoring and controlling your bot
- Telnet console for low-level access to the Scrapy process

from scrapy.item import Item, Field

class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class MininovaSpider(CrawlSpider):

    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        sel = Selector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = sel.xpath("//h1/text()").extract()
        torrent['description'] = sel.xpath("//div[@id='description']").extract()
        torrent['size'] = sel.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent


Page 7: Prezdev parsing & crawling libs

Scrapy Shell

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   sel        <Selector xpath=None data=u'<html>\n <head>\n <meta charset="utf-8'>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]


Page 8: Prezdev parsing & crawling libs

Scrapy Settings (defaults sketched below)
- CONCURRENT_ITEMS
- CONCURRENT_REQUESTS
- CONCURRENT_REQUESTS_PER_DOMAIN
- CONCURRENT_REQUESTS_PER_IP
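For reference, a hypothetical settings.py tuning these four knobs; only the setting names come from the slide, the values shown are Scrapy's documented defaults:

# hypothetical settings.py sketch; values are Scrapy's defaults
CONCURRENT_ITEMS = 100                # items processed in parallel per response
CONCURRENT_REQUESTS = 16              # global cap on concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # per-domain cap
CONCURRENT_REQUESTS_PER_IP = 0        # per-IP cap; 0 disables it (when
                                      # non-zero, it overrides the per-domain cap)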


Page 9: Prezdev parsing & crawling libs

Scrapy - Y U no use me?
- I want to!
- how to integrate the scrapy daemon with MRQ?
- have to implement a rotating-proxies middleware (see the sketch below)
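A hedged sketch of what that rotating-proxies downloader middleware could look like; the class name and PROXIES list are made up, but request.meta['proxy'] is the hook that Scrapy's built-in HttpProxyMiddleware honours:

import random

# hypothetical middleware: the names and proxy list are ours, not the deck's
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware picks up request.meta['proxy']
        request.meta["proxy"] = random.choice(PROXIES)

It would still have to be enabled under DOWNLOADER_MIDDLEWARES in the settings, and real-world rotation also needs retry and ban handling, which is the actual work.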


Page 10: Prezdev parsing & crawling libs

Scrapy - Scrapinghub


Page 11: Prezdev parsing & crawling libs

Did I miss something?
- mechanize, twill => shitty deprecated crawling modules
- I forgot their names => black-box paid services


Page 12: Prezdev parsing & crawling libs

Did I miss something?

GET LARGE


Page 13: Prezdev parsing & crawling libs

Adrien Di Pasquale

16/05/2014
