Webscraping with asyncio


Transcript of Webscraping with asyncio

Webscraping with Asyncio

José Manuel Ortega @jmortegac

Python conferences: https://speakerdeck.com/jmortega

Python conferences: http://jmortega.github.io/

GitHub repository: https://github.com/jmortega/webscraping_asyncio_2016

Agenda

▶ Webscraping python tools
▶ Requests vs aiohttp
▶ Introduction to asyncio
▶ Async client/server
▶ Building a webcrawler with asyncio
▶ Alternatives to asyncio

Webscraping

Python tools

➢ Requests
➢ Beautiful Soup 4
➢ Pyquery
➢ Webscraping
➢ Scrapy

Python tools

➢ Mechanize
➢ Robobrowser
➢ Selenium

Requests http://docs.python-requests.org/en/latest

Web scraping with Python

1. Download the webpage with an HTTP module (requests, urllib, aiohttp)

2. Parse the page with BeautifulSoup / lxml

3. Select elements with regular expressions, XPath or CSS selectors

4. Store the results in a database, CSV or JSON
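As a quick illustration of these four steps, here is a minimal sketch using requests and BeautifulSoup; the URL and the output file name are placeholders, not taken from the slides.

# Minimal sketch of the four-step workflow above (requests + bs4 assumed installed)
import csv
import requests
from bs4 import BeautifulSoup

# 1. Download the webpage
response = requests.get("http://python.org")

# 2. Parse the page
soup = BeautifulSoup(response.text, "html.parser")

# 3. Select elements (here: every link) with a CSS selector
links = [a.get("href") for a in soup.select("a") if a.get("href")]

# 4. Store the results in a CSV file
with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([link] for link in links)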

BeautifulSoup

BeautifulSoup

▶ soup = BeautifulSoup(html_doc, 'html.parser')
▶ Print all: print(soup.prettify())
▶ Print text: print(soup.get_text())

from bs4 import BeautifulSoup

BeautifulSoup functions

▪ find_all('a') → Returns all links
▪ find('title') → Returns the first <title> element
▪ get('href') → Returns the value of the href attribute
▪ (element).text → Returns the text inside an element

for link in soup.find_all('a'):
    print(link.get('href'))

External/internal links

External/internal links: http://python.ie/pycon-2016/
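The PyCon example code from the slides is not captured in this transcript; a minimal sketch of classifying links as internal or external with BeautifulSoup and urllib.parse (assuming requests is available) could look like this:

from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup

base_url = "http://python.ie/pycon-2016/"
base_domain = urlparse(base_url).netloc

soup = BeautifulSoup(requests.get(base_url).text, "html.parser")
for a in soup.find_all("a", href=True):
    href = a["href"]
    domain = urlparse(href).netloc
    # Links with no domain or the same domain are internal, the rest external
    kind = "internal" if domain in ("", base_domain) else "external"
    print(kind, href)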

BeautifulSoup PyCon

BeautifulSoup PyCon Output

Parsers Comparison

PyQuery

PyQuery
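The PyQuery code on the slides is likewise not captured; a minimal sketch of the same link extraction with PyQuery's jQuery-style selectors (assuming the pyquery package is installed; the URL is a placeholder):

from pyquery import PyQuery

# Load a document directly from a URL
doc = PyQuery(url="http://python.ie/pycon-2016/")

# CSS selectors, jQuery style
print(doc("title").text())
for a in doc("a").items():
    print(a.attr("href"))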

PyQuery output

Spiders / crawlers

▶ A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider.

https://en.wikipedia.org/wiki/Web_crawler

Spiders / crawlers

scrapinghub.com

Scrapy: https://pypi.python.org/pypi/Scrapy/1.1.2

Scrapy

▶ Uses a mechanism based on XPath expressions called XPath selectors.
▶ Uses the lxml parser to find elements.
▶ Uses Twisted for asynchronous operations.

Scrapy advantages

▶ Faster than mechanize because it uses Twisted for asynchronous operations.
▶ Better support for HTML parsing.
▶ Better support for unicode characters, redirections, gzipped responses and encodings.
▶ You can export the extracted data directly to JSON, XML and CSV.

Export data

▶ scrapy crawl <spider_name>
▶ $ scrapy crawl <spider_name> -o items.json -t json
▶ $ scrapy crawl <spider_name> -o items.csv -t csv
▶ $ scrapy crawl <spider_name> -o items.xml -t xml
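For reference alongside these commands, here is a minimal spider sketch; the spider name, start URL and item field are illustrative, not the slides' example.

import scrapy

class LinksSpider(scrapy.Spider):
    # Hypothetical spider: name and start URL are placeholders
    name = "links"
    start_urls = ["http://python.ie/pycon-2016/"]

    def parse(self, response):
        # XPath selector for every href on the page
        for href in response.xpath("//a/@href").extract():
            yield {"url": href}

With this spider, scrapy crawl links -o items.json -t json writes every yielded item to items.json.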

The concurrency problem

▶ Different approaches:
▶ Multiple processes
▶ Threads
▶ Separate distributed machines
▶ Asynchronous programming (event loop)

Requests problems

▶ Requests operations block the main thread.
▶ It pauses until the operation is completed.
▶ We need one thread for each request if we want non-blocking operations.

Threads problems

▶ Overhead
▶ Stack size
▶ Context switches
▶ Synchronization

Solution

▶ DO NOT USE THREADS
▶ USE ONE THREAD + AN EVENT LOOP

New concepts

▶ Event loop
▶ Async
▶ Await
▶ Futures
▶ Coroutines
▶ Tasks
▶ Executors

Event loop implementations

▶ Asyncio
▶ https://docs.python.org/3.4/library/asyncio.html

▶ Tornado web server
▶ http://www.tornadoweb.org/en/stable

▶ Twisted
▶ https://twistedmatrix.com

▶ Gevent
▶ http://www.gevent.org

Asyncio def.

Asyncio

▶ Python >= 3.3
▶ Event-loop framework
▶ Asynchronous I/O
▶ Non-blocking approach with sockets
▶ All requests in one thread
▶ Event-driven switching
▶ aiohttp module for making requests asynchronously

Asyncio

▶ Interoperability with other frameworks

Requests vs aiohttp

#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def hello():
    async with ClientSession() as session:
        async with session.get("http://httpbin.org/headers") as response:
            response = await response.read()
            print(response)

loop = asyncio.get_event_loop()
loop.run_until_complete(hello())

import requests

def hello():
    return requests.get("http://httpbin.org/get")

print(hello())

Event Loop

▶ An event loop allows us to write asynchronous code using callbacks or coroutines.
▶ The event loop functions like a task switcher, just the way operating systems switch between active tasks on the CPU.
▶ The idea is that we have an event loop running until all scheduled tasks are completed.
▶ Futures and tasks are created through the event loop.

Event Loop

▶ An event loop is used to orchestrate the execution of the coroutines.
▶ loop = asyncio.get_event_loop()
▶ loop.run_until_complete(coroutine_or_future)
▶ loop.run_forever()
▶ loop.stop()

Starting Event Loop
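The slide's code is not in the transcript; a minimal sketch of starting and closing an event loop (the coroutine body is a placeholder):

import asyncio

async def main():
    # Placeholder work: yield control back to the loop for one second
    await asyncio.sleep(1)
    print("done")

loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
finally:
    loop.close()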

Coroutines

▶ Coroutines are functions that allow for multitasking without requiring multiple threads or processes.
▶ Coroutines are like functions, but they can be suspended or resumed at certain points in the code.
▶ Coroutines allow writing asynchronous code that combines the efficiency of callbacks with the classic good looks of multithreaded code.

Coroutines 3.4 vs 3.5

# Python 3.4 syntax: generator-based coroutines
import asyncio

@asyncio.coroutine
def fetch(self, url):
    response = yield from self.session.get(url)
    body = yield from response.read()

# Python 3.5 syntax: native coroutines with async/await
import asyncio

async def fetch(self, url):
    response = await self.session.get(url)
    body = await response.read()

Coroutines in event loop

#!/usr/local/bin/python3.5
import asyncio
import aiohttp

async def get_page(url):
    response = await aiohttp.request('GET', url)
    body = await response.read()
    print(body)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([get_page('http://python.org'),
                                      get_page('http://pycon.org')]))

Requests in event loop

import asyncio
import aiohttp
import requests

loop = asyncio.get_event_loop()

async def getpage_with_requests(url):
    return await loop.run_in_executor(None, requests.get, url)

# methods equivalents

async def getpage_with_aiohttp(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()

Tasks

▶ The asyncio.Task class is a subclass of asyncio.Future that encapsulates and manages coroutines.
▶ It allows independently running tasks to run concurrently with other tasks on the same event loop.
▶ When a coroutine is wrapped in a task, it connects the task to the event loop (a sketch follows below).
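A minimal sketch of wrapping coroutines in tasks and running them concurrently on one loop; the URLs and the delay are placeholders, not from the slides.

import asyncio

async def fetch(url):
    # Simulate I/O latency instead of a real request
    await asyncio.sleep(1)
    return url

loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(fetch(url))
         for url in ['http://python.org', 'http://pycon.org']]
loop.run_until_complete(asyncio.wait(tasks))
for task in tasks:
    print(task.result())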

Tasks


Tasks execution

Futures

▶ To manage a Future object in asyncio, we must declare the following:
▶ import asyncio
▶ future = asyncio.Future()
▶ https://docs.python.org/3/library/asyncio-task.html#future
▶ https://docs.python.org/3/library/concurrent.futures.html

Futures

▶ The asyncio.Future class is essentially a promise of a result.
▶ A Future returns its result when it becomes available, and once it receives a result, it passes it along to all the registered callbacks.
▶ Each future is a task to be executed in the event loop (see the sketch below).
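A minimal sketch of setting and consuming a Future's result on the event loop; the coroutine and its delay are illustrative.

import asyncio

async def compute(future):
    # Simulate some work, then fulfil the promise
    await asyncio.sleep(1)
    future.set_result('result ready')

loop = asyncio.get_event_loop()
future = asyncio.Future()
future.add_done_callback(lambda f: print('callback got:', f.result()))
asyncio.ensure_future(compute(future))
loop.run_until_complete(future)
print(future.result())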

Futures

Semaphores

▶ Adding synchronization
▶ Limiting the number of concurrent requests
▶ The argument indicates the number of simultaneous requests we want to allow.

sem = asyncio.Semaphore(5)

with (await sem):
    page = await get(url, compress=True)
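A minimal sketch of the same idea with aiohttp, using the async with form of the semaphore; the limit of 5 and the URL are placeholders.

import asyncio
import aiohttp

sem = asyncio.Semaphore(5)  # at most five requests in flight

async def get(url):
    # The semaphore makes the sixth request wait until a slot frees up
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.read()

urls = ['http://httpbin.org/get'] * 10
loop = asyncio.get_event_loop()
pages = loop.run_until_complete(asyncio.gather(*[get(u) for u in urls]))
print(len(pages))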

Async Client / Server

▶ asyncio.start_server
▶ server = asyncio.start_server(handle_connection, host=HOST, port=PORT)

Async Client / Server

▶ asyncio.open_connection

Async Client / Server
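The client/server code on the slides is not captured; a minimal sketch of an echo server and client built on asyncio.start_server and asyncio.open_connection (host, port and payload are placeholders):

import asyncio

HOST, PORT = '127.0.0.1', 8888

async def handle_connection(reader, writer):
    data = await reader.read(100)
    writer.write(data)          # echo the bytes back
    await writer.drain()
    writer.close()

async def client():
    reader, writer = await asyncio.open_connection(HOST, PORT)
    writer.write(b'hello')
    print(await reader.read(100))
    writer.close()

loop = asyncio.get_event_loop()
server = loop.run_until_complete(
    asyncio.start_server(handle_connection, host=HOST, port=PORT))
loop.run_until_complete(client())
server.close()
loop.run_until_complete(server.wait_closed())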

Async Web crawler

Async Web crawler

▶ Send asynchronous requests to all the links on a web page and add the responses to a queue to be processed as we go.
▶ Coroutines allow running independent tasks and processing their results in 3 ways:
▶ Using asyncio.as_completed → by processing the results as they come.
▶ Using asyncio.gather → only once they have all finished loading.
▶ Using asyncio.ensure_future

Async Web crawler

import asyncio
import random

@asyncio.coroutine
def get_url(url):
    wait_time = random.randint(1, 4)
    yield from asyncio.sleep(wait_time)
    print('Done: URL {} took {}s to get!'.format(url, wait_time))
    return url, wait_time

@asyncio.coroutine
def process_results_as_come_in():
    coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
    for coroutine in asyncio.as_completed(coroutines):
        url, wait_time = yield from coroutine
        print('Coroutine for {} is done'.format(url))

def main():
    loop = asyncio.get_event_loop()
    print("Process results as they come in:")
    loop.run_until_complete(process_results_as_come_in())

if __name__ == '__main__':
    main()

asyncio.as_completed

Async Web crawler execution

Async Web crawler

import asyncio
import random

@asyncio.coroutine
def get_url(url):
    wait_time = random.randint(1, 4)
    yield from asyncio.sleep(wait_time)
    print('Done: URL {} took {}s to get!'.format(url, wait_time))
    return url, wait_time

@asyncio.coroutine
def process_once_everything_ready():
    coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
    results = yield from asyncio.gather(*coroutines)
    print(results)

def main():
    loop = asyncio.get_event_loop()
    print("Process results once they are all ready:")
    loop.run_until_complete(process_once_everything_ready())

if __name__ == '__main__':
    main()

asyncio.gather

asyncio.gather

From the Python documentation, this is what asyncio.gather does:

asyncio.gather(*coros_or_futures, loop=None, return_exceptions=False)

Return a future aggregating results from the given coroutine objects or futures.

All futures must share the same event loop. If all the tasks are done successfully, the returned future's result is the list of results (in the order of the original sequence, not necessarily the order of results arrival). If return_exceptions is True, exceptions in the tasks are treated the same as successful results, and gathered in the result list; otherwise, the first raised exception will be immediately propagated to the returned future.

Async Web crawler

import asyncio
import random

@asyncio.coroutine
def get_url(url):
    wait_time = random.randint(1, 4)
    yield from asyncio.sleep(wait_time)
    print('Done: URL {} took {}s to get!'.format(url, wait_time))
    return url, wait_time

@asyncio.coroutine
def process_ensure_future():
    tasks = [asyncio.ensure_future(get_url(url)) for url in ['URL1', 'URL2', 'URL3']]
    results = yield from asyncio.wait(tasks)
    print(results)

def main():
    loop = asyncio.get_event_loop()
    print("Process ensure future:")
    loop.run_until_complete(process_ensure_future())

if __name__ == '__main__':
    main()

asyncio.ensure_future

Async Web crawler execution

Async Web downloader

Async Web downloader faster

Async Web downloader

▶ With get_partial_content

▶ With download_coroutine
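The bodies of get_partial_content and download_coroutine are not in the transcript; as a generic sketch (not the slides' code), an aiohttp download coroutine that streams a response to disk might look like this, with the URL and file name as placeholders:

import asyncio
import aiohttp

async def download(url, filename):
    # Stream the response body to disk in chunks
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            with open(filename, 'wb') as f:
                while True:
                    chunk = await response.content.read(1024)
                    if not chunk:
                        break
                    f.write(chunk)

loop = asyncio.get_event_loop()
loop.run_until_complete(download('http://httpbin.org/image/png', 'image.png'))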

Async Extracting links with regular expressions

Async Extracting links with bs4

Async Extracting links execution

▶ With bs4
▶ With regex
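The extraction code itself is not captured; a minimal sketch of fetching a page with aiohttp and extracting links both with a regular expression and with bs4 (the URL and the pattern are illustrative):

import asyncio
import re
import aiohttp
from bs4 import BeautifulSoup

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def links_with_regex(url):
    html = await fetch(url)
    return re.findall(r'href="(http[^"]+)"', html)

async def links_with_bs4(url):
    html = await fetch(url)
    soup = BeautifulSoup(html, 'html.parser')
    return [a.get('href') for a in soup.find_all('a', href=True)]

loop = asyncio.get_event_loop()
print(loop.run_until_complete(links_with_regex('http://python.org')))
print(loop.run_until_complete(links_with_bs4('http://python.org')))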

Parallel python

▶ SMP (symmetric multiprocessing) architecture with multiple cores in the same machine
▶ Distribute tasks in multiple machines
▶ Cluster

ProcessPoolExecutor

number_of_cpus = cpu_count()
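The rest of the slide's code is not captured; a minimal sketch of dispatching CPU-bound work from the event loop to a ProcessPoolExecutor (the worker function is a placeholder):

import asyncio
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count

def cpu_bound(n):
    # Placeholder CPU-bound work
    return sum(i * i for i in range(n))

def main():
    number_of_cpus = cpu_count()
    executor = ProcessPoolExecutor(max_workers=number_of_cpus)
    loop = asyncio.get_event_loop()
    tasks = [loop.run_in_executor(executor, cpu_bound, 10 ** 6)
             for _ in range(number_of_cpus)]
    print(loop.run_until_complete(asyncio.gather(*tasks)))

if __name__ == '__main__':
    main()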

Books

Books

Thank you!

@jmortegac

http://speakerdeck.com/jmortega

http://github.com/jmortega