Pydata-Python tools for webscraping
Transcript of Pydata-Python tools for webscraping
Python tools for webscraping
José Manuel Ortega
@jmortegac
SpeakerDeck space https://speakerdeck.com/jmortega
Github repository https://github.com/jmortega/pydata_webscraping
Agenda
Scraping techniques
Introduction to webscraping
Python tools for webscraping
Scrapy project
Scraping techniques
Screen scraping
Report mining
Web scraping
Spiders / crawlers
Screen scraping
Selenium
Mechanize
Robobrowser
Selenium
Open source framework for automating browsers
Python module: http://pypi.python.org/pypi/selenium
pip install selenium
Firefox driver
Selenium
find_element_by_link_text('text'): finds a link by its text
find_element_by_css_selector: uses CSS selectors, just like with lxml
find_element_by_tag_name: 'a' for the first link, find_elements_... for all of them
find_element_by_xpath: finds elements with an XPath expression
find_element_by_class_name: CSS-related, but finds elements of any type that share the same class
Selenium YouTube
Selenium YouTube search
Report mining
Webscraping
Python tools
Requests
Beautiful Soup 4
Pyquery
Webscraping
Scrapy
Spiders / crawlers
A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider.
https://en.wikipedia.org/wiki/Web_crawler
Spiders / crawlers
scrapinghub.com
Requests http://docs.python-requests.org/en/latest
Web scraping with Python
1. Download the webpage with requests
2. Parse the page with BeautifulSoup/lxml
3. Select elements with regular expressions, XPath or CSS selectors
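The three steps can be sketched end to end. To keep the sketch runnable without network access, an inline HTML string stands in for the page that step 1 would download, and a regular expression does the selection; the talk's actual stack would use requests plus BeautifulSoup or lxml:

```python
import re

# Step 1 (simulated): the HTML that requests.get(url).text would return.
html = """
<html><body>
  <a href="http://pydata.org/madrid2016">PyData Madrid 2016</a>
  <a href="http://scrapy.org">Scrapy</a>
</body></html>
"""

# Steps 2-3: select every link target with a regular expression.
links = re.findall(r'href="([^"]+)"', html)
print(links)
```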
XPath selectors

| Expression | Meaning |
| --- | --- |
| `name` | matches all nodes at the current level with the specified name |
| `name[n]` | matches the nth element at the current level with the specified name |
| `/` | selects from the root (direct children) |
| `//` | selects matching descendants at any depth |
| `*` | matches all nodes at the current level |
| `.` or `..` | selects the current / parent node |
| `@name` | the attribute with the specified name |
| `[@key='value']` | all elements with an attribute matching the specified key/value pair |
| `name[@key='value']` | all elements with the specified name and a matching attribute |
| `[text()='value']` | all elements with the specified text |
| `name[text()='value']` | all elements with the specified name and text |
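These expressions can be tried out with the standard library's xml.etree.ElementTree, which supports a subset of XPath (a stdlib stand-in here; the talk's tools rely on lxml and Scrapy selectors, and the sample document below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A small document to try the table's expressions on.
doc = ET.fromstring("""
<schedule>
  <talk track="main"><title>Python tools for webscraping</title></talk>
  <talk track="tutorial"><title>Intro to pandas</title></talk>
</schedule>
""")

# name: all <talk> children of the current node
talks = doc.findall('talk')

# name[@key='value']: talks whose track attribute is "main"
main = doc.findall("talk[@track='main']")

# //name (written .//name in ElementTree): <title> elements at any depth
titles = doc.findall('.//title')

print(len(talks), len(main), [t.text for t in titles])
```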
BeautifulSoup
Supported parsers: lxml, html5lib
Installation
pip install lxml
pip install html5lib
pip install beautifulsoup4
http://www.crummy.com/software/BeautifulSoup
BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
Print all: print(soup.prettify())
Print text: print(soup.get_text())
BeautifulSoup functions
find_all('a'): returns all links
find('title'): returns the first <title> element
get('href'): returns the value of the href attribute
(element).text: returns the text inside an element

for link in soup.find_all('a'):
    print(link.get('href'))
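If BeautifulSoup is not available, the same find_all('a') / get('href') idea can be approximated with the standard library's html.parser. This is a stdlib sketch, not the bs4 API; the class name and sample HTML are made up:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, like soup.find_all('a')."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p><a href="http://scrapy.org">Scrapy</a> '
            '<a href="http://www.crummy.com/software/BeautifulSoup">bs4</a></p>')
print(parser.links)
```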
External/internal links
http://pydata.org/madrid2016
Webscraping
pip install webscraping

from webscraping import download, xpath

# Download instance
D = download.Download()
# get the page
html = D.get('http://pydata.org/madrid2016/schedule/')
# get the element where the information is located
xpath.search(html, '//td[@class="slot slot-talk"]')
Pydata agenda code structure
Extract data from pydata agenda
PyQuery
Scrapy installation
pip install scrapy
Scrapy
Uses a mechanism based on XPath expressions called XPath selectors.
Uses the lxml parser to find elements.
Uses Twisted for asynchronous operations.
Scrapy advantages
Faster than mechanize because it uses asynchronous operations (Twisted).
Better support for HTML parsing.
Better support for unicode characters, redirections, gzipped responses and encodings.
Exports the extracted data directly to JSON, XML and CSV.
Architecture
Scrapy Shell
scrapy shell <url>
from scrapy.selector import Selector
sel = Selector(response)
info = sel.xpath('//div[@class="slot-inner"]')
Scrapy Shell
scrapy shell http://scrapy.org
Scrapy project
$ scrapy startproject <project_name>
scrapy.cfg: the project configuration file.
tutorial/: the project's Python module.
items.py: the project's items file.
pipelines.py: the project's pipelines file.
settings.py: the project's settings file.
spiders/: the spiders directory.
Pydata conferences
Generating spiders
$ scrapy genspider -t basic <SPIDER_NAME> <DOMAIN>
$ scrapy list
Lists the project's spiders
Pydata spider
Pipelines
ITEM_PIPELINES = {
    'pydataSchedule.pipelines.PyDataSQLitePipeline': 100,
    'pydataSchedule.pipelines.PyDataJSONPipeline': 200,
}
pipelines.py
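The ITEM_PIPELINES setting points at classes defined in pipelines.py. A minimal sketch of what a JSON pipeline there might look like (the class name matches the setting, but the body is a hypothetical re-creation, not the talk's actual code; pipelines are plain classes, so no scrapy import is needed here):

```python
import json

class PyDataJSONPipeline:
    """Minimal sketch of a JSON item pipeline."""

    def open_spider(self, spider):
        # Called when the spider starts: open the output file.
        self.file = open('items.json', 'w')
        self.items = []

    def process_item(self, item, spider):
        # Called for every scraped item; collect it and pass it on.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Called when the spider finishes: dump everything as JSON.
        json.dump(self.items, self.file)
        self.file.close()
```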
Pydata SQLitePipeline
Execution
$ scrapy crawl <spider_name>
$ scrapy crawl <spider_name> -o items.json -t json
$ scrapy crawl <spider_name> -o items.csv -t csv
$ scrapy crawl <spider_name> -o items.xml -t xml
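The -o flag exports the scraped items as a JSON array, which is easy to load back for inspection or analysis. A small sketch, using an inline sample string in place of the real items.json (the field names below are illustrative, not the talk's actual item schema):

```python
import json

# Sample of what `scrapy crawl ... -o items.json` emits: a JSON array of items.
exported = '[{"title": "Python tools for webscraping", "speaker": "Jose Manuel Ortega"}]'

items = json.loads(exported)
for item in items:
    print(item["title"], "-", item["speaker"])
```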
Pydata conferences
Launch spiders without scrapy command
Scrapy Cloud
http://doc.scrapinghub.com/scrapy-cloud.html
https://dash.scrapinghub.com
$ pip install shub
$ shub login
Insert your ScrapingHub API Key:
Scrapy Cloud /scrapy.cfg

# Project: demo
[deploy]
url = https://dash.scrapinghub.com/api/scrapyd/
# API_KEY
username = ec6334d7375845fdb876c1d10b2b1622
password =
# project identifier
project = 25767
Scrapy Cloud
$ shub deploy
Scrapy Cloud Scheduling
curl -u APIKEY: https://dash.scrapinghub.com/api/schedule.json -d project=PROJECT -d spider=SPIDER
References
http://www.crummy.com/software/BeautifulSoup
http://scrapy.org
https://pypi.python.org/pypi/mechanize
http://docs.webscraping.com
http://docs.python-requests.org/en/latest
http://selenium-python.readthedocs.org/index.html
https://github.com/REMitchell/python-scraping
Books
Thank you!