Web scraping and social media scraping – handling JS
Jacek Lewkowicz, Dorota Celinska-Kopczynska
University of Warsaw
April 9, 2019
What will we be working on today?
Most modern websites use JavaScript (JS)
With JS the content of the website is generated dynamically
This may make scraping content impossible or significantly more difficult:
1 A part of the website may not be rendered correctly
2 Access to some areas may be granted only upon clicking a button
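To see why this breaks a classic scraper, consider a page whose table rows are filled in by a script after load. A static parser only ever sees the empty placeholder. A minimal sketch (the page markup and the "title" class are made up for illustration):

```python
from html.parser import HTMLParser

# A page whose article titles are injected by JavaScript after load.
# A plain HTTP fetch returns only this static skeleton.
PAGE = """
<html><body>
  <table id="articles"></table>
  <script>
    // runs only in a browser: fills the table with rows
    document.getElementById('articles').innerHTML =
      '<tr><td class="title">My first post</td></tr>';
  </script>
</body></html>
"""

class TitleParser(HTMLParser):
    """Collects the text inside <td class="title"> cells."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

parser = TitleParser()
parser.feed(PAGE)
print(parser.titles)  # the rows exist only after the script runs -> []
```

The `<script>` body is treated as raw text by the parser, so the row it would create in a browser is never seen — exactly the situation the spider below runs into.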
Convention
In snippets, we will highlight in violet the areas where you may put your own content
In commands, the areas in [] are optional
UNIX-like systems use “/” as the path separator and DOS uses “\”. In this presentation the paths will be written in UNIX-like convention if not stated otherwise
JavaScript
A high-level, dynamic, untyped, interpreted language
One of the three core languages of web development (and the most popular language on GitHub!)
Used to make dynamic webpages interactive and to provide online programs, including video games
Problem – getting blog content
Let us assume that we want to collect the titles of a blog’s articles
Looks easy! They are stored in a table. We already wrote similar scrapers two weeks ago
Very basic spider
import scrapy
from scrapy import Request

class exItem(scrapy.Item):
    title = scrapy.Field()

class exSpider(scrapy.Spider):
    name = 'ex'
    start_urls = ['http://your-site-here.com']

    def parse(self, response):
        # iterate over all matched titles instead of a hard-coded range
        for title in response.xpath('//your/xpath/here/text()').extract():
            item = exItem()
            item['title'] = title
            yield item
Output of scraper
We do not extract anything... at all
Let us debug the code, e.g., with scrapy shell!
a blank page... but... it worked in the browser...
Solution #1 – naive
Open the website in a browser
Save the source code of the website after the page is loaded
Work on a local copy of the source code
Pros: easy and sometimes may be a good workaround
Cons: tedious with limited possibilities, also slow
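With the naive workaround, the scraper reads the saved copy instead of fetching the URL. A minimal sketch using the stdlib parser in place of Scrapy selectors (the file name and the "title" class are made up; the saved source already contains the JS-generated rows):

```python
from html.parser import HTMLParser
from pathlib import Path

# Pretend this is the source we saved from the browser *after* the
# JavaScript ran: the table rows are now present in the markup.
Path("saved_page.html").write_text(
    '<html><body><table>'
    '<tr><td class="title">First post</td></tr>'
    '<tr><td class="title">Second post</td></tr>'
    '</table></body></html>'
)

class TitleParser(HTMLParser):
    """Collects the text inside <td class="title"> cells."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

parser = TitleParser()
parser.feed(Path("saved_page.html").read_text())
print(parser.titles)  # ['First post', 'Second post']
```

This works, but every page has to be opened, loaded, and saved by hand first — which is exactly the tedium the pros/cons above point out.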
Solution #2 – PhantomJS
A headless browser
Has no graphical interface, which is where the name originated (the user looks like a ghost)
http://phantomjs.org/
Pros: usually a good workaround
Cons: limited possibilities, sometimes does not render correctly, development suspended
Solution #3 – Selenium
A web driver – you may control various browsers from your code
http://www.seleniumhq.org/
Pros: a mature project, does not require human activity
Cons: nearly none, but for some reasons it is not covered during the course (:
Scrapy + Selenium
# example from https://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page
import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            try:
                # looking up "next" fails on the last page, ending the loop
                next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next.click()
                # get the data and write it to scrapy items
            except:
                break
        self.driver.close()
Solution #4 – Splash
A monster child of the Scrapy developers...
https://www.reddit.com/r/Python/comments/2xp5mr/handling_javascript_in_scrapy_with_splash/cp2vgd6/
Pros: relatively easy to use, does not require human activity, great Scrapy integration
Cons: probably a lot
Splash – Installation
pip install scrapy-splash
Typically one works with an instance of Splash in a Docker container
docker run -p 8050:8050 scrapinghub/splash is usually enough
Splash – Configuration in settings.py
1 Add the Splash server address:
  SPLASH_URL = 'http://192.168.59.103:8050'
2 Enable the Splash downloader middlewares:
  DOWNLOADER_MIDDLEWARES = {
      'scrapy_splash.SplashCookiesMiddleware': 723,
      'scrapy_splash.SplashMiddleware': 725,
      'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
  }
3 Enable the spider middlewares:
  SPIDER_MIDDLEWARES = {
      'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
  }
4 Set a custom dupefilter class:
  DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
5 For more options see https://github.com/scrapy-plugins/scrapy-splash
Adding Splash to Scrapy code
import scrapy
from scrapy import Request

class exItem(scrapy.Item):
    title = scrapy.Field()

class exSpider(scrapy.Spider):
    name = 'ex'
    start_urls = ['http://your-site-here.com']

    # A convenient way is to pass the information about Splash in the request metadata
    # this setup can be used in any project (always looks the same)
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse,
                                 meta={'splash': {'endpoint': 'render.html',
                                                  'args': {'wait': 0.5}}})

    def parse(self, response):
        # iterate over all matched titles instead of a hard-coded range
        for title in response.xpath('//your-xpath-here/text()').extract():
            item = exItem()
            item['title'] = title
            yield item
Output with Scrapy-Splash
Output of Scrapy shell
scrapy shell 'http://localhost:8050/render.html?url=http://your-site-here.com&timeout=1&wait=0.5'
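A missing “&” between parameters is easy to produce when the render.html URL is concatenated by hand. A small helper that URL-encodes the parameters avoids that; a sketch, assuming a local Splash instance on port 8050 (the helper name and default values are our own):

```python
from urllib.parse import urlencode

def splash_render_url(url, splash="http://localhost:8050", timeout=10, wait=0.5):
    """Build a render.html URL for a Splash instance, with every
    parameter properly separated and percent-encoded."""
    params = urlencode({"url": url, "timeout": timeout, "wait": wait})
    return f"{splash}/render.html?{params}"

print(splash_render_url("http://your-site-here.com"))
# http://localhost:8050/render.html?url=http%3A%2F%2Fyour-site-here.com&timeout=10&wait=0.5
```

The resulting string can be passed straight to scrapy shell in quotes, as in the command above.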
Additional links and tutorials
https://dzone.com/articles/perform-actions-using-javascript-in-python-seleniu
https://simpletutorials.com/c/2205/Basic%20Web%20Scraper%20using%20Python%2C%20Selenium%2C%20and%20PhantomJS
https://www.guru99.com/execute-javascript-selenium-webdriver.html
https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r
https://gist.github.com/hrbrmstr/dc62bb2b35617e9badc5
https://www.rladiesnyc.org/post/scraping-javascript-websites-in-r/