Web scraping and social media scraping – introduction

Jacek Lewkowicz, Dorota Celinska-Kopczynska

University of Warsaw

March 18, 2019




Motivation

Tons of (potentially useful) information on the Web

Instead of gathering the needed information manually, we employ computer programs to do it automatically

We have to understand the structure of the webpage and find a general scheme for data collection

Web scraping may be useful and efficient here




Web scraping

The process of taking unstructured information from the Web and processing it into structured data for later analysis

Commonly used, and not only by researchers: Web crawlers / web spiders / search bots are used by search engines

To scrape the web (= drag; scraping), not to scrap (= discard; scrapping)




Bots, crawlers, spiders, scrapers

A bot is a software application that runs automated, usually repetitive, tasks (scripts) over the Internet

A crawler, or spider, is a bot which systematically browses the Internet, typically for the purpose of Web indexing. In other words, a crawler surfs the web by following links; it does not have to extract data

A scraper takes downloaded pages and attempts to extract data from them

However, there are no precise definitions
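
The difference between following links and extracting data can be sketched with the Python standard library alone. This is only an illustration: the class names LinkCollector and TitleScraper, the sample HTML, and the base address are hypothetical, and no network access is involved.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page content, standing in for a downloaded document.
SAMPLE_HTML = """
<html><head><title>Election results</title></head>
<body>
  <a href="/region/1">Region 1</a>
  <a href="https://example.org/about">About</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawler-like behaviour: only collects the links to visit next."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, href))


class TitleScraper(HTMLParser):
    """Scraper-like behaviour: extracts a piece of data (the page title)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


crawler = LinkCollector("https://example.org/index.html")
crawler.feed(SAMPLE_HTML)

scraper = TitleScraper()
scraper.feed(SAMPLE_HTML)

print(crawler.links)
print(scraper.title)
```

In practice the two roles are often combined in one program, which is one reason why the definitions stay imprecise.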




Example areas of usage

Download results of general elections

Download demo MP3

Download legal decisions

Download comments and users' activity from social media




Alternatives to traditional scraping

ScraperWiki

Google Refine

Outwit

SocSciBot

(arguably) APIs, already scraped collections of data available online (e.g. GHTorrent), or API mirrors




Is it legal?

Technically, it usually is. However, there are situations in which you are not allowed to gather data from a given website

Look for disclaimers and Terms of (Fair) Use

Scraping may mean copyright infringement

Data privacy, sensitive information

Web crawling – a difficult matter





Controversial cases

eBay vs. Bidder's Edge

One company crawls information from another

Facebook vs. Meltwater

Prior written permissions needed

US vs. Aaron Swartz

A.S. was arrested in 2011 for having illegally downloaded millions of articles from the article archive JSTOR. The case was dismissed after Swartz's suicide in January 2013

GHTorrent issue #32 incident

http://speakerdeck.com/player/1c64fd1e7dfe4032aff246b2dd1195bf





Web scraping for social good or evil?

People tend to be extremely paranoid about their data privacy while, at the same time, being extremely careless about it

The same people who cover their laptop cameras may not care about the photos they upload to Facebook or other social media, or about the apps they are using

Celebgate: http://en.wikipedia.org/wiki/ICloud_leaks_of_celebrity_photos

AI gone too far: http://www.telegraph.co.uk/technology/2017/09/08/ai-can-tell-people-gay-straight-one-photo-face





Counteracting scraping – robots.txt

Many web crawlers are hunting for content

Web robots keep some indices up to date

The robots.txt file and other methods are used to keep the server traffic in check

User-agent: *

Disallow: /private/





robots.txt – limitations

Considered an outdated standard due to its limitations

A subsite can still be indexed if another site redirects to it

Even if a subsite is not listed in robots.txt, it does not mean the owner allows scraping it

Imagine a host telling a guest: "OK, feel at home, but do not take the money I put in the jar on the third shelf in the kitchen cupboard." The guest should obey, but will they? The same goes for robots.txt, which discloses information that should stay hidden, e.g. the links to the admin's login pages

Read: http://support.google.com/webmasters/answer/6062608





Counteracting scraping – <meta> tags

A modern approach to blocking bots from accessing a part of the website is the use of <meta> tags

<meta name="robots" content="noindex">

Read: https://support.google.com/webmasters/answer/93710
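
A crawler can check for such a tag before keeping a page. The sketch below uses only the standard library; the names RobotsMetaParser and allows_indexing are hypothetical, and it covers only the "noindex" directive from the example above plus "none", which search engines treat as "noindex, nofollow".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # The content attribute may hold several comma-separated directives.
            for directive in content.split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def allows_indexing(html):
    """Return False when the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(allows_indexing(page))
```

As with robots.txt, the tag is a request to well-behaved bots, not a technical barrier.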




What makes a crawler polite?

A polite crawler respects robots.txt

A polite crawler never degrades a website's performance

A polite crawler identifies its creator with contact information

A polite crawler does not drive system administrators crazy




Responsible crawling 1 – Terms of Use

Make sure your crawler follows the rules contained in the Terms of (Fair) Use

Do check the robots.txt file

(We know you have no time.) You may write a program which parses robots.txt and identifies particular user-agents and the corresponding rules of access
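
Such a program does not have to be written from scratch: Python's standard library ships urllib.robotparser. The sketch below is a minimal illustration; the robots.txt content, the user-agent names MyCrawler and BadBot, and the example.com addresses are all hypothetical, and the file is parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the site's /robots.txt address instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
page = "https://www.example.com/index.html"
secret = "https://www.example.com/private/notes.html"

# Ask, per user-agent, whether a given address may be fetched.
print(rules.can_fetch("MyCrawler", page))
print(rules.can_fetch("MyCrawler", secret))
print(rules.can_fetch("BadBot", page))
```

Keep in mind that passing this check says nothing about the Terms of (Fair) Use, which still have to be read by a human.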




Responsible crawling 2 – Identify yourself

Include your company name and an email address or website in the request's User-Agent header

For example, Google's crawler user agent is "Googlebot"
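
A minimal sketch of setting such a header with the standard library follows. The identity string, the build_request helper, and the example.com address are hypothetical; the request object is only constructed here, nothing is sent.

```python
from urllib.request import Request

# Hypothetical identity; replace it with your own project name and contact.
USER_AGENT = "MyCompany-MyCrawler (+https://www.example.com/bot; bot@example.com)"


def build_request(url):
    """Create a request that identifies the crawler; nothing is sent yet."""
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("https://www.example.com/")

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing the object to urllib.request.urlopen would then send the request with this header.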




Responsible crawling 3 – Crawl delay

Make sure your crawler does not hit a website too hard

Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive

If you do not, they will make you

2012-12-21 05:04:36+0800 [working] DEBUG: Retrying <GET http://www.example.com/profile.php?id=1580> (failed 1 times): 503 Service Unavailable
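
One way to honour the directive is to read it with urllib.robotparser and pause between requests. The sketch below makes several assumptions: the robots.txt content is made up, the one-second fallback is an arbitrary choice, and the delay_for and crawl helpers are hypothetical. The fetch function is passed in, so the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content with a Crawl-delay directive.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

DEFAULT_DELAY = 1.0  # assumed fallback, in seconds, when no delay is declared


def delay_for(robots_txt, user_agent):
    """Return the declared crawl delay for the agent, or the fallback."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.crawl_delay(user_agent)
    return float(declared) if declared is not None else DEFAULT_DELAY


def crawl(urls, fetch, delay, sleep=time.sleep):
    """Fetch the URLs one by one, pausing between consecutive requests."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay)
        results.append(fetch(url))
    return results


print(delay_for(ROBOTS_TXT, "MyCrawler"))
```

A call such as crawl(list_of_urls, fetch=my_download_function, delay=delay_for(ROBOTS_TXT, "MyCrawler")) would then space the requests out; my_download_function stands for whatever download routine you use.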




Responsible crawling 4 – Take care of your data

You should pay attention to what happens with the data you gathered

Consider which part of the analyses and/or data sets can be displayed publicly

Respect the sensitivity of the information




Good practices & ethics in a nutshell

Respect the hosting site's wishes

robots.txt as a list of instructions to bots/scrapers

Terms of (Fair) Use – which elements of the website you should not scrape

Respect the hosting site's bandwidth

Scraping may be costly or result in the site going down

Respect the law

Terms & agreements

Take care of the data

Consider which information can be displayed publicly or not

If possible, use APIs, API mirrors, or collections of data already scraped for you




Requests library

Python core libraries may be enough, but there are more appropriate packages

The Requests library is suitable for complicated HTTP requests, cookies, headers, and other issues
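
A short sketch of how Requests handles headers, cookies, and query parameters is given below. It assumes the requests package is installed; the crawler identity, the cookie, and the example.com address are hypothetical. The request is only prepared, not sent, so its final form can be inspected without touching the network.

```python
import requests

# A session keeps headers and cookies across requests.
session = requests.Session()
session.headers.update({"User-Agent": "MyCompany-MyCrawler (bot@example.com)"})
session.cookies.set("consent", "yes")  # hypothetical cookie

# Describe the request; the query parameters are encoded for us.
request = requests.Request(
    "GET",
    "https://www.example.com/search",
    params={"q": "web scraping", "page": 2},
)

# Merge the session settings into the request without sending it.
prepared = session.prepare_request(request)

print(prepared.url)
print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```

Calling session.send(prepared) would perform the request; for simple cases session.get(url, params=...) does both steps at once.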





Beautiful what?

Beautiful Soup is a Python library

Named after Lewis Carroll's poem from Alice's Adventures in Wonderland

The tool is for making sense of the nonsensical

Beautiful Soup, so rich and green,

Waiting in a hot tureen!

Who for such dainties would not stoop?

Soup of the evening, beautiful Soup!





Installing Beautiful Soup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://pypi.python.org/pypi/setuptools

Think about setting up Python virtual environments

Unix-like (Mac and GNU/Linux)

sudo easy_install pip

pip install beautifulsoup4

Other environments

python setup.py install






Problems after installation

The authors of the library suggest that the known problems are usually related to Python 2 vs. Python 3 issues





Typical way of working

You work with a file containing Python code and import the necessary libraries

It requires some skills in programming
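
A file of this kind might look like the sketch below. It assumes the beautifulsoup4 package is installed; the HTML snippet and the variable names are hypothetical and stand in for a page you have already downloaded.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, standing in for a downloaded document.
HTML = """
<html><head><title>Legal decisions</title></head>
<body>
  <ul id="decisions">
    <li><a href="/case/1">Case 1</a></li>
    <li><a href="/case/2">Case 2</a></li>
  </ul>
</body></html>
"""

# "html.parser" is the parser bundled with Python, so no extra package
# besides beautifulsoup4 is needed.
soup = BeautifulSoup(HTML, "html.parser")

# Turn the unstructured markup into structured records.
title = soup.title.string
cases = [
    {"name": link.get_text(strip=True), "url": link["href"]}
    for link in soup.find_all("a")
]

print(title)
print(cases)
```

The resulting list of dictionaries can then be saved, for example as CSV or JSON, for later analysis.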





What is Scrapy?

Scrapy is a framework, which means a lot of issues related to scraping have already been solved for you

...and you may use "automagical" tools to deal with them (with all the blessings and curses this brings)

Requires minimal knowledge and skills in Python and programming

Handles a lot of the complexity of finding & evaluating links on a website, scraping whole domains or lists of domains





Scrapy – installation

http://scrapy.org

http://doc.scrapy.org/en/latest/intro/install.html

Think about setting up Python virtual environments (recommended)

pip install Scrapy

Other environments

Read the manual; using Anaconda is recommended





Scrapy – dependencies

If you encounter problems during the installation, you should probably check the dependencies, e.g.

lxml – an efficient XML and HTML parser

w3lib – a multi-purpose helper for dealing with URLs and web page encodings

cryptography and pyOpenSSL – to deal with various network-level security needs





Scrapy – typical way of working

You work with projects in a new directory

Each Scrapy object stands for a single page on the website

Each scraper has a few core functions and structures

Crawlers are run from the command line

Troubleshooting is usually done within configuration files (settings.py)





Scrapy – responsible scraping

ROBOTSTXT_OBEY = True

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

DOWNLOAD_DELAY

CONCURRENT_REQUESTS_PER_DOMAIN

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
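
Put together, these settings might appear in a project's settings.py as in the fragment below. The setting names are the ones listed above; the numeric values are illustrative assumptions, not recommendations from the slides, and should be tuned to the site you crawl.

```python
# settings.py (fragment): the values below are illustrative assumptions.

# Obey the rules published in each site's robots.txt.
ROBOTSTXT_OBEY = True

# Identify the crawler and give a way to contact its operator.
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"

# Wait this many seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 5.0

# Limit the number of simultaneous requests sent to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

With ROBOTSTXT_OBEY enabled, requests to disallowed addresses are skipped and reported in the log, as in the "Forbidden by robots.txt" line above.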




Homework

Install the libraries and frameworks on your hardware

Read the Introduction and Creating a project sections in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
