Web scraping and social media scraping – introduction

Jacek Lewkowicz, Dorota Celińska-Kopczyńska

University of Warsaw

March 18, 2019

Source: coin.wne.uw.edu.pl/dcelinska/resources/webscraping/scraping_01.pdf

Motivation and definitions / Controversies / Good practices / Introduction to software / Homework

Motivation

Tons of (potentially useful) information on the Web

Instead of gathering the needed information manually, we employ computer programs to do it automatically

We have to understand the structure of the webpage and find a general scheme for data collection

Web scraping may be useful and efficient here

Web scraping

The process of taking unstructured information from the Web and processing it into structured data for later analysis

Commonly used, and not only by researchers: web crawlers / web spiders / search bots are used by search engines

To scrape the web (= to drag; scraping), not to scrap (= to discard; scrapping)
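The step from unstructured markup to structured data can be sketched with Python's standard library alone. This is only an illustration under stated assumptions: `SAMPLE_HTML`, `TableParser` and `table_to_records` are hypothetical names, and the election-style table is made up rather than taken from a real site.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Party</th><th>Votes</th></tr>
  <tr><td>Party A</td><td>1200</td></tr>
  <tr><td>Party B</td><td>950</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def table_to_records(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = table_to_records(SAMPLE_HTML)
```

The resulting list of dictionaries can be written to CSV or loaded into a data frame, which is the "structured data for later analysis" part of the definition.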

Bots, crawlers, spiders, scrapers

A bot is a software application that runs automated, usually repetitive tasks (scripts) over the Internet

A crawler or spider is a bot which systematically browses the Internet, typically for the purpose of Web indexing. In other words, a crawler surfs the web following links – it does not have to extract data

A scraper takes downloaded pages and attempts to extract data from them

However, there are no precise definitions
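To make the crawler side of the distinction concrete, here is a minimal sketch of link-following without any data extraction. It assumes an in-memory dictionary (`FAKE_WEB`) standing in for real downloads, so nothing touches the network; `LinkCollector`, `extract_links` and `crawl` are hypothetical helper names.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def crawl(start_url, fetch, limit=10):
    """Breadth-first crawl: follow links and record the visiting order."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        for link in extract_links(fetch(url), url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical three-page site used instead of real HTTP requests.
FAKE_WEB = {
    "http://www.example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://www.example.com/a": '<a href="/b">B</a>',
    "http://www.example.com/b": '<a href="/">home</a>',
}

visited = crawl("http://www.example.com/", lambda url: FAKE_WEB.get(url, ""))
```

A scraper would plug in at the point where `fetch(url)` returns the page: instead of (or in addition to) collecting links, it would pull the wanted fields out of the HTML.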

Example areas of usage

Download results of general elections

Download demo MP3

Download legal decisions

Download comments and users' activity from social media

Alternatives to traditional scraping

ScraperWiki

Google Refine

Outwit

SocSciBot

(Arguably) APIs, already scraped collections of data available online (e.g. GHTorrent), or API mirrors

Legal issues and famous cases / Preventing bots

Is it legal?

Technically, it usually is. However, there are situations in which you are not allowed to gather data from a given website

Look for disclaimers and Terms of (Fair) Use

Scraping may mean copyright infringement

Data privacy, sensitive information

Web crawling – a difficult matter

Controversial cases

eBay vs. Bidder's Edge

  One company crawls information from another

Facebook vs. Meltwater

  Prior written permissions needed

US vs. Aaron Swartz

  Swartz was arrested in 2011 for having illegally downloaded millions of articles from the article archive JSTOR. The case was dismissed after Swartz's suicide in January 2013

GHTorrent issue 32 incident

  http://speakerdeck.com/player/1c64fd1e7dfe4032aff246b2dd1195bf

Web scraping for social good or evil?

People tend to be extremely paranoid about their data privacy while at the same time being extremely careless about it

The same people who cover their laptop cameras may not care about the photos they upload to Facebook or other social media, or about the apps they use

Celebgate: http://en.wikipedia.org/wiki/ICloud_leaks_of_celebrity_photos

AI gone too far: http://www.telegraph.co.uk/technology/2017/09/08/ai-can-tell-people-gay-straight-one-photo-face/

Counteracting scraping – robots.txt

Many web crawlers are hunting for content

Web robots keep some indices up to date

The robots.txt file and other methods are used to keep server traffic in check

  User-agent: *

  Disallow: /private/

robots.txt – limitations

Considered an outdated standard because of its limitations

A subpage can still be indexed if another site redirects to it

Even if a subpage is not listed in robots.txt, it does not mean the owner allows you to scrape it

Imagine a situation between a host and a guest: "OK, feel at home, but do not take the money I put in the jar on the third shelf in the kitchen cupboard." The guest should obey, but will they? The same goes for robots.txt, which discloses information that should stay hidden, e.g. the links to the admin's login pages

Read: http://support.google.com/webmasters/answer/6062608

Counteracting scraping – <meta> tags

A more modern approach to keeping part of a website out of search indices is to use <meta> tags (a bot still has to download the page to read them)

  <meta name="robots" content="noindex">

Read: https://support.google.com/webmasters/answer/93710
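A scraper that wants to honour such tags has to look for them itself. The following is a minimal sketch using only the standard library; `RobotsMetaParser` and `robots_directives` are hypothetical names, and the sample pages are made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # The content attribute is a comma-separated list of directives.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
directives = robots_directives(page)
may_index = "noindex" not in directives
```

Note the design consequence: unlike robots.txt, this check can only happen after the page has already been requested.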

What makes a crawler polite?

A polite crawler respects robots.txt

A polite crawler never degrades a website's performance

A polite crawler identifies its creator with contact information

A polite crawler does not drive system administrators crazy

Responsible crawling 1 – Terms of Use

Make sure your crawler follows the rules contained in the Terms of (Fair) Use

Do check the robots.txt file

(We know you have no time.) You may write a program which parses robots.txt and identifies particular user agents and the corresponding rules of access
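Such a program does not have to be written from scratch: Python's standard library ships `urllib.robotparser`. The sketch below feeds it a made-up robots.txt (the rules, the `BadBot` agent and the URLs are assumptions for illustration) so that it runs without network access; in real use one would call `set_url()` and `read()` on the site's actual file instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


ok_public = allowed("MyCompany-MyCrawler", "http://www.example.com/index.html")
ok_private = allowed("MyCompany-MyCrawler", "http://www.example.com/private/data.html")
ok_badbot = allowed("BadBot", "http://www.example.com/index.html")

# Delay (in seconds) requested for our crawler, or None if not specified.
delay = parser.crawl_delay("MyCompany-MyCrawler")
```

Checking `allowed(...)` before every request is cheap, so it can sit directly in the download loop.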

Responsible crawling 2 – Identify yourself

Include your company name and an email address or website in the request's User-Agent header

For example, Google's crawler user agent is "Googlebot"
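As a minimal sketch of what this looks like in code, the standard library's `urllib.request.Request` accepts custom headers. The identity string and the URL are placeholders, and the request is only built here, not sent.

```python
from urllib.request import Request

# Placeholder identity: replace with your own project name and contact address.
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"


def build_request(url):
    """Create a request that tells the server who is crawling it."""
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://www.example.com/profile.php?id=1580")

# urllib stores header names capitalized, hence "User-agent" for the lookup.
sent_user_agent = request.get_header("User-agent")
```

Passing `request` to `urllib.request.urlopen` would perform the download with this header attached.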

Responsible crawling 3 – Crawl delay

Make sure your crawler does not hit a website too hard

Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive

If you do not, they will make you:

  2012-12-21 05:04:36+0800 [working] DEBUG: Retrying <GET http://www.example.com/profile.php?id=1580> (failed 1 times): 503 Service Unavailable
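One simple way to enforce such a delay in a hand-written crawler is a small throttle object that is called before every request. This is a sketch, not a standard API: `Throttle` is a hypothetical class, and the injectable `clock` and `sleep` arguments exist only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive requests to one host."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demonstration with a fake clock, so nothing really sleeps.
fake_time = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


throttle = Throttle(5, clock=lambda: fake_time[0], sleep=fake_sleep)
throttle.wait()      # first request: no waiting needed
fake_time[0] += 2    # pretend the download took 2 seconds
throttle.wait()      # second request: must wait the remaining 3 seconds
```

In a real crawler the delay value would come from the site's Crawl-Delay directive when one is given, and `throttle.wait()` would be called with the default clock right before each download.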

Responsible crawling 4 – Take care of your data

You should pay attention to what happens with the data you gathered

Consider which parts of the analyses and/or data sets can be displayed publicly

Respect the sensitivity of the information

Good practices & ethics in a nutshell

Respect the hosting site's wishes

  robots.txt as a list of instructions to bots/scrapers

  Terms of (Fair) Use – which elements of the website you should not scrape

Respect the hosting site's bandwidth

  Scraping may be costly or result in the site going down

Respect the law

  Terms & agreements

Take care of the data

  Consider which information can be displayed publicly and which cannot

If possible, use APIs, API mirrors, or collections of data already scraped for you

Requests library / Beautiful Soup / Scrapy

Requests library

Python core libraries may be enough, but there are more appropriate packages

The Requests library is suitable for complicated HTTP requests, cookies, headers, and other issues

Beautiful what?

Beautiful Soup is a Python library

Named after Lewis Carroll's poem from Alice's Adventures in Wonderland

The tool is for making sense of the nonsensical

  Beautiful Soup, so rich and green,
  Waiting in a hot tureen!
  Who for such dainties would not stoop?
  Soup of the evening, beautiful Soup!

Installing Beautiful Soup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://pypi.python.org/pypi/setuptools

Think about setting up Python virtual environments

Unix-like (Mac and GNU/Linux)

  sudo easy_install pip

  pip install beautifulsoup4

Other environments

  python setup.py install

  pip install beautifulsoup4

Problems after installation

The authors of the library suggest that the known problems are usually related to Python 2 vs. Python 3 issues

Typical way of working

You work with a file containing Python code and import the necessary libraries

It requires some programming skills

What is Scrapy?

Scrapy is a framework, which means that a lot of issues related to scraping have already been solved for you, and you may use "automagical" tools to deal with them (with all the blessings and curses that brings)

Requires minimal knowledge and skills in Python and programming

Handles a lot of the complexity of finding & evaluating links on a website, scraping whole domains or lists of domains

Scrapy – installation

http://scrapy.org

http://doc.scrapy.org/en/latest/intro/install.html

http://pypi.python.org/pypi/setuptools

Think about setting up Python virtual environments (recommended)

Unix-like (Mac and GNU/Linux)

  sudo easy_install pip

  pip install Scrapy

Other environments

  Read the manual; using Anaconda is recommended

Scrapy – dependencies

If you encounter problems during the installation, you should probably check the dependencies, e.g.

  lxml – an efficient XML and HTML parser

  w3lib – a multi-purpose helper for dealing with URLs and web page encodings

  cryptography and pyOpenSSL – to deal with various network-level security needs

Scrapy – typical way of working

You work with projects in a new directory

Each Scrapy object stands for a single page on the website

Each scraper has a few core functions and structures

Crawlers are run from the command line

Troubleshooting usually happens within the configuration files (settings.py)

Scrapy – responsible scraping

  ROBOTSTXT_OBEY = True

  USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

  DOWNLOAD_DELAY

  CONCURRENT_REQUESTS_PER_DOMAIN

  2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
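Put together, the four settings named above might look as follows in a project's settings.py. The setting names are Scrapy's own; the concrete values for the delay and the concurrency limit are assumptions chosen for illustration and should be tuned to the target site.

```python
# settings.py fragment of a Scrapy project (values are illustrative assumptions)

# Download and respect each site's robots.txt before crawling it.
ROBOTSTXT_OBEY = True

# Identify the crawler and give a contact address.
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"

# Wait this many seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Never open more than one simultaneous connection to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

With `ROBOTSTXT_OBEY` enabled, requests to disallowed paths are dropped and reported with a "Forbidden by robots.txt" debug message like the one quoted on the slide.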

Homework

Install the libraries and frameworks on your hardware

Read the Introduction and Creating a project sections in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html

  • Motivation and definitions
  • Controversies
    • Legal issues and famous cases
    • Preventing bots
  • Good practices
  • Introduction to software
    • Requests library
    • Beautiful Soup
    • Scrapy
  • Homework
Page 3: Web scraping and social media scraping { introductioncoin.wne.uw.edu.pl/dcelinska/resources/webscraping/scraping_01.pdf · A crawler or spider is a bot which systematically browses

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Web scraping

The process of taking unstructured information from theWeb

and processing it into structured data for later analysis

Commonly used not only by researchers Web crawlerswebspiderssearch bots used by search engines

To scrape the web (=drag scraping) not to scrap (=discardscrapping)

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Bots crawlers spiders scrapers

A bot is a software application that runs automated usuallyrepetitive tasks (scripts) over the Internet

A crawler or spider is a bot which systematically browses theInternet typically for the purpose of Web indexing In otherwords crawler surfs the web following links ndash it does not haveto extract data

A scraper takes downloaded pages and attempts to extractdata from them

However there are no precise defitnions

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Example areas of usage

Download results of general elections

Download demo MP3

Download legal decisions

Download comments and usersrsquo activity from social media

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Alternatives to traditional scraping

ScraperWiki

Google Refine

Outwit

SocSciBot

(arguably) APIs already scraped collections of data availableonline (eg GHTorrent) or API mirrors

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Is it legal

Technically usually it is However there are situations inwhich you are not allowed to gather the data from thegiven website

Look for disclaimers and Terms of (Fair) Use

Scraping may mean copyright infringement

Data privacy sensitive information

web crawling ndash difficult matter

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Controversial cases

eBay vs Bidderrsquos Edge

One company crawls information from another

Facebook vs Meltwater

Prior written permissions needed

US vs Aaron Swartz

AS arrested in 2011 for having illegally downloaded millions ofarticles from the article archive JSTOR The case wasdismissed after Swartzrsquo suicide in January 2013

GHTorrent issue32 incident

httpspeakerdeckcomplayer1c64fd1e7dfe4032aff246b2dd1195bf

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Web scraping for social good or evil

People tend to be extremely paranoic about their data privacywhile being at the same time extremely careless about theirdata privacy

Same people who cover their cameras in laptops do notpotentially care about the photos they upload toFacebookother social media or the apps they are using

Celebgate httpenwikipediaorgwikiICloud_leaks_of_

celebrity_photos

AI gone too far httpwwwtelegraphcouktechnology20170908

ai-can-tell-people-gay-straight-one-photo-face

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Counteracting scraping ndash robotstxt

Many web crawlers are hunting for content

Web robots keep some indices actual

robotstxt file and other methods are used for keep theserver traffic in check

User-agent

Disallow private

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

robotstxt ndash limitations

Considered outdated standard due to its limitations

The subsite can be still indexed if another site redirects to it

Even if the subsite is not listed in robotstxt it does notmean the owner allows to scrape it

Imagine a situation between the host and the guest Ok feelat home but do not take the money I put in the jar on thethird shelf in the kitchen cupboard The guest should obeybut will they The same is with the robotstxt whichdiscloses the information that should stay hidden eg thelinks to adminrsquos login sites

Read httpsupportgooglecomwebmastersanswer6062608

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Counteracting scraping ndash ltmetagt tags

A modern approach to blocking bots from accessing the partof the website is using of ltmetagt tags

ltmeta name=robots content=noindexgt

Read httpssupportgooglecomwebmastersanswer93710

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

What makes a crawler polite

A polite crawler respects robotstxt

A polite crawler never degrades a websitersquos performance

A polite crawler identifies its creator with contact information

A polite crawler does not drive system administrators crazy

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 1 ndash Terms of Use

Make sure your crawler follows the rules contained in Terms of(Fair) Use

Do check robotstxt file

(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 2 ndash Identify yourself

Include your company name and an email address or websitein the requestrsquos User-Agent header

For example Googlersquos crawler user agent is ldquoGooglebotrdquo

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 3 ndash Crawl delay

Make sure your crawler does not hit a website too hard

Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive

If you do not they will make you do

2012-12-21 050436+0800 [working] DEBUG Retrying

httpwwwexamplecomprofilephpid=1580gt (failed 1

times) 503 Service Unavailable

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 4 ndash Take care of your data

You should pay attention to what happens with the data yougathered

Consider which part of the analyses andor data sets can bedisplayed publicly

Respect the sensitivity of the information

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Good practices amp ethics in a nutshell

Respect the hosting sitersquos wishes

robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap

Respect the hosting sitersquos bandwidth

Scraping may be costly or result in the site going down

Respect the law

Terms amp agreements

Take care of the data

Consider which information can be displayed publicly or not

If possible use APIs APIs mirrors or collections of dataalready scraped for you

Requests library

Python core libraries may be enough, but there are more appropriate packages

The Requests library is suitable for complicated HTTP requests, cookies, headers, and other issues
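As a rough illustration of what this looks like in practice, the sketch below builds a request with query parameters, a custom header and a cookie using Requests. It assumes the third-party requests package is installed; the URL, parameter names, cookie and user-agent string are all invented, and the request is only prepared, not sent.

```python
import requests

# All values below are placeholders chosen for illustration.
request = requests.Request(
    "GET",
    "http://www.example.com/search",
    params={"q": "web scraping", "page": 2},
    headers={"User-Agent": "MyCompany-MyCrawler (bot@mycompany.com)"},
    cookies={"consent": "yes"},
)

# prepare() encodes the parameters and assembles the final headers
# without touching the network.
prepared = request.prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])

# To actually send it one could use a session, for example:
# response = requests.Session().send(prepared, timeout=10)
```

For simple cases the shortcut requests.get(url, params=..., headers=..., cookies=..., timeout=...) does the same in one call.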

Beautiful what?

Beautiful Soup is a Python library

Named after Lewis Carroll's poem from Alice's Adventures in Wonderland

The tool is for making sense of the nonsensical

Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!

Installing Beautiful Soup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://pypi.python.org/pypi/setuptools

Consider setting up Python virtual environments

Unix-like (Mac and GNU/Linux)

sudo easy_install pip

pip install beautifulsoup4

Other environments

python setup.py install

pip install beautifulsoup4

Problems after installation

The authors of the library suggest that the known problems are usually related to Python 2 vs Python 3 issues

Typical way of working

You work with a file containing Python code and import the necessary libraries

It requires some programming skills
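Such a file might look like the sketch below. It assumes the beautifulsoup4 package is installed and uses Python's built-in html.parser; the HTML snippet, the table id and the party names are invented, standing in for a downloaded page with election results.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a downloaded page.
HTML = """
<html><head><title>Election results</title></head>
<body>
<table id="results">
<tr><th>Party</th><th>Votes</th></tr>
<tr><td>Party A</td><td>1200</td></tr>
<tr><td>Party B</td><td>950</td></tr>
</table>
</body></html>
"""


def extract_results(html):
    """Turn the rows of the results table into a list of dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="results")
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        party, votes = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append({"party": party, "votes": int(votes)})
    return rows


results = extract_results(HTML)
print(results)
```

In a real script the HTML string would come from a downloaded response rather than from a constant in the file.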

What is Scrapy?

Scrapy is a framework, which means a lot of issues related to scraping have already been solved for you

and you may use "automagical" tools to deal with them (with all the blessings and curses this brings)

Requires minimal knowledge and skills in Python and programming

Handles a lot of the complexity of finding & evaluating links on a website, scraping whole domains or lists of domains

Scrapy – installation

http://scrapy.org

http://doc.scrapy.org/en/latest/intro/install.html

http://pypi.python.org/pypi/setuptools

Consider setting up Python virtual environments (recommended)

Unix-like (Mac and GNU/Linux)

sudo easy_install pip

pip install Scrapy

Other environments

Read the manual; using Anaconda is recommended

Scrapy – dependencies

If you encounter problems during the installation, you should probably check the dependencies, e.g.

lxml – an efficient XML and HTML parser

w3lib – a multi-purpose helper for dealing with URLs and web page encodings

cryptography and pyOpenSSL – to deal with various network-level security needs

Scrapy – typical way of working

You work with projects, each in a new directory

Each Scrapy object stands for a single page on the website

Each scraper has a few core functions and structures

Crawlers are run from the command line

Troubleshooting is usually done within configuration files (settings.py)

Scrapy – responsible scraping

ROBOTSTXT_OBEY = True

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

DOWNLOAD_DELAY

CONCURRENT_REQUESTS_PER_DOMAIN

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
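Put together, the settings named above might appear in a project's settings.py roughly as follows. The setting names are Scrapy's own; the concrete values for the delay and the per-domain concurrency are assumptions chosen to be conservative, and the contact details are placeholders.

```python
# settings.py (fragment) for a hypothetical Scrapy project

# Check robots.txt before each request and skip disallowed URLs.
ROBOTSTXT_OBEY = True

# Identify the crawler and give a way to contact its operator (placeholder values).
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"

# Wait this many seconds between requests to the same site (assumed value).
DOWNLOAD_DELAY = 5.0

# Do not open several connections to one domain at once (assumed value).
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

With ROBOTSTXT_OBEY enabled, requests to disallowed pages are dropped and logged with a "Forbidden by robots.txt" message like the one above.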

Homework

Install the libraries and frameworks on your computer

Read the Introduction and Creating a project sections in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html

  • Motivation and definitions
  • Controversies
    • Legal issues and famous cases
    • Preventing bots
  • Good practices
  • Introduction to software
    • Requests library
    • Beautiful Soup
    • Scrapy
  • Homework
Page 6: Web scraping and social media scraping { introductioncoin.wne.uw.edu.pl/dcelinska/resources/webscraping/scraping_01.pdf · A crawler or spider is a bot which systematically browses

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Alternatives to traditional scraping

ScraperWiki

Google Refine

Outwit

SocSciBot

(arguably) APIs already scraped collections of data availableonline (eg GHTorrent) or API mirrors

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Is it legal

Technically usually it is However there are situations inwhich you are not allowed to gather the data from thegiven website

Look for disclaimers and Terms of (Fair) Use

Scraping may mean copyright infringement

Data privacy sensitive information

web crawling ndash difficult matter

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Controversial cases

eBay vs Bidderrsquos Edge

One company crawls information from another

Facebook vs Meltwater

Prior written permissions needed

US vs Aaron Swartz

AS arrested in 2011 for having illegally downloaded millions ofarticles from the article archive JSTOR The case wasdismissed after Swartzrsquo suicide in January 2013

GHTorrent issue32 incident

httpspeakerdeckcomplayer1c64fd1e7dfe4032aff246b2dd1195bf

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Web scraping for social good or evil

People tend to be extremely paranoic about their data privacywhile being at the same time extremely careless about theirdata privacy

Same people who cover their cameras in laptops do notpotentially care about the photos they upload toFacebookother social media or the apps they are using

Celebgate httpenwikipediaorgwikiICloud_leaks_of_

celebrity_photos

AI gone too far httpwwwtelegraphcouktechnology20170908

ai-can-tell-people-gay-straight-one-photo-face

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Counteracting scraping ndash robotstxt

Many web crawlers are hunting for content

Web robots keep some indices actual

robotstxt file and other methods are used for keep theserver traffic in check

User-agent

Disallow private

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

robotstxt ndash limitations

Considered outdated standard due to its limitations

The subsite can be still indexed if another site redirects to it

Even if the subsite is not listed in robotstxt it does notmean the owner allows to scrape it

Imagine a situation between the host and the guest Ok feelat home but do not take the money I put in the jar on thethird shelf in the kitchen cupboard The guest should obeybut will they The same is with the robotstxt whichdiscloses the information that should stay hidden eg thelinks to adminrsquos login sites

Read httpsupportgooglecomwebmastersanswer6062608

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Counteracting scraping ndash ltmetagt tags

A modern approach to blocking bots from accessing the partof the website is using of ltmetagt tags

ltmeta name=robots content=noindexgt

Read httpssupportgooglecomwebmastersanswer93710

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

What makes a crawler polite

A polite crawler respects robotstxt

A polite crawler never degrades a websitersquos performance

A polite crawler identifies its creator with contact information

A polite crawler does not drive system administrators crazy

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 1 ndash Terms of Use

Make sure your crawler follows the rules contained in Terms of(Fair) Use

Do check robotstxt file

(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 2 ndash Identify yourself

Include your company name and an email address or websitein the requestrsquos User-Agent header

For example Googlersquos crawler user agent is ldquoGooglebotrdquo

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 3 ndash Crawl delay

Make sure your crawler does not hit a website too hard

Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive

If you do not they will make you do

2012-12-21 050436+0800 [working] DEBUG Retrying

httpwwwexamplecomprofilephpid=1580gt (failed 1

times) 503 Service Unavailable

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 4 ndash Take care of your data

You should pay attention to what happens with the data yougathered

Consider which part of the analyses andor data sets can bedisplayed publicly

Respect the sensitivity of the information

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Good practices & ethics in a nutshell

Respect the hosting site's wishes

robots.txt as a list of instructions to bots/scrapers

Terms of (Fair) Use – which elements of the website you should not scrape

Respect the hosting site's bandwidth

Scraping may be costly or result in the site going down

Respect the law

Terms & agreements

Take care of the data

Consider which information can be displayed publicly and which cannot

If possible, use APIs, API mirrors, or collections of data already scraped for you

Requests library

Python core libraries may be enough, but there are more appropriate packages

The Requests library is suitable for complicated HTTP requests, cookies, headers, and other issues
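
A short sketch of how Requests handles headers, cookies and query parameters in one place. It assumes the third-party requests package is installed; the URL, crawler name and cookie value are placeholders. The request is only prepared, not sent, so the example runs without network access.

```python
import requests

# Placeholder values: the crawler name, URL and cookie are made up.
session = requests.Session()
session.headers.update({"User-Agent": "MyCompany-MyCrawler (bot@mycompany.com)"})

request = requests.Request(
    "GET",
    "http://www.example.com/search",
    params={"q": "scraping", "page": 1},
    cookies={"session_id": "abc123"},
)

# prepare_request merges session-level headers and cookies into the request.
prepared = session.prepare_request(request)

print(prepared.url)
print(prepared.headers["User-Agent"])

# Sending it would be: response = session.send(prepared, timeout=10)
```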

Beautiful what?

Beautiful Soup is a Python library

Its name comes from Lewis Carroll's poem in Alice's Adventures in Wonderland

The tool is for making sense of the nonsensical

Beautiful Soup, so rich and green,
Waiting in a hot tureen!

Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!

Installing Beautiful Soup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://pypi.python.org/pypi/setuptools

Think about setting up Python virtual environments

Unix-like (Mac and GNU/Linux):

sudo easy_install pip

pip install beautifulsoup4

Other environments:

python setup.py install

pip install beautifulsoup4

Problems after installation

The authors of the library suggest that the most common known problems are related to Python 2 vs. Python 3 issues

Typical way of working

You work with a file containing Python code and import the necessary libraries

It requires some skills in programming
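
A minimal sketch of such a file, assuming the beautifulsoup4 package is installed. The HTML page and the helper name extract_links are invented for illustration; Python's built-in "html.parser" backend is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# A small made-up HTML page, so the example runs without network access.
HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Articles</h1>
    <a href="/articles/1">First article</a>
    <a href="/articles/2">Second article</a>
  </body>
</html>
"""


def extract_links(html):
    """Return (text, href) pairs for every link in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


soup = BeautifulSoup(HTML, "html.parser")
print(soup.title.string)
print(extract_links(HTML))
```

In a real scraper the HTML string would come from a downloaded page rather than from a constant.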

What is Scrapy?

Scrapy is a framework, which means that a lot of issues related to scraping have already been solved for you

and you may use "automagical" tools to deal with them (with all the blessings and curses that brings)

Requires minimal knowledge of and skills in Python and programming

Handles much of the complexity of finding & evaluating links on a website and of scraping whole domains or lists of domains

Scrapy – installation

http://scrapy.org

http://doc.scrapy.org/en/latest/intro/install.html

http://pypi.python.org/pypi/setuptools

Think about setting up Python virtual environments (recommended)

Unix-like (Mac and GNU/Linux):

sudo easy_install pip

pip install Scrapy

Other environments:

Read the manual; using Anaconda is recommended

Scrapy – dependencies

If you encounter problems during the installation, you should probably check the dependencies, e.g.:

lxml – an efficient XML and HTML parser

w3lib – a multi-purpose helper for dealing with URLs and web page encodings

cryptography and pyOpenSSL – to deal with various network-level security needs

Scrapy – typical way of working

You work with projects, each in a new directory

Each Scrapy object stands for a single page on the website

Each scraper has a few core functions and structures

Crawlers are run from the command line

Troubleshooting usually takes place within configuration files (settings.py)

Scrapy – responsible scraping

ROBOTSTXT_OBEY = True

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

DOWNLOAD_DELAY

CONCURRENT_REQUESTS_PER_DOMAIN

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
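
Put together, these settings might look as follows in a project's settings.py. The two numeric values are illustrative assumptions only; suitable values depend on the website being crawled.

```python
# settings.py (fragment); the numeric values are illustrative assumptions.

# Skip any request that the site's robots.txt disallows.
ROBOTSTXT_OBEY = True

# Identify the crawler and give a contact address (placeholder identity).
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"

# Wait this many seconds between consecutive requests to the same website.
DOWNLOAD_DELAY = 2

# Allow at most this many simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

With ROBOTSTXT_OBEY enabled, a disallowed request is skipped and reported in the log with a "Forbidden by robots.txt" message.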

Homework

Install the libraries and frameworks on your hardware

Read the Introduction and Creating a project sections in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html

Is it legal?

Technically, it usually is. However, there are situations in which you are not allowed to gather data from a given website

Look for disclaimers and Terms of (Fair) Use

Scraping may mean copyright infringement

Data privacy, sensitive information

Web crawling – a difficult matter

Controversial cases

eBay vs. Bidder's Edge

One company crawls information from another

Facebook vs. Meltwater

Prior written permission needed

U.S. vs. Aaron Swartz

Swartz was arrested in 2011 for having illegally downloaded millions of articles from the article archive JSTOR. The case was dismissed after Swartz's suicide in January 2013

GHTorrent issue 32 incident

http://speakerdeck.com/player/1c64fd1e7dfe4032aff246b2dd1195bf

Web scraping for social good or evil?

People tend to be extremely paranoid about their data privacy while at the same time being extremely careless about it

The same people who cover their laptop cameras may not care about the photos they upload to Facebook or other social media, or about the apps they are using

Celebgate: http://en.wikipedia.org/wiki/ICloud_leaks_of_celebrity_photos

AI gone too far: http://www.telegraph.co.uk/technology/2017/09/08/ai-can-tell-people-gay-straight-one-photo-face/

Counteracting scraping – robots.txt

Many web crawlers are hunting for content

Web robots keep some indices up to date

The robots.txt file and other methods are used to keep the server traffic in check

User-agent: *
Disallow: /private/

robots.txt – limitations

Considered an outdated standard due to its limitations

A subsite can still be indexed if another site redirects to it

Even if a subsite is not listed in robots.txt, it does not mean the owner allows you to scrape it

Imagine a situation between a host and a guest: "OK, feel at home, but do not take the money I put in the jar on the third shelf in the kitchen cupboard." The guest should obey, but will they? The same goes for robots.txt, which discloses information that should stay hidden, e.g. the links to the admin's login pages

Read: http://support.google.com/webmasters/answer/6062608

Counteracting scraping – <meta> tags

A modern approach to blocking bots from accessing part of a website is the use of <meta> tags

<meta name="robots" content="noindex">

Read: https://support.google.com/webmasters/answer/93710
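
From the crawler's side, respecting such a tag means checking for it before storing or indexing a page. A minimal sketch using the standard html.parser module follows; the class name RobotsMetaParser, the function name may_index and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def may_index(html):
    """Return False when the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(may_index(page))
```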

Page 9: Web scraping and social media scraping { introductioncoin.wne.uw.edu.pl/dcelinska/resources/webscraping/scraping_01.pdf · A crawler or spider is a bot which systematically browses

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Web scraping for social good or evil

People tend to be extremely paranoic about their data privacywhile being at the same time extremely careless about theirdata privacy

Same people who cover their cameras in laptops do notpotentially care about the photos they upload toFacebookother social media or the apps they are using

Celebgate httpenwikipediaorgwikiICloud_leaks_of_

celebrity_photos

AI gone too far httpwwwtelegraphcouktechnology20170908

ai-can-tell-people-gay-straight-one-photo-face

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Counteracting scraping ndash robotstxt

Many web crawlers are hunting for content

Web robots keep some indices actual

robotstxt file and other methods are used for keep theserver traffic in check

User-agent

Disallow private

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

robotstxt ndash limitations

Considered outdated standard due to its limitations

The subsite can be still indexed if another site redirects to it

Even if the subsite is not listed in robotstxt it does notmean the owner allows to scrape it

Imagine a situation between the host and the guest Ok feelat home but do not take the money I put in the jar on thethird shelf in the kitchen cupboard The guest should obeybut will they The same is with the robotstxt whichdiscloses the information that should stay hidden eg thelinks to adminrsquos login sites

Read httpsupportgooglecomwebmastersanswer6062608

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Legal issues and famous casesPreventing bots

Counteracting scraping ndash ltmetagt tags

A modern approach to blocking bots from accessing the partof the website is using of ltmetagt tags

ltmeta name=robots content=noindexgt

Read httpssupportgooglecomwebmastersanswer93710

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

What makes a crawler polite

A polite crawler respects robotstxt

A polite crawler never degrades a websitersquos performance

A polite crawler identifies its creator with contact information

A polite crawler does not drive system administrators crazy

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 1 ndash Terms of Use

Make sure your crawler follows the rules contained in Terms of(Fair) Use

Do check robotstxt file

(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 2 ndash Identify yourself

Include your company name and an email address or websitein the requestrsquos User-Agent header

For example Googlersquos crawler user agent is ldquoGooglebotrdquo

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 3 ndash Crawl delay

Make sure your crawler does not hit a website too hard

Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive

If you do not they will make you do

2012-12-21 050436+0800 [working] DEBUG Retrying

httpwwwexamplecomprofilephpid=1580gt (failed 1

times) 503 Service Unavailable

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 4 ndash Take care of your data

You should pay attention to what happens with the data yougathered

Consider which part of the analyses andor data sets can bedisplayed publicly

Respect the sensitivity of the information

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Good practices amp ethics in a nutshell

Respect the hosting sitersquos wishes

robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap

Respect the hosting sitersquos bandwidth

Scraping may be costly or result in the site going down

Respect the law

Terms amp agreements

Take care of the data

Consider which information can be displayed publicly or not

If possible use APIs APIs mirrors or collections of dataalready scraped for you

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Requests library

Python core libraries may be enough but there are moreappropriate packages

Requests library is suitable for complicated HTTP requestscookies headers and other issues

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Beautiful what

Beautiful Soup is Python library

Counteracting scraping – robots.txt

Many web crawlers are hunting for content

Web robots keep indices (such as search engine indices) up to date

The robots.txt file and other methods are used to keep server traffic in check

User-agent: *

Disallow: /private/
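
A minimal sketch of how a crawler might evaluate such rules, using Python's standard urllib.robotparser module. It assumes the rules on the slide read "User-agent: *" and "Disallow: /private/"; the helper is_allowed, the crawler name and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: every bot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical crawler name and URLs, for illustration only.
print(is_allowed(ROBOTS_TXT, "MyCrawler", "http://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "MyCrawler", "http://example.com/private/data.html"))
```

In a real crawler the rules would be downloaded from the site (RobotFileParser also offers set_url() and read()) rather than hard-coded as here.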

robots.txt – limitations

Considered an outdated standard because of its limitations

A subpage can still be indexed if another site links or redirects to it

Even if a subpage is not listed in robots.txt, that does not mean the owner allows you to scrape it

Imagine a host telling a guest: "OK, feel at home, but do not take the money I put in the jar on the third shelf in the kitchen cupboard." The guest should obey, but will they? The same goes for robots.txt, which discloses information that should stay hidden, e.g. links to the admin's login pages

Read: http://support.google.com/webmasters/answer/6062608

Counteracting scraping – <meta> tags

A more modern approach to keeping bots from indexing part of a website is to use <meta> tags

<meta name="robots" content="noindex">

Read: https://support.google.com/webmasters/answer/93710
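
A scraper can check for this tag before keeping a page. The sketch below uses only Python's standard html.parser module. The names RobotsMetaParser, page_asks_noindex and SAMPLE are hypothetical, and treating the content attribute as a comma-separated list is an assumption about typical usage.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Split "noindex, nofollow" into separate lower-case directives.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def page_asks_noindex(html):
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page, for illustration only.
SAMPLE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(page_asks_noindex(SAMPLE))
```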

What makes a crawler polite?

A polite crawler respects robots.txt

A polite crawler never degrades a website's performance

A polite crawler identifies its creator with contact information

A polite crawler does not drive system administrators crazy

Responsible crawling 1 – Terms of Use

Make sure your crawler follows the rules contained in the Terms of (Fair) Use

Do check the robots.txt file

(We know you have no time.) You may write a program that parses robots.txt and identifies particular user agents and the access rules that apply to them
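
Such a program could look roughly like the sketch below, which groups the rules of a robots.txt file by user agent. The function rules_by_user_agent and the sample file are hypothetical, and only the Allow, Disallow and Crawl-delay fields are handled. This is a simplification, not a complete implementation of the robots.txt conventions.

```python
GROUP_FIELDS = ("allow", "disallow", "crawl-delay")


def rules_by_user_agent(robots_txt):
    """Map each user agent named in robots_txt to its (field, value) rules."""
    rules = {}
    current_agents = []
    collecting_agents = False
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not collecting_agents:
                current_agents = []  # a new group of rules starts here
                collecting_agents = True
            current_agents.append(value)
            rules.setdefault(value, [])
        elif field in GROUP_FIELDS:
            collecting_agents = False
            for agent in current_agents:
                rules[agent].append((field, value))
    return rules


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Allow: /
Crawl-delay: 5
"""

for agent, agent_rules in rules_by_user_agent(SAMPLE).items():
    print(agent, agent_rules)
```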

Responsible crawling 2 – Identify yourself

Include your company name and an email address or website in the request's User-Agent header

For example, Google's crawler user agent is "Googlebot"
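
As an illustration, the header can be set with Python's standard urllib.request module. The crawler name, contact address and URLs below are hypothetical placeholders, and the request is only built, not sent.

```python
import urllib.request

# Hypothetical crawler identity; replace with your own name and contact details.
USER_AGENT = "MyCompany-MyCrawler (+http://www.mycompany.example/bot; bot@mycompany.example)"


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request that carries an identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("http://www.example.com/")
# urllib stores header names capitalised as "User-agent".
print(request.get_header("User-agent"))
# To actually send it: urllib.request.urlopen(request, timeout=10)
```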

Responsible crawling 3 – Crawl delay

Make sure your crawler does not hit a website too hard

Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive

If you do not, they will make you

2012-12-21 05:04:36+0800 [working] DEBUG: Retrying <GET http://www.example.com/profile.php?id=1580> (failed 1 times): 503 Service Unavailable
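
One way to respect the directive is sketched below: read Crawl-delay with the standard urllib.robotparser module and pause between requests. The names delay_for and Throttle, the one-second fallback and the sample robots.txt are assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 1.0  # seconds; an assumed fallback, not a value from the slides


def delay_for(robots_txt, user_agent, default=DEFAULT_DELAY):
    """Return the Crawl-delay robots_txt sets for user_agent, else default."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class Throttle:
    """Sleep as needed so that calls to wait() are at least `delay` seconds apart."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
"""
throttle = Throttle(delay_for(ROBOTS_TXT, "MyCrawler"))
# Call throttle.wait() before each request to the same site.
```

The clock and sleep functions are injectable so that the pause logic can be checked without actually waiting.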

Responsible crawling 4 – Take care of your data

Pay attention to what happens to the data you have gathered

Consider which parts of the analyses and/or data sets can be displayed publicly

Respect the sensitivity of the information

Good practices & ethics in a nutshell

Respect the hosting site's wishes

robots.txt as a list of instructions to bots/scrapers

Terms of (Fair) Use – which elements of the website you should not scrape

Respect the hosting site's bandwidth

Scraping may be costly for the site or even result in it going down

Respect the law

Terms & agreements

Take care of the data

Consider which information can and cannot be displayed publicly

If possible, use APIs, API mirrors, or collections of data already scraped for you

Requests library

Python's core libraries may be enough, but there are more suitable packages

The Requests library is well suited to complicated HTTP requests, cookies, headers, and similar issues
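
To give a feel for the "core libraries" route, here is a minimal sketch that handles cookies and headers with only the standard urllib.request and http.cookiejar modules. The helper build_session and the crawler identity are hypothetical. Requests offers comparable functionality through a more convenient interface, which is not shown here.

```python
import http.cookiejar
import urllib.request


def build_session(user_agent):
    """Return (opener, cookie_jar); the opener keeps cookies between requests."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replace urllib's default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar


# Hypothetical crawler identity, for illustration only.
opener, cookie_jar = build_session("MyCompany-MyCrawler (bot@mycompany.example)")
# opener.open("http://www.example.com/", timeout=10) would send the request;
# cookies set by the server would then be stored in cookie_jar.
print(len(cookie_jar))
```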

Beautiful what?

Beautiful Soup is a Python library

The name comes from Lewis Carroll's poem in Alice's Adventures in Wonderland

The tool is for making sense of the nonsensical

Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!

Installing Beautiful Soup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://pypi.python.org/pypi/setuptools

Consider setting up a Python virtual environment

Unix-like (Mac and GNU/Linux)

sudo easy_install pip

pip install beautifulsoup4

Other environments

python setup.py install

pip install beautifulsoup4

Problems after installation

The library's authors suggest that most known problems are related to Python 2 vs Python 3 issues
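
When diagnosing such problems, it can help to confirm which interpreter is actually running and whether it can see the package. The sketch below uses only the standard library. The function describe_environment is hypothetical; bs4 is the import name of Beautiful Soup 4.

```python
import importlib.util
import sys


def describe_environment(module_name="bs4"):
    """Report which interpreter is running and whether it can import module_name."""
    return {
        "python_version": "%d.%d.%d" % sys.version_info[:3],
        "executable": sys.executable,
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


for key, value in describe_environment().items():
    print(key, "=", value)
```

If module_found is False although pip reported a successful installation, the package was probably installed for a different Python interpreter than the one running the script.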

Typical way of working

You work with a file containing Python code and import the necessary libraries

It requires some programming skills

What is Scrapy?

Scrapy is a framework, which means that many issues related to scraping have already been solved for you

and you may use "automagical" tools to deal with them (with all the blessings and curses that brings)

Requires minimal knowledge of and skills in Python and programming

Handles much of the complexity of finding and evaluating links on a website, and of scraping whole domains or lists of domains

Scrapy – installation

http://scrapy.org

http://doc.scrapy.org/en/latest/intro/install.html

http://pypi.python.org/pypi/setuptools

Consider setting up a Python virtual environment (recommended)

Unix-like (Mac and GNU/Linux)

sudo easy_install pip

pip install Scrapy

Other environments

Read the manual; using Anaconda is recommended

Scrapy – dependencies

If you encounter problems during installation, you should probably check the dependencies, e.g.:

lxml – an efficient XML and HTML parser

w3lib – a multi-purpose helper for dealing with URLs and web page encodings

cryptography and pyOpenSSL – to deal with various network-level security needs
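
A quick way to see which of these dependencies are present is to query the installed distributions. The sketch below assumes Python 3.8 or newer (for importlib.metadata); the function installed_versions is hypothetical.

```python
from importlib import metadata

# Distribution names as published on PyPI for the dependencies listed above.
DEPENDENCIES = ("lxml", "w3lib", "cryptography", "pyOpenSSL")


def installed_versions(names=DEPENDENCIES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(name, version or "not installed")
```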

Scrapy – typical way of working

You work with a project in its own new directory

Each Scrapy object stands for a single page on the website

Each scraper has a few core functions and structures

Crawlers are run from the command line

Troubleshooting is usually done within configuration files (settings.py)

Scrapy – responsible scraping

ROBOTSTXT_OBEY = True

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

DOWNLOAD_DELAY

CONCURRENT_REQUESTS_PER_DOMAIN

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
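
Put together, a polite settings.py fragment might look like the sketch below. The four setting names are the Scrapy settings named on the slide. The numeric values are assumptions chosen for illustration and should be tuned per site.

```python
# settings.py (fragment): a sketch of a polite Scrapy configuration.

# Honour robots.txt rules before requesting a page.
ROBOTSTXT_OBEY = True

# Identify the crawler and give a contact address (hypothetical identity).
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"

# Assumed values, not taken from the slides: tune them per site.
DOWNLOAD_DELAY = 2.0                # seconds to wait between requests to a site
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # parallel requests allowed per domain
```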

Homework

Install the libraries and frameworks on your own computer

Read the "Introduction" and "Creating a project" sections of the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html

  • Motivation and definitions
  • Controversies
    • Legal issues and famous cases
    • Preventing bots
  • Good practices
  • Introduction to software
    • Request library
    • Beautiful Soup
    • Scrapy
  • Homework
Page 13: Web scraping and social media scraping { introductioncoin.wne.uw.edu.pl/dcelinska/resources/webscraping/scraping_01.pdf · A crawler or spider is a bot which systematically browses

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

What makes a crawler polite

A polite crawler respects robotstxt

A polite crawler never degrades a websitersquos performance

A polite crawler identifies its creator with contact information

A polite crawler does not drive system administrators crazy

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 1 ndash Terms of Use

Make sure your crawler follows the rules contained in Terms of(Fair) Use

Do check robotstxt file

(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 2 ndash Identify yourself

Include your company name and an email address or websitein the requestrsquos User-Agent header

For example Googlersquos crawler user agent is ldquoGooglebotrdquo

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 3 ndash Crawl delay

Make sure your crawler does not hit a website too hard

Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive

If you do not they will make you do

2012-12-21 050436+0800 [working] DEBUG Retrying

httpwwwexamplecomprofilephpid=1580gt (failed 1

times) 503 Service Unavailable

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Responsible crawling 4 ndash Take care of your data

You should pay attention to what happens with the data yougathered

Consider which part of the analyses andor data sets can bedisplayed publicly

Respect the sensitivity of the information

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Good practices amp ethics in a nutshell

Respect the hosting sitersquos wishes

robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap

Respect the hosting sitersquos bandwidth

Scraping may be costly or result in the site going down

Respect the law

Terms amp agreements

Take care of the data

Consider which information can be displayed publicly or not

If possible use APIs APIs mirrors or collections of dataalready scraped for you

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Requests library

Python core libraries may be enough but there are moreappropriate packages

Requests library is suitable for complicated HTTP requestscookies headers and other issues

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Beautiful what

Beautiful Soup is Python library

Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland

The tool is for making sense of nonsensical

Beautiful Soup so rich and greenWaiting in a hot tureen

Who for such dainties would not stoopSoup of the evening beautiful Soup

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Installing Beautiful Soup

httpwwwcrummycomsoftwareBeautifulSoupbs4doc

httppypipythonorgpypisetuptools

Think about setting Python virtual environments

Unixlike (Mac and GNULinux)

sudo easy install pip

pip install beautifulsoup4

Other environments

python setuppy install

pip install beautifulsoup4

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Problems after installation

Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Typical way of working

You work with file containing python code and import thenecessary libraries

It requires some skills in programming

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

What is Scrapy

Scrapy is a framework which means a lot of issues related toscraping have already been solved for you

and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)

Requires minimal knowledge and skills in Python andprogramming

Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Scrapy ndash installation

httpscrapyorg

httpdocscrapyorgenlatestintroinstallhtml

httppypipythonorgpypisetuptools

Think about setting Python virtual environments(recommended)

Unixlike (Mac and GNULinux)

sudo easy install pip

pip install Scrapy

Other environments

Read the manual recommended use of anaconda

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Scrapy ndash dependencies

If you encounter problems during the installation probablyyou should check for the dependencies eg

lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs


Scrapy – typical way of working

You work with projects in a new directory

Each Scrapy object stands for a single page on the website

Each scraper has a few core functions and structures

Crawlers are run from the command line

Troubleshooting is usually done within configuration files (settings.py)


Scrapy – responsible scraping

ROBOTSTXT_OBEY = True

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

DOWNLOAD_DELAY

CONCURRENT_REQUESTS_PER_DOMAIN

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
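The settings named on this slide could be collected in a project's settings.py roughly as follows. This is a sketch, not the authors' configuration: the numeric values and the AutoThrottle line are assumptions chosen for illustration, and the user agent string is a placeholder.

```python
# settings.py (sketch): polite-crawling options for a Scrapy project.

# Obey the rules published in each site's robots.txt.
ROBOTSTXT_OBEY = True

# Identify the crawler and give a contact address (placeholder values).
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"

# Seconds to wait between consecutive requests to the same site (assumed value).
DOWNLOAD_DELAY = 5.0

# Allow only one request at a time per domain (assumed value).
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Optional assumption: let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
```

With `ROBOTSTXT_OBEY` switched on, requests for disallowed pages are dropped and logged as in the "Forbidden by robots.txt" message above.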


Homework

Install the libraries and frameworks on your hardware

Read the Introduction and Creating a project sections in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html


Responsible crawling 1 – Terms of Use

Make sure your crawler follows the rules contained in the Terms of (Fair) Use

Do check the robots.txt file

(We know you have no time.) You may write a program which parses robots.txt and identifies particular user agents and the corresponding rules of access
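Such a program does not have to be written from scratch: Python's standard library ships `urllib.robotparser`. The sketch below parses a made-up robots.txt (inlined so it runs without network access) and asks what two hypothetical user agents, `MyCrawler` and `BadBot`, may fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Crawl-delay: 5

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text and return a parser that can answer access queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A generic crawler may fetch ordinary pages but not the login page.
print(parser.can_fetch("MyCrawler", "http://www.example.com/profile.php"))  # True
print(parser.can_fetch("MyCrawler", "http://www.example.com/login"))        # False

# The agent singled out by name is banned from the whole site.
print(parser.can_fetch("BadBot", "http://www.example.com/profile.php"))     # False

# The requested pause between requests, in seconds.
print(parser.crawl_delay("MyCrawler"))                                      # 5
```

For a live site, `set_url()` followed by `read()` downloads the file instead of parsing an inline string.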


Responsible crawling 2 – Identify yourself

Include your company name and an email address or website in the request’s User-Agent header

For example, Google’s crawler user agent is “Googlebot”
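Setting the header takes one line in most HTTP clients. The sketch below uses the standard library's `urllib.request` and only builds the request without sending it; the crawler name, contact address, and the helper `build_request` are placeholders, not values to copy.

```python
from urllib.request import Request

# Placeholder identity: crawler name plus a contact address.
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"


def build_request(url):
    """Create a request that announces who is crawling and how to reach them."""
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://www.example.com/")

# urllib stores header names capitalized, so the lookup key is "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` would send it with this header instead of the library's default one.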


Responsible crawling 3 – Crawl delay

Make sure your crawler does not hit a website too hard

Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive

If you do not, the site will make you:

2012-12-21 05:04:36+0800 [working] DEBUG: Retrying <GET http://www.example.com/profile.php?id=1580> (failed 1 times): 503 Service Unavailable
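In a hand-written crawler the delay is simply a pause between consecutive downloads. The sketch below is one way to arrange it, under stated assumptions: `fetch_politely` is a hypothetical helper, the download function is passed in (here a stand-in that touches no network), and the delay value would normally come from the site's Crawl-Delay directive.

```python
import time


def fetch_politely(urls, fetch, delay_seconds, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, one before every later request.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


pages = fetch_politely(
    ["http://www.example.com/a", "http://www.example.com/b"],
    fetch=lambda url: "<html>" + url + "</html>",  # stand-in for a real download
    delay_seconds=0.1,
)
print(len(pages))
```

Taking `sleep` as a parameter keeps the pause easy to replace in tests, so the crawler logic can be checked without actually waiting.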


Responsible crawling 4 – Take care of your data

You should pay attention to what happens with the data you gathered

Consider which part of the analyses and/or data sets can be displayed publicly

Respect the sensitivity of the information


Good practices & ethics in a nutshell

Respect the hosting site’s wishes

robots.txt as a list of instructions to bots/scrapers

Terms of (Fair) Use – which elements of the website you should not scrape

Respect the hosting site’s bandwidth

Scraping may be costly or result in the site going down

Respect the law

Terms & agreements

Take care of the data

Consider which information can be displayed publicly or not

If possible, use APIs, API mirrors, or collections of data already scraped for you


Requests library

Python core libraries may be enough, but there are more appropriate packages

The Requests library is suitable for complicated HTTP requests, cookies, headers, and other issues


Beautiful what?

Beautiful Soup is a Python library

The name comes from Lewis Carroll’s poem in Alice’s Adventures in Wonderland

The tool is for making sense of the nonsensical

Beautiful Soup, so rich and green,
Waiting in a hot tureen!

Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!


Installing Beautiful Soup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://pypi.python.org/pypi/setuptools

Think about setting up Python virtual environments

Unix-like (Mac and GNU/Linux)

sudo easy_install pip

pip install beautifulsoup4

Other environments

python setup.py install

pip install beautifulsoup4






Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Requests library

Python core libraries may be enough but there are moreappropriate packages

Requests library is suitable for complicated HTTP requestscookies headers and other issues

Motivation and definitionsControversies

Good practicesIntroduction to software

Homework

Request libraryBeautiful SoupScrapy

Beautiful what?

Beautiful Soup is a Python library

The name comes from Lewis Carroll's poem in Alice's Adventures in Wonderland

The tool is for making sense of the nonsensical

Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!

Installing Beautiful Soup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://pypi.python.org/pypi/setuptools

Think about setting up Python virtual environments

Unix-like (Mac and GNU/Linux)

sudo easy_install pip

pip install beautifulsoup4

Other environments

python setup.py install

pip install beautifulsoup4

Problems after installation

The authors of the library suggest that the known problems are usually related to Python 2 vs Python 3 issues

Typical way of working

You work with a file containing Python code and import the necessary libraries

It requires some programming skills
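
A minimal sketch of such a file is shown below: it imports Beautiful Soup, parses a small HTML string, and collects the links. The HTML snippet and the function name extract_links are invented for illustration, and Python's built-in "html.parser" is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# A made-up page standing in for HTML that would normally be downloaded.
HTML = """
<html><head><title>Course page</title></head>
<body>
  <h1>Web scraping</h1>
  <ul>
    <li><a href="/slides/01.pdf">Introduction</a></li>
    <li><a href="/slides/02.pdf">Requests</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return the text and target of every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


links = extract_links(HTML)
for link in links:
    print(link["text"], link["href"])
```

In a real script the HTML string would typically come from an HTTP response body rather than a constant.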

What is Scrapy?

Scrapy is a framework, which means a lot of issues related to scraping have already been solved for you

and you may use “automagical” tools to deal with them (with all the blessings and curses that provides)

Requires minimal knowledge and skills in Python and programming

Handles a lot of the complexity of finding & evaluating links on a website, scraping whole domains or lists of domains

Scrapy – installation

http://scrapy.org

http://doc.scrapy.org/en/latest/intro/install.html

http://pypi.python.org/pypi/setuptools

Think about setting up Python virtual environments (recommended)

Unix-like (Mac and GNU/Linux)

sudo easy_install pip

pip install Scrapy

Other environments

Read the manual (use of Anaconda is recommended)

Scrapy – dependencies

If you encounter problems during the installation, you should probably check the dependencies, e.g.:

lxml – an efficient XML and HTML parser

w3lib – a multi-purpose helper for dealing with URLs and web page encodings

cryptography and pyOpenSSL – to deal with various network-level security needs
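
One way to check whether these dependencies are importable is sketched below using only the standard library. The helper name missing_packages is hypothetical; the mapping pairs each package name with the module it is imported as (pyOpenSSL is imported as OpenSSL).

```python
from importlib.util import find_spec

# Package name as installed with pip -> top-level module name used in imports.
SCRAPY_DEPENDENCIES = {
    "lxml": "lxml",
    "w3lib": "w3lib",
    "cryptography": "cryptography",
    "pyOpenSSL": "OpenSSL",
}


def missing_packages(packages):
    """Return the names of packages whose module cannot be found."""
    return [
        name for name, module in packages.items() if find_spec(module) is None
    ]


missing = missing_packages(SCRAPY_DEPENDENCIES)
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All listed dependencies are importable")
```

This only reports whether a module can be located in the current environment; it does not verify versions or that compiled extensions load correctly.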

Scrapy – typical way of working

You work with projects in a new directory

Each Scrapy object stands for a single page on the website

Each scraper has a few core functions and structures

Crawlers are run from the command line

Troubleshooting usually happens within configuration files (settings.py)

Scrapy – responsible scraping

ROBOTSTXT_OBEY = True

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

DOWNLOAD_DELAY

CONCURRENT_REQUESTS_PER_DOMAIN
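
As a sketch, a settings.py fragment with all four settings filled in could look like the following. The values chosen for DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are assumptions for illustration only; suitable values depend on the website being scraped.

```python
# settings.py (fragment) -- a polite-crawling sketch, values are illustrative

# Download and respect each site's robots.txt before crawling it.
ROBOTSTXT_OBEY = True

# Identify the crawler and give site owners a way to contact you.
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"

# Seconds to wait between requests to the same site (assumed value).
DOWNLOAD_DELAY = 2.0

# Maximum number of simultaneous requests per domain (assumed value).
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

A longer delay and fewer concurrent requests put less load on the server at the cost of a slower crawl.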

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
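
The log line above is the kind of message Scrapy prints when a request is disallowed by robots.txt. The same kind of check can be sketched outside Scrapy with Python's standard urllib.robotparser module. The robots.txt content and the helper name is_allowed below are made up to mirror the log line; this is not how Scrapy implements the check internally.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that forbids every bot from fetching /login.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("http://website.com/login", "http://website.com/"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, "MyCompany-MyCrawler", url) else "forbidden"
    print(url, verdict)
```

For a live site, RobotFileParser can instead be pointed at the real file with set_url() followed by read().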

Homework

Install the libraries and frameworks on your hardware

Read the Introduction and Creating a project sections in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html

  • Motivation and definitions
  • Controversies
    • Legal issues and famous cases
    • Preventing bots
  • Good practices
  • Introduction to software
    • Requests library
    • Beautiful Soup
    • Scrapy
  • Homework