Web Scraping - d2wakvpiukf49j.cloudfront.net · Scrapy Python lib Runs requests in parallel Automa...

WEB SCRAPING By Ben Keith, Quoin Inc., Oct. 13, 2016



WEB SCRAPING
By Ben Keith

Quoin Inc.

Oct. 13, 2016


WHAT IS SCRAPING?
Programmatically collecting useful data from a website that does not have a machine-optimized interface

Scraping is the inverse of rendering an HTML template with data

Normal Flow: data + template -> renderer -> HTML

Scraping: HTML -> scraper -> data + [template]

Look for APIs first
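
The render/scrape symmetry above can be illustrated with a toy round trip. This is only a sketch: the template, the regular expression, and the sample record are invented for illustration, and a real page would rarely be this regular.

```python
import re

# Hypothetical template and matching pattern, invented for this illustration.
TEMPLATE = '<h1>{title}</h1><p class="price">${price}</p>'
PATTERN = re.compile(
    r'<h1>(?P<title>.*?)</h1><p class="price">\$(?P<price>[\d.]+)</p>'
)


def render(data):
    """Normal flow: data + template -> HTML."""
    return TEMPLATE.format(**data)


def scrape(html):
    """Scraping: HTML -> data, using the structure the template left behind."""
    match = PATTERN.search(html)
    return match.groupdict() if match else None


record = {"title": "Blue Widget", "price": "19.99"}
html = render(record)
recovered = scrape(html)
```

Here `recovered` equals the original `record`, which is the sense in which scraping inverts rendering.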


CRAWLING
Traversing a website for the purpose of mirroring/scraping
Done by a crawler/spider


SIMPLEST CASE
Well-structured static HTML served directly by the web server

List/detail pattern
Requirements for doing it in code:

HTTP client
HTML parser (or regex if it's very simple)
CSS/XPath query library (very helpful)

Example: Quoin website news
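
A minimal sketch of the list/detail pattern, using only the Python standard library. The sample markup and the example.com URLs are assumptions made up for illustration; in practice the HTML would come from an HTTP client, and a CSS/XPath query library would replace the hand-written parser class.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical list page; in practice this would be fetched over HTTP.
LIST_PAGE = """
<ul class="news">
  <li><a href="/news/1">First story</a></li>
  <li><a href="/news/2">Second story</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def detail_urls(base_url, html):
    """Return (absolute detail-page URL, link text) pairs from a list page."""
    parser = LinkCollector()
    parser.feed(html)
    return [(urljoin(base_url, href), text) for href, text in parser.links]


items = detail_urls("https://example.com/news/", LIST_PAGE)
```

Each URL in `items` would then be fetched and parsed the same way to pull the fields from the detail page.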


POTENTIAL PROBLEMS
Browsers are very permissive with the markup they accept

Solution: use an equally permissive HTML parsing library
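
As a rough illustration of permissive parsing, the standard library's `html.parser` accepts markup that a strict XML parser would reject. The sloppy sample markup is invented, and the rule that a new `<li>` implicitly closes the previous one is this sketch's own simplification, not a full HTML5 parsing algorithm.

```python
from html.parser import HTMLParser

# Deliberately sloppy markup: unclosed <li> and <b> tags, unquoted attributes.
MESSY = "<ul><li class=item>One<li class=item><b>Two<li class=item>Three</ul>"


class ItemText(HTMLParser):
    """Collect the text of each <li>; a new <li> implicitly ends the last one."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items.append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data


parser = ItemText()
parser.feed(MESSY)
parser.close()
```

The parser reports start tags and text as it finds them and never raises on the missing end tags, so `parser.items` still holds the three list entries.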


MORE DIFFICULT CASE
Well-structured dynamic HTML generated by JavaScript

Initial page request won't return data to scrape
Requirements:

Download/render page using JS/CSS
Parse/query resulting HTML
Basically what a browser does

Example: Vine trends page
If it's JS-rendered, you can quite possibly use AJAX endpoints

Bypass a lot of scraping
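
A sketch of the AJAX shortcut: if the page's own JavaScript loads its data as JSON, the scraper can read that JSON directly instead of rendering the page. The response body, field names, and endpoint URL below are hypothetical; the real ones would be found by watching the browser's network requests.

```python
import json

# Hypothetical JSON body, standing in for what a page's JavaScript might
# fetch from an endpoint such as https://example.com/api/trends.
SAMPLE_RESPONSE = (
    '{"records": [{"title": "Clip A", "loops": 1200},'
    ' {"title": "Clip B", "loops": 950}]}'
)


def parse_trends(body):
    """Turn the endpoint's JSON text into (title, loops) tuples."""
    payload = json.loads(body)
    return [(rec["title"], rec["loops"]) for rec in payload["records"]]


trends = parse_trends(SAMPLE_RESPONSE)
```

No HTML parsing is involved at all, which is why this route bypasses so much of the scraping work.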


PROBLEM: POORLY-STRUCTURED DATA
Needs custom logic to get high quality data
https://www.ncsu.edu/directory/departmental/


ADVANCED TECHNIQUES
Use AI tools to automatically scrape important content from pages
Data mining
Find patterns of repeating structure

What differs between them
What is boilerplate
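
A very simple version of the repeating-structure idea: compare several pages from the same site and treat lines that appear on every page as boilerplate. This is a naive line-based sketch with invented sample pages, not one of the AI or data mining tools the slide refers to.

```python
def split_boilerplate(pages):
    """Split page lines into shared boilerplate and per-page content.

    `pages` is a non-empty list of page texts. A line counts as boilerplate
    when it appears on every page; everything else is treated as content.
    """
    cleaned = [
        [line.strip() for line in page.splitlines() if line.strip()]
        for page in pages
    ]
    boilerplate = set.intersection(*(set(lines) for lines in cleaned))
    content = [
        [line for line in lines if line not in boilerplate]
        for lines in cleaned
    ]
    return boilerplate, content


# Hypothetical detail pages sharing a header and footer.
PAGE_A = "ACME News\nWidgets ship in May\nCopyright ACME"
PAGE_B = "ACME News\nGadget recall announced\nCopyright ACME"

boilerplate, content = split_boilerplate([PAGE_A, PAGE_B])
```

The lines left in `content` are what differs between the pages, which is usually the data worth scraping.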


SEMANTIC WEB (WEB 3.0)


POPULAR TOOLS/LIBRARIES
Scrapy

Python lib
Runs requests in parallel
Automatically follows all links
Basically adds the parallelization framework on top of plain Python

abot
.NET scraper
Commercial version

Mozenda (Windows desktop app)
Many more


GUI TOOLS
Portia

GUI that lets you select elements to scrape
Select sample pages to give it a pattern
Predefined or regex extractors
Seems to work for simple, very well-structured data


PLAIN OLD PYTHON
This seems to be very common
requests, lxml.html.soupparser


CRAWLING AS A SERVICE
Many companies

Some provide the crawling infrastructure and let you submit your own code (ScrapingHub)
Describe what you want and they'll write/run scraper
~$300 for setup and $40/month for data updates (DataHut, others)
Handle any breaking changes

Custom scraper is going to give best data quality


ANTI-CRAWLING TECHNIQUES


ROBOTS.TXT
Ask politely

User-agent: *
Disallow: /
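
For a crawler that does want to honor the request, Python's standard library can evaluate these rules. This sketch feeds the two-line file above to `urllib.robotparser`; the bot name and URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt shown above: every user agent is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# "example-bot" and the URL are placeholders for this illustration.
allowed = rules.can_fetch("example-bot", "https://example.com/news/1")
```

With `Disallow: /` in force, `allowed` is `False` for every path on the site.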


ROBOTS.TXT: COUNTERMEASURES
Ignore it

Isn't legally binding


HONEYPOTS
Create pages that normal users would never access

e.g. CSS-hidden links
Block IP address of host accessing the honeypot page
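
On the site owner's side, the honeypot logic can be sketched in a few lines. The class name, trap path, and IP addresses are hypothetical; a real deployment would hook this into the web server or firewall and would probably expire bans over time.

```python
class HoneypotGuard:
    """Ban any client IP that requests a trap URL."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.banned = set()

    def allow(self, ip, path):
        """Return True if the request should be served."""
        if path in self.trap_paths:
            self.banned.add(ip)
        return ip not in self.banned


guard = HoneypotGuard({"/trap"})
first = guard.allow("203.0.113.5", "/news")    # normal page: served
trapped = guard.allow("203.0.113.5", "/trap")  # hidden link followed: banned
after = guard.allow("203.0.113.5", "/news")    # later requests are refused
other = guard.allow("203.0.113.9", "/news")    # other clients are unaffected
```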


HONEYPOTS: COUNTERMEASURES
Use real rendering engine and test for visibility before following links

Much slower than processing static HTML
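
A real visibility test needs a rendering engine, as noted above. The following is only a crude static approximation that catches links hidden with an inline style or the `hidden` attribute; it will miss anything hidden through external stylesheets or scripts. The sample markup is invented.

```python
import re
from html.parser import HTMLParser

# Inline-style patterns that hide an element. Stylesheet rules are NOT seen.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class VisibleLinks(HTMLParser):
    """Collect hrefs of <a> tags that are not obviously hidden inline."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        hidden = "hidden" in attrs or HIDDEN_STYLE.search(attrs.get("style") or "")
        if attrs.get("href") and not hidden:
            self.hrefs.append(attrs["href"])


PAGE = (
    '<a href="/news">News</a>'
    '<a href="/trap" style="display: none">x</a>'
    '<a href="/trap2" hidden>y</a>'
)

finder = VisibleLinks()
finder.feed(PAGE)
```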


TEXT OBFUSCATION
Images/sprites to render text

Very bad for SEO, accessibility, and performance


TEXT OBFUSCATION: COUNTERMEASURES
OCR of screenshots of page
Would make scraping very difficult

But has so many disadvantages to the site owner


BEHAVIOR ANALYSIS
Look for methodical/unnatural HTTP request patterns

Bots crawl sites as if they were traversing a tree (breadth or depth-first)
Bots tend to access pages in rapid succession
Missing requests for things that normal browsers would download
Machine learning useful here
Captchas to mitigate false positives (Google does this)
Large sites may have many legitimate users on only a few IP addresses
Ban IP if suspicious
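
The "rapid succession" signal is the easiest one to sketch: count each client's requests inside a sliding time window. The thresholds below are arbitrary assumptions, and a flagged client might be shown a captcha rather than banned outright, given the shared-IP caveat above.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flag clients that make too many requests inside a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def record(self, ip, timestamp):
        """Record one request; return True if the client now looks suspicious."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Assumed policy for the example: more than 3 requests per second is suspicious.
monitor = RateMonitor(max_requests=3, window_seconds=1.0)
burst = [monitor.record("203.0.113.5", t) for t in (0.0, 0.1, 0.2, 0.3)]
later = monitor.record("203.0.113.5", 5.0)
```

The fourth request in the burst trips the threshold, while the request four seconds later does not.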


Somehow Bing avoided this


BEHAVIOR ANALYSIS: COUNTERMEASURES
Add randomness to crawl order
Don't crawl too fast from single IP address
Use multiple IP addresses (see below)
Fetch cluster

Load one page with one host and move on to another

HTTP proxy service
Thousands of IP addresses
Multiple countries
Ban detection and auto retry
Basically a (presumably) legal botnet

Crawlera
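
The first two points, random crawl order and a modest request rate, can be sketched as a crawl plan. The function name, URLs, and delay bounds are assumptions for illustration; the actual fetching and sleeping are left to the caller.

```python
import random


def plan_crawl(urls, min_delay, max_delay, rng=None):
    """Return (url, delay_seconds) pairs in random order with random pauses."""
    rng = rng or random.Random()
    order = list(urls)
    rng.shuffle(order)
    return [(url, rng.uniform(min_delay, max_delay)) for url in order]


# Hypothetical URLs; a seeded generator keeps the example repeatable.
URLS = [f"https://example.com/item/{n}" for n in range(1, 6)]
plan = plan_crawl(URLS, min_delay=2.0, max_delay=6.0, rng=random.Random(0))
```

A crawler would fetch each URL in `plan` order and sleep for the paired delay before the next request, which also keeps the load on the target server low.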


REGISTRATION/CAPTCHA
Require user registration/Captcha to even view data

Easy to trace usage for individual accounts and ban
Use strong Captcha to prevent bot registrations

Require phone/credit card verification (e.g. Facebook)
Very poor user experience

Lost revenue


REGISTRATION/CAPTCHA: COUNTERMEASURES
Captcha solving services

Hire the poor in Southeast Asia
Prices seem to average ~$1 per 1000 captchas in bulk
~10 sec response time

For phone: disposable SIM cards/disposable debit cards
Very burdensome


CHANGE PAGE STRUCTURE
Periodically change the markup structure of your pages

Change CSS class names
Change DOM IDs
Change hierarchy
Somewhat burdensome on developer resources
Supposedly one of the most effective means
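
One way a site might lower the developer burden of renaming classes is to derive the public names from a per-release salt at build time. This is an assumption offered for illustration, not something the talk describes; the function name and salts are hypothetical.

```python
import hashlib


def rotate_class_name(name, release_salt):
    """Derive an opaque class name that changes whenever the salt changes."""
    digest = hashlib.sha256(f"{release_salt}:{name}".encode("utf-8")).hexdigest()
    # Prefix with a letter so the result is always a valid CSS identifier.
    return "c" + digest[:8]


# The same logical class maps to different public names in each release.
march = rotate_class_name("price", "release-2016-03")
april = rotate_class_name("price", "release-2016-04")
```

Templates and stylesheets would both call the same function at build time, so the site stays consistent while selectors written against last month's markup stop matching.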


CHANGE PAGE STRUCTURE: COUNTERMEASURES
Change scraping logic

Increases maintenance cost of scraping
Use AI techniques to scrape data independent of markup


IS IT LEGAL?
Many overlapping legal issues


Copyright violations
HTML may be copyrightable (Media.net v. Netseer)

Sufficiently creative
Facebook v. Power.com

Ephemeral, unauthorized downloads of copyrighted material
Boilerplate text, videos, photos, etc.

Had to download whole page that had copyrighted material
Rejected Connect API over license

DMCA


Breach of Contract (Terms of Use)
Browsewrap vs clickwrap
Browsewrap is legally very weak (Cvent v. Eventbrite)
Clickwrap impractical for public sites


Computer Fraud and Abuse Act (CFAA; federal law)
Passed in 1986 as criminal law
Amendment in 1994 allowed civil actions
Exceeding authorization

A site owner telling you to stop accessing the site revokes authorization
Terms of Use (browsewrap)

Must cause damage or loss directly related to computer access
e.g. by excess server load that causes lost traffic or server crash
Must be intentional
Does not include loss from use of stolen data


Common to see multiple claims in one lawsuit to maximize chances of success
"Proxy defenses" usually don't work
Other charges include:

Trespass to chattels (not used as much anymore)
Interfering with another's possession of property (e.g. servers)

Unjust enrichment


EBAY V. BIDDER'S EDGE
First major scraping case from 2000
BE was an auction site aggregator
In 1999, eBay allowed BE to crawl site for 90 days
Failed to formalize license agreement

eBay wanted on-demand crawling
BE wanted periodic crawling

At end of 90 days, BE continued crawling despite no agreement


EBAY V. BIDDER'S EDGE (CONT.)
IPs were blocked but BE used proxies
eBay sought preliminary injunction to block BE from accessing the site

Claimed 8 different causes
eBay was granted injunction based on trespass to chattels claim

Based on hypothetical harm to servers if anyone could freely crawl
This reasoning later rejected in separate case

Settled out of court


AA V. FARECHASE
From 2003
Farechase sold software that specifically scraped AA.com
AA sent cease & desist
Scraping violated terms of use
AA got injunction barring software sales

Based primarily on trespass to chattels


CRAIGSLIST V. 3TAPS
Federal court in California in 2012
Scraped CL house listings to resell
CL made all posts copyrighted for three weeks
CL sent cease-and-desist letter to 3Taps and blocked IPs

3Taps ignored it and used proxies
Court ruled copyright claims were enforceable and that CFAA was violated
$1M settlement and permanent injunction

The court likened its decision to allowing a store to open itself to the public but also to ban a disruptive person if it needed to.


QVC V. RESULTLY
Resultly: Web retail aggregator
Had no prior relationship to QVC
Overloaded QVC's servers for two days

Claimed $2M in lost revenue
Peaked at ~400 reqs/sec

QVC sought preliminary injunction to freeze Resultly's assets based on CFAA claim
Denied due to lack of intentionality
Resultly couldn't have known that QVC's servers would crash


CONCLUSION
If the site tells you to stop and you keep doing it, you're probably in a bad position in court

Scraping is ubiquitous but of dubious legality


THE END