Web crawlers part-2-20161104

Captcha for learning and collecting data

Transcript of Web crawlers part-2-20161104

Page 1: Web crawlers part-2-20161104

• Captcha for learning and collecting data

Page 2: Web crawlers part-2-20161104

GOOGLE ENERGY REDUCTION

Page 4: Web crawlers part-2-20161104

RECAPTCHA

Page 7: Web crawlers part-2-20161104

MACHINE LEARNING

Page 8: Web crawlers part-2-20161104

IS YOUR BROWSER UNIQUE?

• https://amiunique.org

• https://panopticlick.eff.org
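
Both sites fingerprint a browser from what it sends, and the same signals work against a crawler: default library headers give a script away immediately. A minimal sketch for inspecting what your own crawler sends, assuming httpbin.org (not part of the slides) as an echo service:

import requests

# httpbin.org/headers replies with the headers it received, so this
# shows exactly what a server sees when our script connects
resp = requests.get("https://httpbin.org/headers")
print(resp.json()["headers"])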

Page 9: Web crawlers part-2-20161104

ROBOTS.TXT

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
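
The standard library can check these rules before crawling; a minimal sketch with urllib.robotparser, assuming sampleshop.pl served the robots.txt shown above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://sampleshop.pl/robots.txt")
rp.read()

# With the rules above, /cgi-bin/ is disallowed for every agent
print(rp.can_fetch("*", "http://sampleshop.pl/cgi-bin/search"))  # -> False
print(rp.can_fetch("*", "http://sampleshop.pl/shop"))            # -> True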

Page 10: Web crawlers part-2-20161104

SITEMAPS

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
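
A crawler can read the sitemap instead of discovering links itself; a minimal parsing sketch with xml.etree.ElementTree, assuming the file sits at the conventional /sitemap.xml path:

from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Sitemap elements live in this XML namespace (see the xmlns above)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

xml = urlopen("http://www.example.com/sitemap.xml").read()
root = ET.fromstring(xml)

# Print every page URL listed in the sitemap
for url in root.findall("sm:url", NS):
    print(url.find("sm:loc", NS).text)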

Page 11: Web crawlers part-2-20161104

CRAWLER TRAPS

• Captcha

• Hidden fields

• Hidden in CSS (see the detection sketch below)
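
Hidden form fields and CSS-hidden inputs are bait: a human never sees or fills them, so a crawler that does exposes itself. A minimal detection sketch with BeautifulSoup (the form markup here is assumed, not taken from the slides):

from bs4 import BeautifulSoup

html = """
<form>
  <input type="hidden" name="trap1" value="do-not-touch">
  <input type="text" name="email" style="display:none">
  <input type="text" name="name">
</form>
"""
bsObj = BeautifulSoup(html, "html.parser")

# Flag inputs that are type="hidden" or hidden via inline CSS;
# a careful crawler should leave these untouched when submitting
for field in bsObj.find_all("input"):
    if field.get("type") == "hidden" or "display:none" in field.get("style", ""):
        print("possible trap field:", field.get("name"))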

Page 12: Web crawlers part-2-20161104

ONION ;)

import socks
import socket
from urllib.request import urlopen

# Route every socket through Tor's SOCKS5 proxy; Tor Browser
# listens on port 9150 by default (the standalone tor daemon uses 9050)
socks.set_default_proxy(socks.SOCKS5, "localhost", 9150)
socket.socket = socks.socksocket

# icanhazip.com echoes your public IP, which should now be a Tor exit node
print(urlopen('http://icanhazip.com').read())

Page 13: Web crawlers part-2-20161104

URLOPEN

from urllib.request import urlopen

html = urlopen("http://sampleshop.pl/product/happy-ninja-2/")

print(html.read())
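
urlopen raises an exception instead of returning when the request fails, so a crawler that visits many pages should expect that; a minimal sketch, assuming the same sample shop URL:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    html = urlopen("http://sampleshop.pl/product/happy-ninja-2/")
except HTTPError as e:
    # The server answered, but with an error status (404, 500, ...)
    print("HTTP error:", e.code)
except URLError as e:
    # The server could not be reached at all
    print("Server unreachable:", e.reason)
else:
    print(html.read())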

Page 14: Web crawlers part-2-20161104

“BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.”

Page 15: Web crawlers part-2-20161104

BEAUTIFUL SOUP

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://sampleshop.pl")
bsObj = BeautifulSoup(html.read(), "html.parser")

# Print the first <h1> tag in the parsed document
print(bsObj.h1)

Page 16: Web crawlers part-2-20161104

REQUESTS

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

url = "https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending"

req = session.get(url, headers=headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("table", {"class": "table-striped"}).get_text())

Page 17: Web crawlers part-2-20161104

REQUESTS

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

url = "http://sampleshop.pl/shop"

req = session.get(url, headers=headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("p", {"class": "site-description"}).get_text())