Web crawlers part-2-20161104

Captcha for learning and collecting data

Transcript of Web crawlers part-2-20161104

Page 1: Web crawlers part-2-20161104

• Captcha for learning and collecting data

Page 2: Web crawlers part-2-20161104

GOOGLE ENERGY REDUCTION

Page 4: Web crawlers part-2-20161104

RECAPTCHA

Page 7: Web crawlers part-2-20161104

MACHINE LEARNING

Page 8: Web crawlers part-2-20161104

IS YOUR BROWSER UNIQUE?

• https://amiunique.org

• https://panopticlick.eff.org
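
Both sites fingerprint a browser from what it sends, and the same signals work against a crawler: default library headers give a script away immediately. A minimal sketch for inspecting what your own crawler sends, assuming httpbin.org (not part of the slides) as an echo service:

import requests

# httpbin.org/headers replies with the headers it received, so this
# shows exactly what a server sees when our script connects
resp = requests.get("https://httpbin.org/headers")
print(resp.json()["headers"])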

Page 9: Web crawlers part-2-20161104

ROBOTS.TXT

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
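
The standard library can check these rules before crawling; a minimal sketch with urllib.robotparser, assuming sampleshop.pl served the robots.txt shown above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://sampleshop.pl/robots.txt")
rp.read()

# With the rules above, /cgi-bin/ is disallowed for every agent
print(rp.can_fetch("*", "http://sampleshop.pl/cgi-bin/search"))  # -> False
print(rp.can_fetch("*", "http://sampleshop.pl/shop"))            # -> True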

Page 10: Web crawlers part-2-20161104

SITEMAPS

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
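
A crawler can read the sitemap instead of discovering links itself; a minimal parsing sketch with xml.etree.ElementTree, assuming the file sits at the conventional /sitemap.xml path:

from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Sitemap elements live in this XML namespace (see the xmlns above)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

xml = urlopen("http://www.example.com/sitemap.xml").read()
root = ET.fromstring(xml)

# Print every page URL listed in the sitemap
for url in root.findall("sm:url", NS):
    print(url.find("sm:loc", NS).text)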

Page 11: Web crawlers part-2-20161104

CRAWLER TRAPS

• Captcha

• Hidden fields

• Hidden in CSS (see the detection sketch below)
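
Hidden form fields and CSS-hidden inputs are bait: a human never sees or fills them, so a crawler that does exposes itself. A minimal detection sketch with BeautifulSoup (the form markup here is assumed, not taken from the slides):

from bs4 import BeautifulSoup

html = """
<form>
  <input type="hidden" name="trap1" value="do-not-touch">
  <input type="text" name="email" style="display:none">
  <input type="text" name="name">
</form>
"""
bsObj = BeautifulSoup(html, "html.parser")

# Flag inputs that are type="hidden" or hidden via inline CSS;
# a careful crawler should leave these untouched when submitting
for field in bsObj.find_all("input"):
    if field.get("type") == "hidden" or "display:none" in field.get("style", ""):
        print("possible trap field:", field.get("name"))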

Page 12: Web crawlers part-2-20161104

ONION ;)

import socks
import socket
from urllib.request import urlopen

# Route every socket through Tor's SOCKS5 proxy; Tor Browser
# listens on port 9150 by default (the standalone tor daemon uses 9050)
socks.set_default_proxy(socks.SOCKS5, "localhost", 9150)
socket.socket = socks.socksocket

# icanhazip.com echoes your public IP, which should now be a Tor exit node
print(urlopen('http://icanhazip.com').read())

Page 13: Web crawlers part-2-20161104

URLOPEN

from urllib.request import urlopen

html = urlopen("http://sampleshop.pl/product/happy-ninja-2/")

print(html.read())
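
urlopen raises an exception instead of returning when the request fails, so a crawler that visits many pages should expect that; a minimal sketch, assuming the same sample shop URL:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    html = urlopen("http://sampleshop.pl/product/happy-ninja-2/")
except HTTPError as e:
    # The server answered, but with an error status (404, 500, ...)
    print("HTTP error:", e.code)
except URLError as e:
    # The server could not be reached at all
    print("Server unreachable:", e.reason)
else:
    print(html.read())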

Page 14: Web crawlers part-2-20161104

“BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.”

Page 15: Web crawlers part-2-20161104

BEAUTIFUL SOUP

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://sampleshop.pl")
bsObj = BeautifulSoup(html.read(), "html.parser")

# Print the first <h1> tag in the parsed document
print(bsObj.h1)

Page 16: Web crawlers part-2-20161104

REQUESTS

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

url = "https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending"

req = session.get(url, headers=headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("table", {"class": "table-striped"}).get_text())

Page 17: Web crawlers part-2-20161104

REQUESTS

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

url = "http://sampleshop.pl/shop"

req = session.get(url, headers=headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("p", {"class": "site-description"}).get_text())