Post on 16-Apr-2017
• Captcha for learning and collecting data
GOOGLE ENERGY REDUCTION
RECAPTCHA
MACHINE LEARNING
web crawlers part 2
IS YOUR BROWSER UNIQUE?
• https://amiunique.org
• https://panopticlick.eff.org
ROBOTS.TXT
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
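A crawler can check these rules before fetching a page. The following is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `MyCrawler` and the `www.example.com` URLs are assumptions for illustration, and the rules are parsed from a string rather than downloaded from a live site.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "MyCrawler" is a hypothetical user agent; it falls under the "*" rules.
allowed_index = parser.can_fetch("MyCrawler", "http://www.example.com/index.html")
allowed_tmp = parser.can_fetch("MyCrawler", "http://www.example.com/tmp/cache.html")

print(allowed_index, allowed_tmp)  # True False
```

Against a real site, `set_url()` followed by `read()` would fetch the file instead of `parse()`.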
SITEMAPS
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
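A sitemap gives the crawler a ready list of URLs to visit. The sketch below reads the sample sitemap above with the standard `xml.etree.ElementTree` module. The helper name `parse_sitemap` and the `sm` namespace prefix are assumptions made for this example; the XML is embedded as a string instead of being downloaded.

```python
import xml.etree.ElementTree as ET

# The sample sitemap from the slide.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace; "sm" is just a local prefix.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list:
    """Return one dict per <url> entry (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": float(url.findtext("sm:priority", default="0.5", namespaces=NS)),
        })
    return entries


entries = parse_sitemap(SITEMAP_XML)
for entry in entries:
    print(entry["loc"], entry["lastmod"], entry["priority"])
```

The `0.5` fallback mirrors the default priority defined by the sitemaps.org protocol for entries that omit `<priority>`.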
CRAWLER TRAPS
• Captcha
• Hidden fields
• Hidden in CSS
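Hidden form fields are a common trap: a human never sees them, so a bot that fills them in or changes them gives itself away. Below is a rough sketch that lists inputs a crawler should leave untouched, using only the standard `html.parser` module. The sample form, the field names, and the class `HiddenFieldFinder` are all hypothetical. It only inspects `type="hidden"` and inline `style` attributes; fields hidden through an external stylesheet would need the CSS to be evaluated as well.

```python
from html.parser import HTMLParser

# Hypothetical form with one ordinary field and two hidden ones.
SAMPLE_FORM = """
<form action="/signup" method="post">
  <input type="text" name="email">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="email_confirm" style="display: none">
</form>
"""


class HiddenFieldFinder(HTMLParser):
    """Collects names of <input> elements a human visitor would not see."""

    def __init__(self):
        super().__init__()
        self.hidden_names = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Normalise the inline style so "display: none" and "display:none" match.
        style = (attr.get("style") or "").replace(" ", "").lower()
        is_hidden = (
            (attr.get("type") or "").lower() == "hidden"
            or "display:none" in style
            or "visibility:hidden" in style
        )
        if is_hidden:
            self.hidden_names.append(attr.get("name"))


def find_hidden_inputs(html: str) -> list:
    finder = HiddenFieldFinder()
    finder.feed(html)
    return finder.hidden_names


hidden = find_hidden_inputs(SAMPLE_FORM)
print(hidden)  # ['token', 'email_confirm']
```

A crawler would typically submit such fields with their original values and never type into them.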
ONION ;)
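import socks
import socket
from urllib.request import urlopen

# Route every new socket through the local Tor SOCKS5 proxy
# (9150 is the default port used by Tor Browser).
socks.set_default_proxy(socks.SOCKS5, "localhost", 9150)
socket.socket = socks.socksocket
# Prints the IP address the remote site sees.
print(urlopen('http://icanhazip.com').read())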
URLOPEN
from urllib.request import urlopen

html = urlopen("http://sampleshop.pl/product/happy-ninja-2/")
print(html.read())
“BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily-traversible
Python objects representing XML structures.”
BEAUTIFUL SOUP
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://sampleshop.pl")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
REQUESTS
import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Send browser-like headers instead of the library defaults.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}
url = "https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending"
req = session.get(url, headers=headers)
bsObj = BeautifulSoup(req.text, "html.parser")
# get_text() must be called; without the parentheses only the method object is printed.
print(bsObj.find("table", {"class": "table-striped"}).get_text())
REQUESTS
import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Same browser-like headers, this time sent to the sample shop.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}
url = "http://sampleshop.pl/shop"
req = session.get(url, headers=headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("p", {"class": "site-description"}).get_text())