Web Scraping

Page 1: Web Scraping

WEB SCRAPING

Page 2: Web Scraping

WHAT IS WEB SCRAPING

• WEB SCRAPING IS THE TERM FOR USING A PROGRAM TO DOWNLOAD AND PROCESS CONTENT FROM THE WEB.
• FOR INSTANCE, GOOGLE RUNS SEVERAL WEB SCRAPING PROGRAMS TO INDEX WEB PAGES FOR ITS SEARCH ENGINE.
• WEB SITES ARE WRITTEN USING HTML, WHICH MEANS THAT EACH WEB PAGE IS A STRUCTURED DOCUMENT. SOMETIMES IT WOULD BE GREAT TO OBTAIN SOME DATA FROM THEM AND PRESERVE THE STRUCTURE WHILE WE’RE AT IT.
• WEB SITES DON’T ALWAYS PROVIDE THEIR DATA IN COMFORTABLE FORMATS SUCH AS CSV OR JSON.
• THIS IS WHERE WEB SCRAPING COMES IN, UNLESS THE SITE OFFERS AN API YOU CAN USE INSTEAD.
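
• A MINIMAL SKETCH OF THE IDEA ABOVE: TURNING A STRUCTURED HTML TABLE INTO CSV. IT USES THE STANDARD LIBRARY'S html.parser RATHER THAN ANY SCRAPING LIBRARY, AND THE SAMPLE HTML, THE CLASS TableRows AND THE FUNCTION html_table_to_csv ARE HYPOTHETICAL NAMES CHOSEN FOR ILLUSTRATION.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed HTML


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator='\n').writerows(parser.rows)
    return out.getvalue()


# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>State</th></tr>
  <tr><td>Wilmington</td><td>DE</td></tr>
</table>
"""

csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

• THE ROW AND CELL STRUCTURE OF THE HTML CARRIES OVER DIRECTLY INTO THE ROWS AND COLUMNS OF THE CSV.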

Page 3: Web Scraping

THE WEB SCRAPING MODULES

• SEVERAL MODULES AID US IN THE WEB SCRAPING PROCESS:
• WEBBROWSER – COMES WITH PYTHON AND OPENS A BROWSER TO A SPECIFIC PAGE
• REQUESTS – DOWNLOADS FILES AND WEB PAGES FROM THE INTERNET
• BEAUTIFUL SOUP – PARSES HTML

• WEB SCRAPING PROCESS:
• 1. USE WEBBROWSER TO OPEN A URL
• 2. USE REQUESTS TO DOWNLOAD THE CONTENT
• 3. USE BEAUTIFUL SOUP TO PARSE AND SEARCH THE INFORMATION

Page 4: Web Scraping

THE WEBBROWSER MODULE

• THE WEBBROWSER MODULE’S open() FUNCTION LAUNCHES A NEW BROWSER TO A SPECIFIED URL.
• EX. webbrowser.open('http://google.com')

• open() MAKES SOME INTERESTING SCRIPTS POSSIBLE:
• FOR EXAMPLE, A SCRIPT THAT LAUNCHES GOOGLE MAPS USING AN ADDRESS TAKEN FROM THE COMMAND LINE ARGUMENTS OR COPIED TO YOUR CLIPBOARD.

Page 5: Web Scraping

WEBBROWSER EXAMPLE

• FIRST YOU NEED TO STUDY THE URL FOR GOOGLE MAPS ADDRESSES:
• https://www.google.com/maps/place/333+N+Shipley+St,+Wilmington,+DE+19801/@39.7404685,-75.5546281,17z/data=!3m1!4b1!4m5!3m4!1s0x89c6fd69b80cab35:0xb7d1ad3cac62fb67!8m2!3d39.740503!4d-75.552341

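import webbrowser
import sys
import pyperclip  # third-party: pip install pyperclip

if len(sys.argv) > 1:
    # Get the address from the command line arguments.
    address = ' '.join(sys.argv[1:])
else:
    # Otherwise, get the address from the clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)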
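
• THE SAMPLE URL ABOVE USES + IN PLACE OF SPACES. A MINIMAL SKETCH OF BUILDING SUCH A URL SAFELY WITH THE STANDARD LIBRARY'S urllib.parse.quote_plus FOLLOWS. THE NAMES MAPS_PLACE_URL AND build_maps_url ARE HYPOTHETICAL, AND IT IS AN ASSUMPTION THAT GOOGLE MAPS TREATS A PERCENT-ENCODED COMMA THE SAME AS A LITERAL ONE.

```python
from urllib.parse import quote_plus

# Base of the place URL, taken from the example on this slide.
MAPS_PLACE_URL = 'https://www.google.com/maps/place/'


def build_maps_url(address):
    """Return a Google Maps place URL for a free-form street address."""
    # quote_plus turns spaces into '+' and percent-encodes characters
    # such as ',' and '#' that could otherwise break the URL.
    return MAPS_PLACE_URL + quote_plus(address.strip())


maps_url = build_maps_url('333 N Shipley St, Wilmington, DE 19801')
print(maps_url)
```

• THE RESULTING STRING CAN BE PASSED TO webbrowser.open() IN PLACE OF THE PLAIN STRING CONCATENATION.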

Page 6: Web Scraping

THE REQUESTS MODULE

• THE REQUESTS MODULE LETS YOU EASILY DOWNLOAD FILES AND WEB PAGES FROM THE WEB.
• IT DOES NOT COME WITH PYTHON, SO INSTALL IT FIRST: pip install requests

• THE get() FUNCTION TAKES A STRING OF A URL TO DOWNLOAD AND RETURNS A RESPONSE OBJECT.

• THE RESPONSE'S raise_for_status() METHOD CHECKS WHETHER THE DOWNLOAD WAS A SUCCESS AND RAISES AN EXCEPTION IF IT WAS NOT.
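
• A MINIMAL SKETCH OF HOW raise_for_status() BEHAVES, RUN WITHOUT NETWORK ACCESS. THE Response OBJECTS ARE BUILT BY HAND PURELY FOR ILLUSTRATION (NORMALLY requests.get() RETURNS THEM), AND THE HELPER NAME status_problem IS HYPOTHETICAL.

```python
import requests


def status_problem(response):
    """Return an error message if the response signals failure, else None."""
    try:
        # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return str(exc)
    return None


# Hand-built responses stand in for real downloads (assumption for the demo).
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404

ok_problem = status_problem(ok)            # success: no exception raised
missing_problem = status_problem(missing)  # failure: message mentions 404
print(ok_problem)
print(missing_problem)
```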

Page 7: Web Scraping

REQUESTS EXAMPLE

import requests

res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
else:
    # Save the download in binary mode, up to 100,000 bytes at a time.
    # The file is only written when the download succeeded.
    with open('RandJ.txt', 'wb') as play_file:
        for chunk in res.iter_content(100000):
            play_file.write(chunk)

Page 8: Web Scraping

THE BEAUTIFUL SOUP MODULE