Web Scraping
Uploaded by primeteacher32
WEB SCRAPING
WHAT IS WEB SCRAPING
• WEB SCRAPING IS THE TERM FOR USING A PROGRAM TO DOWNLOAD AND PROCESS CONTENT FROM THE WEB.
• FOR INSTANCE, GOOGLE RUNS SEVERAL WEB SCRAPING PROGRAMS TO INDEX WEB PAGES FOR ITS SEARCH ENGINE.
• WEB SITES ARE WRITTEN USING HTML, WHICH MEANS THAT EACH WEB PAGE IS A STRUCTURED DOCUMENT. SOMETIMES IT WOULD BE GREAT TO OBTAIN SOME DATA FROM THEM AND PRESERVE THE STRUCTURE WHILE WE’RE AT IT.
• WEB SITES DON’T ALWAYS PROVIDE THEIR DATA IN COMFORTABLE FORMATS SUCH AS CSV OR JSON.
• THIS IS WHERE WEB SCRAPING COMES IN, UNLESS YOU HAVE AN API TO USE.
THE WEB SCRAPING MODULES
• SEVERAL PYTHON MODULES AID US IN THE WEB SCRAPING PROCESS:
• WEBBROWSER – COMES WITH PYTHON AND OPENS A BROWSER TO A SPECIFIC PAGE
• REQUESTS – DOWNLOADS FILES AND WEB PAGES FROM THE INTERNET
• BEAUTIFUL SOUP – PARSES HTML
• WEB SCRAPING PROCESS:
• 1. USE WEBBROWSER TO OPEN A URL
• 2. USE REQUESTS TO DOWNLOAD THE CONTENT
• 3. USE BEAUTIFUL SOUP TO PARSE AND SEARCH THE INFORMATION
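• THE PARSE-AND-SEARCH STEP CAN BE TRIED WITHOUT A NETWORK CONNECTION. THIS IS AN ILLUSTRATION UNDER ASSUMPTIONS, NOT CODE FROM THESE SLIDES: IT USES THE STANDARD LIBRARY MODULE html.parser IN PLACE OF BEAUTIFUL SOUP SO THAT NOTHING HAS TO BE INSTALLED, AND THE NAMES LinkCollector, find_links, AND SAMPLE_PAGE ARE HYPOTHETICAL. SAMPLE_PAGE STANDS IN FOR THE TEXT THAT REQUESTS WOULD DOWNLOAD IN STEP 2.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_links(html_text):
    """Parse the HTML text and return the link targets found in it."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Stand-in for the page text that the download step would return.
SAMPLE_PAGE = """
<html><body>
  <h1>Example</h1>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</body></html>
"""

if __name__ == "__main__":
    for link in find_links(SAMPLE_PAGE):
        print(link)
```

• WITH A REAL PAGE, THE STRING PASSED TO find_links WOULD BE THE DOWNLOADED TEXT INSTEAD OF SAMPLE_PAGE.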
THE WEBBROWSER MODULE
• THE WEBBROWSER MODULE’S open() FUNCTION OPENS A SPECIFIED URL IN YOUR DEFAULT BROWSER.
• THE open() FUNCTION CAN DO SOME INTERESTING THINGS:
• WRITE A SCRIPT TO LAUNCH GOOGLE MAPS USING AN ADDRESS THAT WAS COPIED TO YOUR CLIPBOARD OR PASSED AS COMMAND LINE ARGUMENTS.
• EX. webbrowser.open('http://google.com')
WEBBROWSER EXAMPLE
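• FIRST YOU NEED TO STUDY THE URL FOR GOOGLE MAPS ADDRESSES
• https://www.google.com/maps/place/333+N+SHIPLEY+ST,+WILMINGTON,+DE+19801/@39.7404685,-75.5546281,17z/data=!3m1!4b1!4m5!3m4!1s0x89c6fd69b80cab35:0xb7d1ad3cac62fb67!8m2!3d39.740503!4d-75.552341
• THE PART AFTER THE ADDRESS (STARTING AT /@) IS NOT NEEDED TO FIND THE PLACE; https://www.google.com/maps/place/ FOLLOWED BY THE ADDRESS IS ENOUGH.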
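    import sys
    import webbrowser

    import pyperclip  # third-party: pip install pyperclip

    # Take the address from the command line if one was given,
    # otherwise from the clipboard.
    if len(sys.argv) > 1:
        address = ' '.join(sys.argv[1:])
    else:
        address = pyperclip.paste()

    webbrowser.open('http://www.google.com/maps/place/' + address)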
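• THE SCRIPT ABOVE APPENDS THE ADDRESS TO THE URL UNCHANGED, SO SPACES AND COMMAS GO INTO THE URL RAW. BELOW IS A MINIMAL SKETCH OF ONE WAY TO ENCODE THE ADDRESS FIRST, ASSUMING THE STANDARD LIBRARY FUNCTION urllib.parse.quote_plus IS ACCEPTABLE HERE (THE SLIDES DO NOT USE IT). THE NAMES MAPS_PREFIX, pick_address, build_maps_url, AND open_map ARE HYPOTHETICAL, AND THE CLIPBOARD TEXT IS PASSED IN AS A PLAIN STRING SO THE SKETCH DOES NOT NEED pyperclip.

```python
import webbrowser
from urllib.parse import quote_plus

# Prefix taken from the example Google Maps URL studied above.
MAPS_PREFIX = "https://www.google.com/maps/place/"


def pick_address(argv, clipboard_text):
    """Use the command line arguments if any were given, else the clipboard text."""
    if len(argv) > 1:
        return " ".join(argv[1:])
    return clipboard_text


def build_maps_url(address):
    """Encode the address (spaces become +, commas become %2C) and append it."""
    return MAPS_PREFIX + quote_plus(address.strip())


def open_map(address):
    """Open the default browser on the map page for the address."""
    return webbrowser.open(build_maps_url(address))
```

• FOR EXAMPLE, WITH sys AND pyperclip IMPORTED, open_map(pick_address(sys.argv, pyperclip.paste())) WOULD BEHAVE LIKE THE SCRIPT ABOVE, BUT WITH THE ADDRESS ENCODED.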
THE REQUESTS MODULE
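• THE REQUESTS MODULE LETS YOU EASILY DOWNLOAD FILES AND WEB PAGES FROM THE WEB.
• IT IS NOT PREINSTALLED WITH PYTHON, SO INSTALL IT WITH pip install requests
• THE get() FUNCTION TAKES A STRING OF A URL TO DOWNLOAD AND RETURNS A RESPONSE OBJECT.
• THE RESPONSE’S raise_for_status() METHOD CHECKS IF THE DOWNLOAD WAS A SUCCESS: IT RAISES AN EXCEPTION IF THE SERVER RETURNED AN ERROR STATUS AND DOES NOTHING OTHERWISE.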
REQUESTS EXAMPLE
    import requests

    res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
    try:
        res.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
    else:
        # Only save the file when the download succeeded.
        # Write in binary mode, up to 100,000 bytes at a time.
        with open('RandJ.txt', 'wb') as playFile:
            for chunk in res.iter_content(100000):
                playFile.write(chunk)
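• THE DOWNLOAD LOOP ABOVE WRITES THE FILE ONE CHUNK AT A TIME. THE SAME PATTERN CAN BE TRIED WITHOUT A NETWORK CONNECTION. THIS IS A ROUGH SKETCH, ASSUMING ANY ITERABLE OF BYTES CAN STAND IN FOR res.iter_content(100000); split_into_chunks AND save_chunks ARE HYPOTHETICAL HELPERS, NOT PART OF REQUESTS.

```python
import os
import tempfile


def split_into_chunks(data, chunk_size):
    """Hypothetical stand-in for res.iter_content(chunk_size): yield bytes pieces."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def save_chunks(chunks, path):
    """Write each chunk to path in binary mode and return the total byte count."""
    total = 0
    # The with statement closes the file even if a write fails.
    with open(path, "wb") as out_file:
        for chunk in chunks:
            total += out_file.write(chunk)
    return total


if __name__ == "__main__":
    body = b"sample line\n" * 50000
    target = os.path.join(tempfile.mkdtemp(), "chunk_demo.txt")
    written = save_chunks(split_into_chunks(body, 100000), target)
    print(written, "bytes written to", target)
```

• WRITING IN CHUNKS KEEPS MEMORY USE BOUNDED NO MATTER HOW LARGE THE DOWNLOAD IS, WHICH IS WHY THE FILE IS OPENED ONCE AND WRITTEN TO INSIDE THE LOOP.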
THE BEAUTIFUL SOUP MODULE