Scraping Handout


Basic Web Scraping

Scraping is a process by which you can extract data from an HTML page or PDF into a CSV or other format, so you can work with it in Excel or another spreadsheet and use it in your visualizations. Sometimes you can just copy and paste data from an HTML table into a spreadsheet, and all will be fine. Other times, not, and you need a tool or program that can identify and extract the data you want. There are numerous ways to do this, some that involve programming and some that don’t.

Scraping with Google Spreadsheet

If copying and pasting directly into a spreadsheet doesn’t work, you can try using Google Spreadsheet functions to scrape the data. Open a new Google Spreadsheet. We are going to scrape data from the Texas Music Office site. The URL http://governor.state.tx.us/music/musicians/talent/talent/ goes to the first page of the directory, listing the bands that start with A. In the first cell of your spreadsheet, type the function:

=ImportHtml("http://governor.state.tx.us/music/musicians/talent/talent/", "table", 1)

• First argument is the URL.
• Second argument tells it to look for a table (the other element allowed here is “list”).
• Third argument is the index of the table (if there are multiple tables on the page). You will have to look at the HTML to find the table you are trying to get data from, or find it through trial and error by changing the third number.

Give it a couple of seconds and you should see the data directly from the table in your spreadsheet. Easy!
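ImportHtml works the same way for lists. If the data you want is in a bulleted or numbered list instead of a table, change the second argument to “list” and adjust the index until the right one appears. For example (the URL and index here are just placeholders; substitute the page you are actually scraping):

=ImportHtml("http://example.com/directory", "list", 2)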

Google Spreadsheets has a few other functions that can be helpful in scraping data.

• ImportFeed will scrape from an RSS feed (it also takes optional arguments; see the example after this list). Try:
=ImportFeed("http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss")
Find any RSS feed by looking for the RSS icon. This link pulls current items from Google News.

• If your data is already in CSV format, you can save a step in bringing it into your spreadsheet by using ImportData. This will also scrape the data directly from the site, so it will pull from the most recent version of the file. Try:
=ImportData("http://www.census.gov/popest/data/national/totals/2014/files/NST_EST2014_ALLDATA.csv")

Of course, you can also just open the CSV in the spreadsheet program and follow the instructions for importing it.
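As noted above, ImportFeed also accepts optional arguments: the second asks for a specific part of the feed, the third says whether to include a header row, and the fourth limits how many items come back. Here is a sketch using the same Google News feed (the exact values are just examples):

=ImportFeed("http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss", "items title", FALSE, 10)

This version would return just the titles of up to ten items.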

Chrome Scraper Extension

https://chrome.google.com/extensions/detail/mbigbapnjcgaffohmbkdlecaccepngjd is a free extension for Chrome. Select content on a page, then use the context menu to choose Scrape Similar. You can export the results to a Google Doc.

Download and install the Chrome Scraper extension. Go to this page: http://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films. It includes a list of Academy Award-winning films in a table. Select the first row of the table, right-click (Ctrl-click on a Mac) and choose Scrape Similar. You should see the entire table, and you can easily export it to Google Docs with the button.


Notice the XPath description. This is the code the scraper used to find the table.

//div[4]/table[1]/tbody/tr[td]

It found the first table in the fourth div and extracted the elements in the tds. This method did not get the links. To do that, let’s try just right-clicking on the first item in the table (unselect the entire row first). This should find all the links. Take a look at the difference in the XPath code:

//td/i/a

This technique finds all the links within <i> tags in the tds.
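Small changes to an XPath expression change what gets scraped. A few illustrative patterns (the exact element names and indexes depend on how the page you are scraping is marked up, so treat these as examples rather than paths that will work everywhere):

• //div[4]/table[2]/tbody/tr[td] would target the second table in that same div instead of the first.
• //td/a would grab every link inside a table cell, whether or not it sits inside an <i> tag.
• //table//tr/td[1] would pull just the first cell of each row of a table.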


You can learn more about XPath syntax at http://www.w3schools.com/xpath/xpath_syntax.asp.

Import.io

Import.io offers Web-based scraping and API creation. This is a great site. You can input a URL and get the resulting data. You can download it, or you can create an API that allows you to access it live in an application.


Using the Import.io App
Tutorial by Becky Larson

Import.io (https://www.import.io/) is a great tool for extracting data from a website. The Web version of import.io is very powerful: simply include a URL and let it find the data. But if you have more advanced requirements, like scraping data from more than one page throughout a site, you can use the Import.io application.

1. Download the app from the website for your platform - https://www.import.io/download/download-info/

2. Open the import.io desktop app.

3. Click New

a. Choose which option you want, whether the regular “magic” setting, the extractor, or the crawler. We’ll use the crawler.

4. Navigate to a page - Import.io will open a new tab, the top half of which will look sort of like a normal browser. It will tell you to open a page from which you’d like to extract data.

a. Open the page you’d like to go to in a regular browser and copy and paste the URL into the address bar in the tab import.io has opened.

i. Which page you use will depend on what you want the crawler to do – see step 7 below and decide what kind of data you need before choosing your URL.

5. Click “I’m there” in the lower right corner of the crawler window once the page has opened.


6. Click “Detect optimal settings” - Import.io will try to detect optimal settings to display your page, which means it may try to turn off JavaScript or other advanced functions.

a. If pieces of your page you were hoping to capture or see do not load, click the option saying so in the lower right crawler window – import.io will turn those functions back on.

7. Tell import.io whether the page you’re on gives data for one thing or for many.

a. In their example on YouTube, the creator is crawling a clothing site and is using product pages for specific clothing items to train the crawler. In this example, the page is giving data on one item (one piece of clothing).

b. In my example of crawling the SXSW Schedule, I am using pages for specific panels, so same thing: each page has data for ONE thing. It might be lots of data, but it’s for one panel or one item.

i. If I only needed data that was available on the main schedule page, I could crawl from there and choose the “many items” option, but I want all the specific data each panel page gives me.

ii. If you were on a page that looked more like a table or list (like the main SXSW schedule page) and included data for many different items, you would choose that option. Instructions from here will follow the “one item” format.

8. Yay! Start training!

a. Essentially we are teaching the crawler what we want from this type of page, and eventually it will have learned what we need and will crawl the entire website to gather data from all pages matching that description.

9. Click “add column”, give the column a name and tell the crawler what kind of data this is – this is your first data point.

a. In my SXSW example, my first column will be the panel name. That’s important data to have right off the bat. This column is just “text”.

10. Once you have a column created, highlight what piece of data on the page you want to be associated with that column (in my example, I would highlight the panel name where it appears on the page) and click “Train”.

a. Once you hit train, the value from the page will appear under the column name in the lower left crawler window.

11. Columns with multiple data points

a. Sometimes a column may have multiple pieces of data associated with it – in the SXSW example, each panel may have up to four speakers. You could create multiple columns – speaker 1, speaker 2, etc. – and have some without data, or you can gather all that information in one column.

b. To have multiple entries, highlight the first piece of data and train the column, then highlight the second piece and train, and so on until you have each piece. They should all be similar types of data (like all the presenter names); don’t try to gather different types in one column.

12. Continue to create columns, highlight the associated data and train those columns until you have everything you need on the page.

13. At this point, click “I have what I need”.

a. Import.io will prompt you to go to another page.

14. Navigate through import.io to another page like the one you just entered (another panel page, or another product page from the clothing site example). Don’t just copy and paste in another page from your browser. Letting the crawler follow your navigation helps it understand how to reach the other pages it needs on the site.

15. Once you’re there, hit “I’m there” and begin the process again.

a. After the first page, much of your data will import automatically. It’s important to check through all your columns to make sure the right data has been selected. If it hasn’t, or if the selection is blank, simply click on the column, highlight the right data from the page and click “Train”.

16. Keep adding pages until you have a minimum of five.

17. Click “Done training”.

18. You’ll now have the option to upload the crawler to import.io – go ahead and click to do so.

19. You have some advanced options for running your crawler – what kinds of URLs you want it to look for, etc. – but for right now just go ahead and click run.

a. This will take a while, depending on your Internet connection.

b. Watch as the crawler finds this information on all pages on the site!

c. When it is finished, you will have the option to download the data to a CSV or JSON file.
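When the crawl finishes, each row of the downloaded file corresponds to one page the crawler visited, with one field per column you trained. The snippet below is purely illustrative (the column names and values are made up from the SXSW example, and how multiple values in one column are delimited depends on the export):

panel_name,speakers
"A Hypothetical Panel on Data Journalism","Speaker One; Speaker Two"
"Another Hypothetical Panel","Speaker Three"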


More Scraping Tools and Resources

There are many other tools that can be used effectively to scrape data from Web pages. Here are a few additional resources:

• OutWit Hub – a Firefox extension and Desktop program that can provide some advanced scraping capabilities

• ProPublica’s Scraping for Journalism: A Guide For Collecting Data - http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data

• Getting to Grips with ScraperWiki For Those Who Don’t Code - http://datamineruk.wordpress.com/2011/07/21/getting-to-grips-with-scraperwiki-for-those-who-dont-code/

• Web Scraping for Non-Programmers by Michelle Minkoff - http://michelleminkoff.com/outwit-needlebase-hands-on-lab/