Detailed Scraping Instructions


DETAILED SCRAPING INSTRUCTIONS

22 June 2015

Table of Contents

Introduction
Setup
How to Read the “Scrapings update” Doc
Scraping Websites
Creating a New Scraper (Type 1, Info on One Page)
Rescraping Websites (Type 1, Info on One Page)
Old Scraper No Longer Works (Type 1, Info on One Page)
Creating a New Scraper (Type 2, Links on One Page)
Rescraping Websites (Type 2, Links on One Page)
Old Scraper No Longer Works (Type 2, Links on One Page)
Creating a New Scraper (Type 3, Links on Many Pages)
Rescraping Websites (Type 3, Links on Many Pages)
Old Scraper No Longer Works (Type 3, Links on Many Pages)
Uploading Your Results to ACT
Sending Mass Emails
Creating a Mass Email
Glossary of Images

Page 2: Detailed Scraping Instructions

Introduction

Worldwide Book Drive works with hundreds of student organizations every year to bring in used books, especially college-level textbooks. We put in a lot of effort to maintain those book drives from semester to semester. However, it is also important to keep a steady inflow of new student book drives.

To do this, we regularly scrape contact information from school websites and email their student organizations. This document is intended to give detailed instructions on the procedures for scraping and emailing school websites. Many of the directions provided will also be useful for training in using OutWit Hub Pro as well as Sage Act’s mass email feature.

Setup

1) Open OutWit Hub Pro.

2) Open your browser of choice.

a. Go to drive.google.com

b. Log in as [email protected]. Password: Archer007

3) In Google Drive, open the spreadsheet file “Scrapings update”. Make sure you are on the “Main” tab.

4) In “Scrapings update” we keep detailed records of when each school was scraped. This is broken down by time of year. You may see columns such as “Fall 2014”, “Winter 2015”, “Spring 2015”, etc. The dated column immediately to the left of the “Scraper Name” column should be the dates of the most recent scrapes. Find that column and scroll down until you find empty cells. The schools in those rows have not been scraped recently.

How to Read the “Scrapings update” Doc

This document contains a great deal of information. Much of the information, especially the columns farther to the right, is not relevant to your task of scraping.

The columns that you will reference during scraping are “ST”, “School”, “Doing a book Drive?”, “Website URL:”, “Rank”, the most recent dated columns, “Scraper Name”, and the most recent Emails columns. Below are descriptions of these columns.

ST: Provides the state or Canadian province of the school

School: Provides the name of the college or university


Doing a book Drive?: Indicates whether or not the school is listed in the “Schools from ACT” tab. If “Yes,” it means we are probably already doing a book drive at this school, and therefore do not need to scrape the school at this time.

Website URL: Indicates the most recent website wherein information on the school’s student organizations was found. These websites are updated or changed regularly, so the most recent URL we have in this column MAY NOT be the URL you need to scrape, but it is a good place to start.

Rank: Indicates the expected ease of scraping and the quality of information.

1: Easily scraped and includes all the information we want (club name, contact name, email address, phone number, etc.)
2: Includes all the information we want, but is more difficult to scrape; or, is easily scraped, but is missing the contact name
3: Includes all the information we want but is extremely difficult to scrape; or, is scrapable but is missing key information, such as club name or email address
4: Impossible to scrape or very nearly so, AND is missing key information, but may someday be updated with usable, scrapable information
5: Impossible to scrape, missing key information, and not expected to update with usable, scrapable information
ND: No data; can’t even tell if the school HAS any student organizations

Dated columns: These are usually labeled with a season and a year, such as “Spring 2015” or “Winter 2014”. These indicate the dates when the schools were scraped. Some of these columns are hidden; only the most recent two or three columns are truly relevant to your purposes. If a school has not been scraped in more than 3 months, it’s time to recontact them.

Scraper Name: We keep a folder in Dropbox with nearly all of the old scrapers. Most of these will still be usable, assuming the school’s website hasn’t been updated.

Emails: There are several Email columns, all of which should include a date or season, such as “Email Winter 2015” or “Spring 2015”. These indicate the number of emails scraped from that school. This is useful for tracking the total number of emails you have scraped in a day, and for prioritizing which schools to scrape. Schools that have historically provided more emails are usually best scraped first. Mass emailing is a numbers game; the more emails you can send out, the more responses you will get. That doesn’t mean you should ignore the small schools (in fact, we tend to have more positive responses from small schools), just that if you only have time to scrape a small number of schools, you should prioritize the ones with more contacts.
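The selection rules described above (skip schools already doing a book drive, pick the ones with an empty cell in the most recent dated column, and favor schools that historically yielded more emails) can be sketched in Python. This is only an illustration, assuming the “Main” tab has been exported to a list of rows; the function name, the sample schools, and the column labels “Spring 2015” and “Email Winter 2015” are hypothetical stand-ins for whatever the current columns are called.

```python
def schools_to_scrape(rows, latest_col, emails_col):
    """Return schools due for scraping, largest historical email count first."""
    due = [
        row for row in rows
        if row.get("Doing a book Drive?") != "Yes"      # already working with us
        and not (row.get(latest_col) or "").strip()     # empty cell: not scraped recently
    ]
    return sorted(due, key=lambda row: int(row.get(emails_col) or 0), reverse=True)


# Hypothetical rows, made up for illustration.
rows = [
    {"ST": "IL", "School": "School A", "Doing a book Drive?": "No",
     "Spring 2015": "", "Email Winter 2015": "40"},
    {"ST": "OH", "School": "School B", "Doing a book Drive?": "Yes",
     "Spring 2015": "", "Email Winter 2015": "300"},
    {"ST": "TX", "School": "School C", "Doing a book Drive?": "No",
     "Spring 2015": "", "Email Winter 2015": "120"},
    {"ST": "NY", "School": "School D", "Doing a book Drive?": "No",
     "Spring 2015": "4/2/2015", "Email Winter 2015": "500"},
]

for row in schools_to_scrape(rows, "Spring 2015", "Email Winter 2015"):
    print(row["ST"], row["School"])
```

Sorting by email count only decides the order; small schools stay on the list and are simply scraped later.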


Fig. 1 "Scrapings update" Doc Main tab with all relevant columns displayed

In addition to the “Main” tab, where you will primarily be working, the “Scrapings update” document also has a “Schools from ACT” tab. This tab must be updated manually. Every time we receive a new book drive sign-up, that club’s contact is added to the “Schools from ACT” list, along with the name and state of the school and the date they signed up. The blue columns are filled in manually; the purple columns then update automatically. At the same time, the “Doing a book Drive?” column in the “Main” tab automatically switches from “No” to “Yes.”

When a book drive wraps up, their information must be deleted from the “Schools from ACT” tab. This will switch the “Doing a book Drive?” column in the “Main” tab back to “No.” In this way, we always have a good idea of which schools are already working with us, and which ones should be recontacted.
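The spreadsheet formula behind the “Doing a book Drive?” column is not shown in this document. As a rough sketch of the behavior described above, and assuming the lookup matches on state plus school name (an assumption, as are the function and variable names), the logic might look like this in Python:

```python
def doing_book_drive(state, school, schools_from_act):
    """Return "Yes" if the school appears in the "Schools from ACT" list, else "No"."""
    listed = {(st.strip().upper(), name.strip().lower()) for st, name in schools_from_act}
    return "Yes" if (state.strip().upper(), school.strip().lower()) in listed else "No"


# A new sign-up is added to the "Schools from ACT" tab.
schools_from_act = [("IL", "Illinois Central College")]
print(doing_book_drive("IL", "Illinois Central College", schools_from_act))

# The book drive wraps up, so its row is deleted from the tab.
schools_from_act.clear()
print(doing_book_drive("IL", "Illinois Central College", schools_from_act))
```

The first call prints Yes and the second prints No, mirroring how the “Main” tab switches back once the row is deleted.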

Fig. 2 "Schools from ACT" tab in the "Scrapings update" Doc


Scraping Websites

You will quickly discover that scrapable websites break down into three categories:

Type 1: Sites wherein the contact information for all the clubs is contained on a single page
Type 2: Sites in which each club has its own webpage with contact information, and the list of links is contained on a single page
Type 3: Sites that give each club its own webpage with contact information, but the links to those clubs are spread across multiple pages

These three types of sites each require their own unique approaches to scrape. It will speed up your scraping if you learn to quickly identify which type of website you are scraping. Below are visual samples of the three types of websites you can expect to encounter.


Fig. 3 Sample of Type 1 Website (Info on One Page)


Fig. 4 Sample of Type 2 Website (Links on One Page)


Fig. 5 Sample of Type 3 Website (Links on Many Pages)


Creating a New Scraper (Type 1, Info on One Page)

This is the method for creating a new scraper for websites that contain all the club contact information on a single page.

1) In the “Scrapings update” document, click on the school’s webpage in the “Website URL” column to open the page in a new tab in your browser.

2) Examine the webpage to ensure that it works and appears to be up-to-date.
3) Copy the URL.
4) Go to OutWit and paste the school’s URL into the URL bar.
5) Hit Enter. OutWit will open the page.
6) Click on the “scrapers” tab.
7) Click “New”.
8) Choose a descriptive title for the scraper.
a. We usually use the format “[Two-Letter State Code] – [Name of School]”. For example, Illinois Central College would be “IL – Illinois Central College”.
b. Click “Okay”.
9) Go to the “page” tab.
10) Find a sample club name, such as “Anime Club” or “Black Student Union”. Copy that name.
11) Go to the “scrapers” tab.
12) Paste the sample name into the “Find” section and hit Enter.
13) OutWit will find the club in the source code for you. Observe the coding before and after the club name. There are common markers you may be able to use, such as <h2>, <div>, </p>, <span>, etc. You can take advantage of the consistency of the coding to pick out the bits of information you want. NOTE: You do not need to use the code directly adjacent to the information you want. For example, if the code looks like <h1><title><strong><b> Chess Club </b></strong></title></h1>, you can simply use <h1> and </h1> as the markers before and after.

14) In the “Description” column, type “Department”. NOTE: Department is what student organizations are called in ACT.

15) In the “Marker Before” column, type a piece of code that appears before the club name. Keep the chunk of code as short as possible to begin with. You can expand the marker later if necessary.

16) In the “Marker After” column, type a piece of code that appears after the club name. Keep the chunk of code as short as possible to begin with. You can expand the marker later if necessary.

17) Click “Execute” to test your scraper so far. This will automatically take you to the “scraped” tab. There should be a nice column of club names. If the scraper is grabbing a small amount of extra, non-club-name information, that is acceptable; you can clean up the information later in Excel.


18) If there is TOO MUCH extra information or too many extra columns, your Markers Before and After are too general. Go to the “scrapers” tab. Expand either your Marker Before or your Marker After. For example, you might change your Marker Before from <span> to <title><span>. Click “Execute” to test your scraper.

Fig. 6 Example of what you may see if your markers are too general

19) If the scraper is NOT getting all the club names, your Markers Before and After are too specific. (NOTE: There may alternatively be a sneaky piece of code that is actively preventing you from scraping the information. You can’t usually do anything about that issue.) Go to the “scrapers” tab. Alter your Marker Before or your Marker After. For example, you could change your Marker After from </span> to </b>. Be sure to base this change on the coding of the site. Click “Execute” to test your scraper.

20) Once you are satisfied with the way your scraper is collecting club names, repeat steps 14 through 19 for “Contact” and “Email”. Your finished scraper should look something like this:

Fig. 7 Sample scraper


21) Click “Execute”. This will run the scraper and automatically take you to the “scraped” tab. Proceed to “Uploading Your Results to ACT”.
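To make the “Marker Before” / “Marker After” idea in the steps above concrete, here is a minimal Python sketch of marker-based extraction. It is not OutWit’s actual implementation; the function name and the sample page source are hypothetical. It shows how a marker that is too general picks up extra text, and how expanding the marker narrows the results.

```python
def scrape_between(source, marker_before, marker_after):
    """Collect every piece of text found between marker_before and marker_after."""
    results = []
    position = 0
    while True:
        start = source.find(marker_before, position)
        if start == -1:
            break
        start += len(marker_before)
        end = source.find(marker_after, start)
        if end == -1:
            break
        results.append(source[start:end].strip())
        position = end + len(marker_after)
    return results


# Hypothetical page source, made up for illustration.
page = (
    "<span>Student Life</span>"
    "<h2><span>Anime Club</span></h2>"
    "<h2><span>Black Student Union</span></h2>"
)

# Too general: also grabs the "Student Life" heading.
print(scrape_between(page, "<span>", "</span>"))

# Expanded Marker Before: only the club names remain.
print(scrape_between(page, "<h2><span>", "</span>"))
```

Starting with the shortest marker and lengthening it only when extra text appears follows the same order of work as the steps above.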

Rescraping Websites (Type 1, Info on One Page)

This is the method for reusing a previously-built scraper for websites that contain all the club contact information on a single page.

1) In the “Scrapings update” document, click on the school’s webpage in the “Website URL” column to open the page in a new tab in your browser.

2) Examine the webpage to ensure that it works and appears to be up-to-date.
3) Copy the URL.
4) Go to OutWit and paste the school’s URL into the URL area.
5) Hit Enter. OutWit will open the page.
6) Click on the “scrapers” tab. Here, you can see the source code of the website, as well as create or upload scrapers to scan the code for pieces of information.
7) Click “Import”. This will open a menu.

Fig. 8 Importing a previously-built scraper


8) In the leftmost column, select “Dropbox”, then scroll down to the “Scrapers Backup” folder and double-click to open it.

9) Find the scraper for your school. Usually the scraper will be saved as “[Two-Letter State Code] – [Name of School]”. For example, Illinois Central College would be saved as “IL – Illinois Central College”. Consult the “Scraper Name” column in the “Scrapings update” document if you have trouble finding the scraper. If there is no scraper for this school, follow “Creating a New Scraper (Type 1, Info on One Page)” instead.

10) Double-click to open the scraper. The scraper will probably still work, unless the website has been updated or changed since the scraper was created.

11) Click “Execute”. OutWit will automatically take you to the “scraped” tab.
12) If the scraper still works, you should see the results laid out in nice easy columns. If that is the case, proceed to “Uploading Your Results to ACT”.
13) If the scraper no longer works properly, proceed to “Old Scraper No Longer Works (Type 1, Info on One Page)”.
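The naming convention used for saved scrapers can be expressed as a small Python helper. This is a sketch for illustration only: the function names are hypothetical, and the list of saved names stands in for the contents of the “Scrapers Backup” folder.

```python
EN_DASH = "\u2013"  # the dash used in names such as "IL – Illinois Central College"


def scraper_name(state_code, school_name):
    """Build a name in the "[Two-Letter State Code] – [Name of School]" format."""
    code = state_code.strip().upper()
    if len(code) != 2 or not code.isalpha():
        raise ValueError("expected a two-letter state or province code")
    return f"{code} {EN_DASH} {school_name.strip()}"


def find_scraper(saved_names, state_code, school_name):
    """Return the saved scraper matching the convention, or None if there is none."""
    wanted = scraper_name(state_code, school_name).casefold()
    for name in saved_names:
        if name.casefold() == wanted:
            return name
    return None


saved = [scraper_name("IL", "Illinois Central College")]
print(find_scraper(saved, "il", "Illinois Central College"))
print(find_scraper(saved, "OH", "Some Other School"))  # None: create a new scraper
```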

Old Scraper No Longer Works (Type 1, Info on One Page)

This is the method for modifying previously-built scrapers to work on websites that contain all the club contact information on a single page.

1) Oh no, the old scraper didn’t work! There are many reasons why this may happen. Don’t panic! We can usually get the old scraper to work with just minor adjustments.

2) Go to the “scrapers” tab.
3) Check that the “Name of Scraper” is correct. Sometimes OutWit will open the wrong scraper, or perhaps you simply selected the wrong one. If so, switch to the correct scraper.

4) Check that the “Apply If Page URL Contains” section matches the page’s URL. In particular, pay attention to whether it should be http:// or https://. Adjust the “Apply if Page URL Contains” section as necessary.


Fig. 9 Actual page URL and the URL mapped to the scraper do not match

5) If the scraper is functioning but only grabbing some of the information (for example, it is finding the club name and contact name but not the email address), we must adjust the Marker Before and Marker After columns to zero in on the information we want. Go to the “page” tab and find a sample of the information that is not being scraped. For example, if the scraper is not finding any email addresses, find an email address for a club contact on the page.
6) Copy the sample information.
7) Go to the “scrapers” tab.
8) Paste the sample information into the “Find” section.
9) Hit Enter. OutWit will search the source code for that information.
10) If OutWit can find the sample with the Find feature, that’s good news! The information is in the source code and probably scrapable; we just need to adjust the “Marker Before” or “Marker After” columns. Consult those two columns now.


Fig. 10 Using the Find feature to search the source code

11) Find markers before and after your sample information that you can use. Copy those markers and paste them into the respective columns in the scraper.

12) Click “Execute” to test the scraper. You will be automatically taken to the “scraped” tab.
13) If the scraper now appears to be working properly, congratulations! Proceed to “Uploading Your Results to ACT”.
14) If the scraper still is not gathering the information you want, return to the “scrapers” tab and experiment with different markers.
15) If you still cannot get the scraper to catch the information you want, there may be an issue with the website, not your scraper. Go to the “Scrapings update” document and bump the site’s Rank up by one. For example, if the site was ranked at 1, change it to 2. In a couple of weeks, return to this website and try again.
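The “Apply If Page URL Contains” check from the steps above can be pictured as a simple substring test. The Python sketch below assumes that is how the field behaves, which the field’s label suggests but this document does not confirm; the function names and URLs are hypothetical. It also shows the http:// versus https:// mismatch that commonly stops an old scraper from applying.

```python
def scraper_applies(page_url, url_fragment):
    """True if the scraper's URL fragment occurs somewhere in the page URL."""
    return url_fragment in page_url


def suggest_fragment(page_url, url_fragment):
    """Return a fragment that matches the page, fixing an http/https mismatch if needed."""
    if scraper_applies(page_url, url_fragment):
        return url_fragment
    for old, new in (("http://", "https://"), ("https://", "http://")):
        if url_fragment.startswith(old):
            swapped = new + url_fragment[len(old):]
            if scraper_applies(page_url, swapped):
                return swapped
    return None  # the mismatch is something other than the scheme


page_url = "https://www.example.edu/student-life/clubs"
old_fragment = "http://www.example.edu/student-life"

print(scraper_applies(page_url, old_fragment))
print(suggest_fragment(page_url, old_fragment))
```

Here the old fragment does not apply because the site moved to https://, and the suggestion is the same fragment with the scheme corrected.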


Creating a New Scraper (Type 2, Links on One Page)

This is the method for creating a new scraper for websites that contain all the links to the club contact info on a single page.

1) In the “Scrapings update” document, click on the school’s webpage in the “Website URL” column to open the page in a new tab in your browser.

2) Examine the webpage to ensure that it works and appears to be up-to-date.
3) Copy the URL.
4) Go to OutWit and paste the school’s URL into the URL area.
5) Hit Enter. OutWit will open the page.
6) Click on the “scrapers” tab. Here, you can see the source code of the website, as well as create or upload scrapers to scan the code for pieces of information.
7) Click “New”.
8) Choose a descriptive title for the scraper.

a. We usually use the format “[Two-Letter State Code] – [Name of School]”. For example, Illinois Central College would be “IL – Illinois Central College”.

b. Click “Okay”.
9) Go to the “page” tab.
10) Click one of the links to go to a club page.
11) Find the club name on the page and copy it.
12) Go to the “scrapers” tab.
13) Paste the club name into the “Find” bar. Hit Enter.
14) OutWit will find the club name in the source code. Observe the coding before and after the club name. NOTE: You do not need to use the code directly adjacent to the information you want. For example, if the code looks like <h1><title><strong><b> Chess Club </b></strong></title></h1>, you can simply use <h1> and </h1> as the markers before and after.

15) In the “Description” column, type “Department”. NOTE: Department is what student organizations are called in ACT.

16) In the “Marker Before” column, type a piece of code that appears before the club name. Keep the chunk of code as short as possible to begin with. You can expand the marker later if necessary.

17) In the “Marker After” column, type a piece of code that appears after the club name. Keep the chunk of code as short as possible to begin with. You can expand the marker later if necessary.

18) Click “Execute” to test the scraper. You will be automatically taken to the “scraped” tab.
19) If the scraper is working properly, it should grab the club name. If you see that in a column labeled “Department”, congratulations! Repeat steps 9 through 18 for Contact and Email.

20) Go to the “page” tab.


21) Click the “back” button to return to the main page with all the links to the club pages.
22) Go to the “links” tab. In this tab, OutWit displays all the links available on the page.
23) Scroll through the links and find the links to the student organizations. Shift + select them.

Fig. 11 Club links selected and ready to be scraped

24) Right-click the selection and select “Auto Explore”, “Fast Scrape”. You will be automatically taken to the “scraped” tab.

25) The scraper will scrape each link in turn. When it finishes, proceed to “Uploading Your Results to ACT”.
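Conceptually, the final steps above apply the same scraper to every selected link and collect one row per club page. The Python sketch below illustrates that loop; it is not how OutWit’s “Fast Scrape” is implemented. The function names, URLs, markers, and contact details are hypothetical, and a dictionary of sample pages stands in for fetching the real website.

```python
def first_between(source, marker_before, marker_after):
    """Return the first text found between the two markers, or "" if not found."""
    start = source.find(marker_before)
    if start == -1:
        return ""
    start += len(marker_before)
    end = source.find(marker_after, start)
    return source[start:end].strip() if end != -1 else ""


def fast_scrape(links, fetch, scraper):
    """Apply every (description, marker before, marker after) rule to each linked page."""
    rows = []
    for url in links:
        source = fetch(url)
        row = {"URL": url}
        for description, before, after in scraper:
            row[description] = first_between(source, before, after)
        rows.append(row)
    return rows


# Hypothetical club pages standing in for the real website.
pages = {
    "https://www.example.edu/clubs/anime":
        '<h1>Anime Club</h1><p class="contact">Pat Lee</p>'
        '<a href="mailto:anime@example.edu">Email</a>',
    "https://www.example.edu/clubs/chess":
        '<h1>Chess Club</h1><p class="contact">Sam Roe</p>'
        '<a href="mailto:chess@example.edu">Email</a>',
}

scraper = [
    ("Department", "<h1>", "</h1>"),
    ("Contact", '<p class="contact">', "</p>"),
    ("Email", "mailto:", '"'),
]

rows = fast_scrape(list(pages), pages.get, scraper)
for row in rows:
    print(row["Department"], row["Contact"], row["Email"])
```

Because every club page is scraped with the same markers, the scraper only needs to be built and tested on one club page before it is run across all the links.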

Rescraping Websites (Type 2, Links on One Page)

This is the method for reusing a previously-built scraper for websites that contain all the links to the club contact info on a single page.

1) In the “Scrapings update” document, click on the school’s webpage in the “Website URL” column to open the page in a new tab in your browser.

2) Examine the webpage to ensure that it works and appears to be up-to-date.
3) Copy the URL.
4) Go to OutWit and paste the school’s URL into the URL area.
5) Hit Enter. OutWit will open the page.
6) Click on the “scrapers” tab. Here, you can see the source code of the website, as well as create or upload scrapers to scan the code for pieces of information.
7) Click “Import”. This will open a menu.
8) In the leftmost column, select “Dropbox”, then scroll down to the “Scrapers Backup” folder and double-click to open it.


9) Find the scraper for your school. Usually the scraper will be saved as “[Two-Letter State Code] – [Name of School]”. For example, Illinois Central College would be saved as “IL – Illinois Central College”. Consult the “Scraper Name” column in the “Scrapings update” document if you have trouble finding the scraper. If there is no scraper for this school, follow “Creating a New Scraper (Type 2, Links on One Page)” instead.

10) Double-click to open the scraper. The scraper will probably still work, unless the website has been updated or changed.

11) Go to the “links” tab. In this tab, OutWit displays all the links available on the page.
12) Scroll through the links and find the links to the student organizations. Shift + select them.
13) Right-click the selection and click “Auto Explore”, “Fast Scrape”. You will be automatically taken to the “scraped” tab.
14) The scraper will scrape each link in turn. If the scraper still works, wait until it finishes, then proceed to “Uploading Your Results to ACT”. If the scraper no longer works, proceed to “Old Scraper No Longer Works (Type 2, Links on One Page)”.

Old Scraper No Longer Works (Type 2, Links on One Page)

This is the method for modifying previously-built scrapers to work on websites that contain all the links to the club contact info on a single page.

1) Oh no, the old scraper didn’t work! There are many reasons why this may happen. Don’t panic! We can usually get the old scraper to work with just minor adjustments.

2) Go to the “scrapers” tab.
3) Check that the “Name of Scraper” is correct. Sometimes OutWit will open the wrong scraper, or perhaps you simply selected the wrong one. If so, switch to the correct scraper.

4) Check that the “Apply If Page URL Contains” section matches the URLs for the club links. NOTE: The “Apply if Page URL Contains” does NOT need to match the URL of the main page; it must match the URLs of the actual links containing the club information.

5) If the scraper is functioning but only grabbing some of the information (for example, it is finding the club name and contact name but NOT the email address), you must adjust the Marker Before and Marker After columns to zero in on the information you want. Go to the “page” tab. Click on one of the links to a club page to go to that page.
6) Go to the “scrapers” tab.
7) Click “Execute” to test the scraper. This will automatically take you to the “scraped” tab. Take note of which pieces of information the scraper is grabbing.


8) Go to the “page” tab. Find a sample piece of information the scraper is failing to find. For example, if the scraper is not finding the club name, find the club name on the page. Copy that sample info.

9) Go to the “scrapers” tab.
10) Paste the sample info into the “Find” bar. Hit Enter. OutWit will search the source code for instances of the sample info.

Fig. 12 Using the Find feature to search for sample info in the code

11) Observe the code before and after the sample info. Modify the Marker Before and/or Marker After columns as necessary so that the scraper will capture that information.

12) Click “Execute” to test the scraper. You will automatically be taken to the “scraped” tab.
13) If you have successfully fixed the scraper, it should now display the club name, contact name, and email address. If so, congratulations! Click the “back” button to return to the main page containing all the links to the club pages.


14) If the scraper is still not capturing the information you need, go to the “scrapers” tab and continue to modify the Markers Before and Markers After columns and click “Execute” to test the scraper until you get the results you are looking for.

15) Go to the “links” tab.
16) Scroll through the links and find the links to the student organizations. Shift + select them.
17) Right-click the selection and click “Auto Explore”, “Fast Scrape”. You will be automatically taken to the “scraped” tab.
18) The scraper will scrape each link in turn. When it finishes, proceed to “Uploading Your Results to ACT”.
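The “take note of which pieces of information the scraper is grabbing” step above amounts to checking which columns come back empty. A small Python sketch of that check follows, assuming the scraped results are available as rows with “Department”, “Contact”, and “Email” columns; the function name and sample rows are hypothetical.

```python
def missing_fields(rows, fields=("Department", "Contact", "Email")):
    """Count, per field, how many scraped rows came back empty."""
    return {
        field: sum(1 for row in rows if not (row.get(field) or "").strip())
        for field in fields
    }


# Hypothetical results from a scraper whose Email markers no longer match.
scraped = [
    {"Department": "Anime Club", "Contact": "Pat Lee", "Email": ""},
    {"Department": "Chess Club", "Contact": "Sam Roe", "Email": ""},
]

report = missing_fields(scraped)
for field, count in report.items():
    if count:
        print(f"{field}: empty in {count} of {len(scraped)} rows; check its markers")
```

A field that is empty in every row usually points at its Marker Before or Marker After, while a field that is empty in only a few rows may simply be missing from those club pages.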

Creating a New Scraper (Type 3, Links on Many Pages)

This is the method for creating a new scraper for websites where the links to the club contact info are spread across multiple pages.

1) In the “Scrapings update” document, click on the school’s webpage in the “Website URL” column to open the page in a new tab in your browser.

2) Examine the webpage to ensure that it works and appears to be up-to-date.
3) Copy the URL.
4) Go to OutWit and paste the school’s URL into the URL area.
5) Hit Enter. OutWit will open the page.
6) Click on the “scrapers” tab. Here, you can see the source code of the website, as well as create or upload scrapers to scan the code for pieces of information.
7) Click “New”.
8) Choose a descriptive title for the scraper. We usually use the format “[Two-Letter State Code] – [Name of School]”. For example, Illinois Central College would be “IL – Illinois Central College”. Click “Okay”.

9) Go to the “page” tab.
10) Click one of the links to go to a club page.
11) Find the club name on the page and copy it.
12) Go to the “scrapers” tab.
13) Paste the club name into the “Find” bar. Hit Enter.
14) OutWit will find the club name in the source code. Observe the coding before and after the club name.
15) In the “Description” column, type “Department”. NOTE: Department is what student organizations are called in ACT.
16) In the “Marker Before” column, type a piece of code that appears before the club name. Keep the chunk of code as short as possible to begin with. You can expand the marker later if necessary.


17) In the “Marker After” column, type a piece of code that appears after the club name. Keep the chunk of code as short as possible to begin with. You can expand the marker later if necessary.

18) Click “Execute” to test the scraper. You will be automatically taken to the “scraped” tab.
19) If the scraper is working properly, it should grab the club name. If you see that in a column labeled “Department”, congratulations! Repeat steps 9 through 18 for Contact and Email.

20) Go to the “page” tab.
21) Click the “back” button to return to the main page with all the links to the club pages.
22) Go to the “links” tab. In this tab, OutWit displays all the links available on the page.
23) Uncheck the box next to “Empty”; or, in more updated versions of OutWit, use the dropdown menu to select “Empty on Demand”. The default setting has OutWit delete the links from the previous page every time you go to a new webpage. Adjusting this setting means that the links from the page you are currently on will be saved when you move to a new page.
24) Go to the “page” tab.
25) Click through each page that contains links to student organizations. This may mean clicking “Next” until you reach the end of the list, or clicking through each category of clubs, such as “Academic”, “Fraternity/Sorority”, “Special Interest”, etc. Doing this will gather the links you need to scrape.
26) Go to the “links” tab.
27) Scroll through the links and find the links to the student organizations. Shift + select them.
28) Right-click the selection and select “Auto Explore”, “Fast Scrape”. You will be automatically taken to the “scraped” tab.
29) The scraper will scrape each link in turn. When it finishes, proceed to “Uploading Your Results to ACT”.
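The effect of the “Empty on Demand” setting in the steps above is that links keep accumulating as you click through the listing pages, instead of being cleared on every page load. The Python sketch below illustrates that idea with the standard library’s HTML parser. The class name, the sample listing pages, and the choice to skip duplicate links are assumptions made for illustration, not a description of OutWit’s internals.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Accumulates href values across pages until explicitly emptied."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href not in self.links:  # skipping duplicates is an assumption
                self.links.append(href)

    def empty(self):
        """Clear the collected links, like emptying the "links" tab on demand."""
        self.links.clear()


# Hypothetical listing pages: page 1 links to page 2 through a "Next" link.
listing_pages = [
    '<a href="/clubs/anime">Anime Club</a><a href="/clubs?page=2">Next</a>',
    '<a href="/clubs/chess">Chess Club</a><a href="/clubs/anime">Anime Club</a>',
]

collector = LinkCollector()
for source in listing_pages:
    collector.feed(source)  # links from earlier pages are kept, not emptied

# Keep only the links that point at individual club pages.
club_links = [link for link in collector.links if link.startswith("/clubs/")]
print(club_links)
```

Filtering the collected links at the end plays the same role as scrolling through the “links” tab and selecting only the student organization links.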


Rescraping Websites (Type 3, Links on Many Pages)

This is the method for reusing a previously-built scraper for websites where the links to the club contact info are spread across multiple pages.

1) In the “Scrapings update” document, click on the school’s webpage in the “Website URL” column to open the page in a new tab in your browser.

2) Examine the webpage to ensure that it works and appears to be up-to-date.
3) Copy the URL.
4) Go to OutWit and paste the school’s URL into the URL area.
5) Hit Enter. OutWit will open the page.
6) Click on the “scrapers” tab. Here, you can see the source code of the website, as well as create or upload scrapers to scan the code for pieces of information.
7) Click “Import”. This will open a menu.
8) In the leftmost column, select “Dropbox”, then scroll down to the “Scrapers Backup” folder and double-click to open it.
9) Find the scraper for your school. Usually the scraper will be saved as “[Two-Letter State Code] – [Name of School]”. For example, Illinois Central College would be saved as “IL – Illinois Central College”. Consult the “Scraper Name” column in the “Scrapings update” document if you have trouble finding the scraper. If there is no scraper for this school, follow “Creating a New Scraper (Type 3, Links on Many Pages)” instead.

10) Double-click to open the scraper. The scraper will probably still work, unless the website has been updated or changed.

11) Go to the “links” tab.
12) Uncheck the box next to “Empty”; or, in more updated versions of OutWit, use the dropdown menu to select “Empty on Demand”. The default setting has OutWit delete the links from the previous page every time you go to a new webpage. Adjusting this setting means that the links from the page you are currently on will be saved when you move to a new page.
13) Go to the “page” tab.
14) Click through each page that contains links to student organizations. This may mean clicking “Next” until you reach the end of the list, or clicking through each category of clubs, such as “Academic”, “Fraternity/Sorority”, “Special Interest”, etc. Doing this will gather the links you need to scrape.
15) Go to the “links” tab.
16) Scroll through the links and find the links to the student organizations. Shift + select them.
17) Right-click the selection and select “Auto Explore”, “Fast Scrape”. You will be automatically taken to the “scraped” tab.


18) The scraper will scrape each link in turn. If the scraper still works, wait until it finishes, then proceed to “Uploading Your Results to ACT”. If the scraper no longer works, proceed to “Old Scraper No Longer Works (Type 3, Links on Many Pages)”.

Old Scraper No Longer Works (Type 3, Links on Many Pages)

This is the method for modifying previously-built scrapers to work on websites where the links to the club contact info are spread across multiple pages.

1) Oh no, the old scraper didn’t work! There are many reasons why this may happen. Don’t panic! We can usually get the old scraper to work with just minor adjustments.

2) Go to the “scrapers” tab.
3) Check that the “Name of Scraper” is correct. Sometimes OutWit will open the wrong scraper, or perhaps you simply selected the wrong one. If so, switch to the correct scraper.

4) Check that the “Apply If Page URL Contains” section matches the URLs for the club links. NOTE: The “Apply if Page URL Contains” does NOT need to match the URL of the main page; it must match the URLs of the actual links containing the club information.

5) If the scraper is functioning but only grabbing some of the information—for example, it is finding the club name and contact name but not the email address—we must adjust the Marker Before and Market After columns to try and zero in on the information we want. Go to the “page” tab. Click on one of the links to a club page to go to that page.

6) Go to the “scrapers” tab.

7) Click “Execute” to test the scraper. This will automatically take you to the “scraped” tab. Take note of which pieces of information the scraper is grabbing.

8) Go to the “page” tab. Find a sample piece of information the scraper is failing to find. For example, if the scraper is not finding the club name, find the club name on the page. Copy that sample info.

9) Go to the “scrapers” tab.

10) Paste the sample info into the “Find” bar. Hit Enter. OutWit will search the source code for instances of the sample info.

11) Observe the code before and after the sample info. Modify the Marker Before and/or Marker After columns as necessary so that the scraper will capture that information.

12) Click “Execute” to test the scraper. You will automatically be taken to the “scraped” tab.

13) If you have successfully fixed the scraper, it should now display the club name, contact name, and email address. If so, congratulations! Click the “back” button to return to the main page containing all the links to the club pages.

14) If the scraper is still not capturing the information you need, go to the “scrapers” tab, continue to modify the Marker Before and Marker After columns, and click “Execute” to test the scraper until you get the results you are looking for.

15) Go to the “links” tab.

16) Uncheck the box next to “Empty”; or, in more updated versions of OutWit, use the dropdown menu to select “Empty on Demand”. The default setting has OutWit delete the links every time you go to a new webpage. Adjusting this setting means that the links from the page you are currently on will be saved when you move to a new page.

17) Go to the “page” tab.

18) Click through each page that contains links to student organizations. This may mean clicking “Next” until you reach the end of the list, or clicking through each category of clubs, such as “Academic”, “Fraternity/Sorority”, “Special Interest”, etc. Doing this will gather the links you need to scrape.

19) Go to the “links” tab.

20) Scroll through the links and find the links to the student organizations. Shift + select them.

21) Right-click the selection and select “Auto Explore”, “Fast Scrape”. You will be automatically taken to the “scraped” tab.

22) The scraper will scrape each link in turn. When it finishes, proceed to “Uploading Your Results to ACT”.
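To see why adjusting the Marker Before and Marker After columns changes what the scraper captures, it can help to picture the scraper as grabbing whatever text sits between the two markers in the page's source code. The Python sketch below illustrates that idea only; it is not OutWit's actual implementation, and the function name `extract_between` and the sample source code are hypothetical.

```python
from typing import List


def extract_between(source: str, marker_before: str, marker_after: str) -> List[str]:
    """Return every piece of text found between marker_before and marker_after."""
    results: List[str] = []
    position = 0
    while True:
        start = source.find(marker_before, position)
        if start == -1:
            break  # No more occurrences of the first marker.
        start += len(marker_before)
        end = source.find(marker_after, start)
        if end == -1:
            break  # First marker found, but nothing closes it.
        results.append(source[start:end].strip())
        position = end + len(marker_after)
    return results


# Hypothetical source code for one club page.
sample_source = (
    '<h1 class="club">Chess Club</h1>'
    '<p>President: <span class="name">Cheryl Smith</span></p>'
    '<a href="mailto:chess@example.edu">Email us</a>'
)

# Specific markers capture exactly one piece of information each.
club_names = extract_between(sample_source, '<h1 class="club">', "</h1>")
emails = extract_between(sample_source, "mailto:", '"')

# Markers that are too general capture many unwanted fragments.
too_general = extract_between(sample_source, ">", "<")
```

With the specific markers, `club_names` and `emails` each contain a single value. With the overly general markers, `too_general` fills up with empty strings and stray text, which is the situation steps 11 through 14 are meant to correct.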

Fig. 13 Multi-link scrape in progress

Uploading Your Results to ACT

This section describes the procedure for transferring your scraped results in OutWit to our database in ACT via an Excel spreadsheet.

1) Shift + select the scraped results and click “Catch”. The results will appear in a section below. In newer versions of OutWit, you can export the results directly to an Excel spreadsheet, but for older versions, you have to catch the results and copy and paste them yourself.

2) Select the catch results and copy them.

3) Open a new Excel spreadsheet and paste the results.

4) Delete any unwanted columns. The columns that remain should be “Department”, “Contact”, “Email”, and possibly “Phone”.

5) Add three additional columns: “State”, “Company”, and “User 1”.

6) In the “State” column, add the state wherein the school is located.

7) In the “Company” column, add the name of the school.

8) In the “User 1” column, add your initials and the date. For example, my name is Nate Kurth and today is 22 June 2015, so I would add “nk06222015”.

9) Save the spreadsheet in Dropbox under the latest “Scrapings” folder, such as “Scrapings – Spring 2015”.

Fig. 14 Scraped information ready to upload to ACT

10) Close the spreadsheet.

11) Go to ACT. Make sure you are in the WBDv3 database. The WBDv2 database is for contacts who have signed up; the WBDv3 database is for mass emailing potential contacts.

12) Hit Alt + F12. This will open the Import wizard.

13) Select “Excel” as the type of file you wish to import.

14) Click “Browse”.

15) Find the spreadsheet in Dropbox under the most recent “Scrapings” folder.

16) Select “Open”.

17) Click “Next”.

18) Ensure that “Contact records” is selected and click “Next”.

19) Select “Custom import” and click “Next”.

20) Click “Next”.

21) On this page you can use the right arrow to make sure that your info is lining up properly: the school name should be mapped to the “Company” section in ACT, the email address should be mapped to the “Email” section, etc. Ensure that the information is mapped correctly. Also make sure that your User 1 date code is correct.

22) Click “Next”.

23) Click “Next”. (NOTE: The very first time you upload anything to ACT, you will need to adjust the settings here. You will only need to do this once. Click “Contact”. Ensure that “Contact records” is set to “Merge” on the left side and “Add” on the right side. Click “Okay”. Now you never have to worry about this part again!)

24) Click “Finish”.

25) Once the green bar fills up, your scraped results have been uploaded to ACT!

26) Go to the “Scrapings update” Doc. Enter today’s date in the most recent dated column (for example, “Spring 2015”). Enter the number of emails you scraped in the most recent Emails column (for example, “Emails Spring 2015”).
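The spreadsheet preparation in steps 5 through 8 (adding “State”, “Company”, and a “User 1” date code of lowercase initials followed by MMDDYYYY) can be sketched in Python. This is only an illustration of the format, not part of the OutWit or ACT workflow; the function names `make_date_code` and `add_act_columns` and the sample row are hypothetical.

```python
from datetime import date
from typing import Dict, List


def make_date_code(initials: str, day: date) -> str:
    """Build the User 1 code: lowercase initials followed by MMDDYYYY."""
    return initials.lower() + day.strftime("%m%d%Y")


def add_act_columns(
    rows: List[Dict[str, str]], state: str, school: str, initials: str, day: date
) -> List[Dict[str, str]]:
    """Return copies of the scraped rows with State, Company, and User 1 added."""
    code = make_date_code(initials, day)
    prepared = []
    for row in rows:
        new_row = dict(row)  # Copy so the scraped rows are left unchanged.
        new_row["State"] = state
        new_row["Company"] = school
        new_row["User 1"] = code
        prepared.append(new_row)
    return prepared


# Hypothetical scraped row using the column names from step 4.
scraped = [
    {"Department": "Chess Club", "Contact": "Cheryl Smith", "Email": "chess@example.edu"}
]
ready = add_act_columns(scraped, "TX", "Example University", "NK", date(2015, 6, 22))
```

For the example in step 8, `make_date_code("NK", date(2015, 6, 22))` gives `nk06222015`.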

Sending Mass Emails

1) Go to ACT.

2) Click the dropdown menu under “Contact Field” on the left and select “User 1”.

3) Search for your date code for that day. For example, if your initials were SD and the date was 14 February 2015, the code you should have been using for that day would have been sd02142015. This search will bring up a list of all the contacts you scraped that day.

4) There should be a column labeled “Unsubscribe”. Click the column header to sort the list by that field. (If there is no “Unsubscribe” column, you can add it. Click “Options” and select “Customize Columns” from the dropdown menu. In the left panel, labeled “Available fields”, find and select “Unsubscribe”. Click the “>>” to move the Unsubscribe field to the right panel. Click OK.)

5) Scroll to find any and all contacts that have been marked “unsubscribe”. Some lists will not contain any unsubscribed contacts.

6) Shift + Select the unsubscribe contacts.

7) Right-click the selection and click “Omit Selected Contacts”. This will clear those contacts off the list without deleting their information. We can’t delete these contacts entirely, or we would risk unknowingly rescraping their information and recontacting them. Instead, we mark them as “unsubscribe” and we make sure to clear them off the list before sending mass emails. Once you have cleared them off the list, you are ready to send the mass emails to all the other scraped contacts.

Fig. 15 Unsubscribers to be omitted from the list

8) Click “Write” and select “Mail Merge”. This will open up the email wizard.

9) Ensure that “Email” is selected and click “Next”.

10) Ensure that your mass email template is selected. If it is not, click “Browse” and find your mass email template. It should be saved on the network under “UTD-MGR”, “WBDv3-database files”, “Templates”. If you need to set up a mass email template, consult the “Creating a Mass Email” section.

11) Click “Next”.

12) Ensure that “Current lookup” is selected and click “Next”.

13) Choose a descriptive subject line for your mass email, such as “Service and Fundraiser Opportunity”. Ensure that “Email subject and message” is selected under the “Email record history type” dropdown menu and click “Next”.

14) Ensure that “Omit those records from the email mail merge” is selected and click “Next”.

15) Click “Finish”.

16) Please note that you will not be able to use ACT until the emails finish sending. Make sure that you have completed all your tasks for the day before sending out the emails.
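The logic behind steps 2 through 7 (look up the contacts with your date code, then omit the unsubscribed ones without deleting them) can be sketched in Python. This is an illustration only and does not use ACT; the function name `build_mailing_list`, the sample contacts, and the assumption that an unsubscribed contact has a non-empty “Unsubscribe” value are all hypothetical.

```python
from typing import Dict, List


def build_mailing_list(contacts: List[Dict[str, str]], date_code: str) -> List[Dict[str, str]]:
    """Return the contacts for one date code, leaving out unsubscribed contacts."""
    # Steps 2-3: look up everything scraped under this date code.
    lookup = [c for c in contacts if c.get("User 1") == date_code]
    # Steps 5-7: omit unsubscribed contacts from the list (they are not deleted).
    return [c for c in lookup if not c.get("Unsubscribe")]


# Hypothetical contacts; a non-empty "Unsubscribe" value is assumed to mean opted out.
contacts = [
    {"Contact": "Cheryl", "Email": "cheryl@example.edu", "User 1": "sd02142015", "Unsubscribe": ""},
    {"Contact": "Sam", "Email": "sam@example.edu", "User 1": "sd02142015", "Unsubscribe": "unsubscribe"},
    {"Contact": "Pat", "Email": "pat@example.edu", "User 1": "nk06222015", "Unsubscribe": ""},
]

mailing_list = build_mailing_list(contacts, "sd02142015")
```

The original `contacts` list still holds all three records afterward, which matches the reason given in step 7 for omitting rather than deleting.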

Creating a Mass Email

This section describes the procedures for creating a mass email template that you can use to reach out to multiple contacts in ACT simultaneously.

1) Go to Sage ACT.

2) Click “Write” to open the dropdown menu.

3) Select “New Letter/Email Template”. Microsoft Word will open automatically, along with an extra popup labeled “Add Mail Merge Fields”.

4) Write your email template. Be sure to include a salutation and your email signature.

5) Whenever you would naturally use a person’s name or mention their organization or school, instead put the name of that field in angle brackets.

a. For example, you can write “Hello <Contact>” when you would normally write “Hello Cheryl”. When you send the mass emails, ACT will automatically fill in those blank spaces.

b. Instead of typing the angle brackets and field yourself, you may also double-click on the field you wish to add in the “Add Mail Merge Fields” popup.

c. If the “Add Mail Merge Fields” popup does not appear automatically or if you close it and wish to reopen it, you may open it manually. Go to the “Add-Ins” tab and click the “ACT!” dropdown menu. Select “Show Field List”.

6) Save the document as “Microsoft 2003, 2007, or 2010 Template (*.adt)” in the “Templates” folder.

7) Close the document.

8) Go to ACT.

9) Find the contact listing for one of your coworkers who is currently in the office with you.

10) Send your mass email template to that coworker using the “Mail Merge” feature. Consult the “Sending Mass Emails” section to learn how to do this. This way, you can test your template prior to sending it out to hundreds of people.

11) If your coworker receives the email and it looks good, congratulations! Your template is ready to use. Proceed to the “Sending Mass Emails” section to send the email to your contacts.
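The way a template field such as <Contact> gets replaced with each contact's own information can be sketched in Python. This only illustrates the idea of a mail merge; it is not how ACT or Word performs the merge, and the function name `fill_template`, the sample template, and the sample contact are hypothetical.

```python
import re
from typing import Dict


def fill_template(template: str, contact: Dict[str, str]) -> str:
    """Replace each <Field> in the template with that contact's value."""

    def replace_field(match: "re.Match[str]") -> str:
        field = match.group(1)
        # Leave the placeholder untouched if the contact has no such field.
        return contact.get(field, match.group(0))

    return re.sub(r"<([^<>]+)>", replace_field, template)


# Hypothetical template and contact record.
template = "Hello <Contact>,\n\nWould <Department> at <Company> like to host a book drive?"
contact = {
    "Contact": "Cheryl",
    "Department": "Chess Club",
    "Company": "Example University",
}

message = fill_template(template, contact)
```

A field with no matching value stays visible as a raw placeholder in the output, which is one reason step 10 asks you to send a test email to a coworker first.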

Glossary of Images

Fig. 1 "Scrapings update" Doc Main tab with all relevant columns displayed
Fig. 2 "Schools from ACT" tab in the "Scrapings update" Doc
Fig. 3 Sample of Type 1 Website
Fig. 4 Sample of Type 2 Website
Fig. 5 Sample of Type 3 Website
Fig. 6 Example of what you may see if your markers are too general
Fig. 7 Sample scraper
Fig. 8 Importing a previously-built scraper
Fig. 9 Actual page URL and the URL mapped to the scraper do not match
Fig. 10 Using the Find feature to search the source code
Fig. 11 Club links selected and ready to be scraped
Fig. 12 Using the Find feature to search for sample info in the code
Fig. 13 Multi-link scrape in progress
Fig. 14 Scraped information ready to upload to ACT
Fig. 15 Unsubscribers to be omitted from the list