An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to...

54
An Introduction to Web Scraping with Python and DataCamp Olga Scrivner, Research Scientist, CNS, CEWIT WIM, February ,

Transcript of An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to...

Page 1: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

An Introduction to Web Scrapingwith Python and DataCamp

Olga Scrivner, Research Scientist, CNS, CEWITWIM, February 23, 2018

Credits: John D Lees-Miller (University of Bristol), CharlesBatts (University of North Carolina)

0

Page 2: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Objectives

Materials: DataCamp.com

} Review: Importing files

} Accessing Web

} Review: Processing text

} Practice, practice, practice!

1

Page 3: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Credits

} Hugo Bowne-Anderson - Importing Data in Python (Part 1and Part 2)

} Jeri Wieringa - Intro to Beautiful Soup

2

Page 4: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Importing Files

Page 5: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

File Types: Text

} Text files are structured as a sequence of lines

} Each line includes a sequence of characters

} Each line is terminated with a special character End of Line

4

Page 6: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Special Characters: Review

5

Page 7: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Special Characters: Answers

6

Page 8: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Modes

} Reading Mode

◦ ‘r’

} Writing Mode

◦ ‘w’

Quiz question: Why do we use quotes with ‘r’ and ‘w’?Answer: ‘r’ and ‘w’ are one-character strings

7

Page 9: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Modes

} Reading Mode

◦ ‘r’

} Writing Mode

◦ ‘w’

Quiz question: Why do we use quotes with ‘r’ and ‘w’?

Answer: ‘r’ and ‘w’ are one-character strings

7

Page 10: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Modes

} Reading Mode

◦ ‘r’

} Writing Mode

◦ ‘w’

Quiz question: Why do we use quotes with ‘r’ and ‘w’?Answer: ‘r’ and ‘w’ are one-character strings

7

Page 11: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Open - Close

} Open File - open(name, mode)

◦ name = ’filename’

◦ mode = ’r’ or mode = ’w’

8

Page 12: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Open New File

9

Page 13: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Open New File

9

Page 14: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Read File

} Read the Entire File - filename.read()

} Read ONE Line - filename.readline()

- Return the FIRST line- Return the THIRD line

} Read lines - filename.readlines()

What type of object and what is the length of this object?

10

Page 15: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Read File

} Read the Entire File - filename.read()

} Read ONE Line - filename.readline()

- Return the FIRST line- Return the THIRD line

} Read lines - filename.readlines()

What type of object and what is the length of this object?

10

Page 16: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Python Libraries

Page 17: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Import Modules (Libraries)

} Beautiful Soup

} urllib

} More in next slides ...

For installation - https://programminghistorian.org/lessons/intro-to-beautiful-soup

12

Page 18: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Review: Module I

To use external functions (modules), we need to import them:

1. Declare it at the top of the code

2. Use import

3. Call the module

13

Page 19: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Review: Modules II

To refer and import a specific function from the module

1. Declare it at the top pf the code

2. Use from import

3. Call the randint function from randommodule:random.randint()

14

Page 20: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

How to Import Packages with Modules

1. Install via a terminal or console◦ Type command prompt in window search◦ Type terminal in Mac search

2. Check your Python Version

3. Click return/enter

15

Page 21: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

How to Import Packages with Modules

1. Install via a terminal or console◦ Type command prompt in window search◦ Type terminal in Mac search

2. Check your Python Version

3. Click return/enter

15

Page 22: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Python 2 (pip) or Python 3 (pip3)

pip or pip3 - a tool for installing Python packages

To check if pip is installed:

https://packaging.python.org/tutorials/installing-packages/

16

Page 23: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Web Scraping Workflow

Page 24: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Web Concept

1. Import the necessary modules (functions)2. Specify URL3. Send a REQUEST4. Catch RESPONSE5. Return HTML as a STRING6. Close the RESPONSE

18

Page 25: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

URLs

1. URL - Uniform/Universal Resource Locator2. A URL for web addresses consists of two parts:

2.1 Protocol identifier - http: or https:2.2 Resource name - datacamp.com

3. HTTP - HyperText Transfer Protocol4. HTTPS - more secure form of HTTP5. Going to a website = sending HTTP request (GET request)6. HTML - HyperText Markup Language

19

Page 26: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

URLs

1. URL - Uniform/Universal Resource Locator2. A URL for web addresses consists of two parts:

2.1 Protocol identifier - http: or https:2.2 Resource name - datacamp.com

3. HTTP - HyperText Transfer Protocol4. HTTPS - more secure form of HTTP5. Going to a website = sending HTTP request (GET request)6. HTML - HyperText Markup Language

19

Page 27: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

URLs

1. URL - Uniform/Universal Resource Locator2. A URL for web addresses consists of two parts:

2.1 Protocol identifier - http: or https:2.2 Resource name - datacamp.com

3. HTTP - HyperText Transfer Protocol4. HTTPS - more secure form of HTTP5. Going to a website = sending HTTP request (GET request)6. HTML - HyperText Markup Language

19

Page 28: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

URLLIB package

Provide interface for getting data across the web. Instead of filenames we use URLS

Step 1 Install the package urllib (pip install urllib)Step 2 Import the function urlretrieve - to RETRIEVE urls

during the REQUESTStep 3 Create a variable url and provide the url link

url = ‘https:somepage’

Step 4 Save the retrieved document locallyStep 5 Read the file

20

Page 29: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Your Turn - DataCamp

DataCamp.com - create a free account using IU email

1. Log in2. Select Groups

3. Select RBootcampIU - see Jennifer if you do not see it

4. Go to Assignments and select Importing Data in Python

21

Page 30: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Today’s Practice

22

Page 31: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Importing Flat Files

urlretrieve has two arguments: url (input) and file name(output)

Example: urlretrieve(url, ‘file.name’)

23

Page 32: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Importing Flat Files

24

Page 33: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Opening and Reading Files

read_csv has two arguments: url and sep (separator)

pd.head()

25

Page 34: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Opening and Reading Files

read_csv has two arguments: url and sep (separator)

pd.head()

26

Page 35: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Importing Non-flat Files

read_excel has two arguments: url and sheetname

To read all sheets, sheetname = None

Let’s use a sheetname ’1700’27

Page 36: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Importing Non-flat Files

28

Page 37: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

HTTP Requests

read_excel has two arguments: url and sheetname

To read all sheets, sheetname = None

Let’s use a sheetname ’1700’29

Page 38: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

GET request

Import request package

30

Page 39: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

HTTP with urllib

31

Page 40: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

HTTP with urllib

32

Page 41: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Print HTTP with urllib

Use response.read()33

Page 42: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Print HTTP with urllib

34

Page 43: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Return Web as a String

Use r.text35

Page 44: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Return Web as a String

36

Page 45: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Scraping Web - HTML

37

Page 46: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Scraping Web - HTML

37

Page 47: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Scraping Web - BeautifulSoup Workflow

38

Page 48: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Many Useful Functions

} soup.title} soup.get_text()} soup.find_all(’a’)

39

Page 49: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Parsing HTML with BeautifulSoup

40

Page 50: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Parsing HTML with BeautifulSoup

41

Page 51: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Turning a Webpage into Data with BeautifulSoup

soup.title

soup.get_text() 42

Page 52: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Turning a Webpage into Data with BeautifulSoup

43

Page 53: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Turning a Webpage into Data - Hyperlinks

} HTML tag - <a>} find_all(’a’)} Collect all href: link.get(’href’)

44

Page 54: An Introduction to Web Scraping with Python and DataCamp€¦ · 23.02.2018  · An Introduction to Web Scraping with Python and DataCamp Author: Olga Scrivner, Research Scientist,

Turning a Webpage into Data - Hyperlinks

45