Transcript of week5-R - TCPD · 2019. 9. 8.

Pol-201: Data-driven Electoral Analysis
Spring 2019 semester

Co-instructors: Gilles Verniers, Sudheendra Hangal
TA/Lab: Saloni Bhogale

Page 1

Page 2

Scraping: Recap

Copy-n-paste

Browser based tools

Fetch web pages (with curl) and extract data

Programmed web-scraper in R, Python, etc.

Automated browsers (Selenium)

Commercial web-hosted tools (e.g. import.io)

PDF table scraping with Tabula

Page 3

Data cleaning tools

Find-replace

Flash Fill in Excel

Open Refine

Surf

Commercial tools (Trifacta, Clearstory, etc.)

Page 4

Recap: Lab session

Regular expressions

Surf

Lab assignment due tomorrow (any questions?)

Page 5

3 classes of tools for tables

Spreadsheets (Excel, Google sheets, etc.): easiest to use, lowest flexibility, limited speed

R (and similar): millions of rows, good programmability and libraries

SQL: Billions of rows, optimized performance, hardest to use

Page 6

Introduction to R

Page 7

Why R?

An easy language to manipulate data

A more powerful environment than Excel

Captures your actions in the form of a small program… saves repeated effort

Large library of packages for data manipulation, analysis and visualization

Page 8

Why R?

“… a supercharged version of Microsoft’s Excel spreadsheet software that can help illuminate data trends more clearly than is possible by entering information in rows and columns.”

The New York Times, Jan 6, 2009

“Nothing particularly special about R, except for being the greatest software on earth.”

Amanda Cox, New York Times (podcast)

Page 9

Installing R

Install R (3.5 or above) from https://cran.rstudio.com/

Install RStudio (a frontend to R) from

https://www.rstudio.com/products/rstudio/download/

Page 10

Page 11

RStudio: some tips

Use auto-complete to the extent possible

help(“function name”) for documentation

Beware: R is case sensitive

You can save the session when quitting and restart where you left off

Use View(data) command and filters/search on top

Page 12

Tabular data in R

3 ways:

data.frame

data.table (powerful extension of data.frame, we’ll use this)

dplyr library

Page 13

Libraries in R

Libraries extend the functionality in “Base R”

Like a plugin

To install a library, type in console (only once)

install.packages("data.table")

To use a library (each run):

library ("data.table")

Page 14

R Variables

Variables are named objects (like sheets)

GE = fread ("~/data/GE_mastersheet.csv")

GE_winners = GE[Position==1]

Remove a variable with rm(GE)

Page 15

R: types of collections

(Figure: types of collections in R; source: Datacamp, Introduction to R)

Page 16

Vectors and lists

Create a vector with

myvec = c(1,2,3)

c('pol201','cs101',1,2,3) coerces every element to character; a true list, made with list(…), can mix types

A data frame is a list of columns; each column is a vector
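A quick sketch of the coercion behavior, using made-up course names rather than anything from the slides' data:

```r
nums <- c(1, 2, 3)
mixed <- c('pol201', 'cs101', 1, 2, 3)

class(nums)    # "numeric"
class(mixed)   # "character": the numbers were coerced to strings
mixed[3]       # "1", now a string

# a list() holds mixed types without coercion
mylist <- list('pol201', 1, TRUE)
class(mylist[[2]])   # "numeric"
```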

Page 17

More tips

"Factors" = categorical values

Write your code in the script window and save it as a code.r file

Test NA values with is.na(…), numeric with is.numeric(…)

Execute selected code or current line in the script window with Ctrl-enter

Use long and clear names (separate words with _ or .)

Write comments in line starting with #

Use web search to look up syntax and information

Page 18

Using data.table

Read a data.table using:

GE = fread("GE_mastersheet.csv")

View it using:

View(GE)

Data tables: cheat sheet

Page 19

Formulas with data.table

A column in a data frame/data table is accessed with Table$Column

e.g.,

GE$Votes refers to Votes column of table

Aggregate the Votes column:

sum(GE$Votes), max(GE$Votes), min(GE$Votes)

Compute a new derived column:

GE$vote_share = 100*GE$Votes / GE$Valid_Votes
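A minimal sketch of these formulas on a two-row toy table (the column names follow the slides; the values are made up):

```r
library(data.table)

# toy stand-in for the GE mastersheet
GE <- data.table(Candidate   = c("A", "B"),
                 Votes       = c(600, 400),
                 Valid_Votes = c(1000, 1000))

sum(GE$Votes)   # 1000
max(GE$Votes)   # 600

# derived column: vote share as a percentage
GE$vote_share = 100 * GE$Votes / GE$Valid_Votes
GE$vote_share   # 60 40
```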

Page 20

Selecting rows

Selection syntax:

GE[<which rows>]

GE[1], GE[1:3], GE[c(1,3,5)]

GE[Sex == 'O'] == for equals

GE[Sex != 'NOTA'] != for "Not equals"

GE[Assembly_No==16 & Position == 1] & for AND

GE[Sex == 'M' | Sex == 'F'] | (vert. bar) for OR
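The selectors above, run against a small toy table (the rows are made up; only the column names come from the slides):

```r
library(data.table)

GE <- data.table(Candidate   = c("A", "B", "C", "D"),
                 Sex         = c("M", "F", "O", "NOTA"),
                 Position    = c(1, 2, 3, 4),
                 Assembly_No = c(16, 16, 15, 16))

GE[1:3]                                # first three rows
GE[Sex == 'O']                         # 1 row
GE[Sex != 'NOTA']                      # 3 rows
GE[Assembly_No == 16 & Position == 1]  # candidate A
GE[Sex == 'M' | Sex == 'F']            # 2 rows
```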

Page 21

Selecting columns

Selection syntax:

GE[<which rows>,<which columns>]

GE[ , .(Candidate,Votes)] all rows, 2 columns

GE[Position==1,Candidate] cand. column for winners
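The same two selections on a toy table (made-up values):

```r
library(data.table)

GE <- data.table(Candidate = c("A", "B"),
                 Votes     = c(500, 300),
                 Position  = c(1, 2))

GE[, .(Candidate, Votes)]     # all rows, two columns
GE[Position == 1, Candidate]  # "A"
```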

Page 22

More examples

winners = GE[Position == 1]

winners_f = GE[Position == 1 & Sex == "f"]

(matches nothing if Sex is stored as "F", since R is case sensitive)

winners_f = GE[Position == 1 & tolower(Sex) == "f"]

first_winner = GE[1]

first_three = GE[1:3]

Rows start from 1, not 0

Page 23

Sorting with data.table

Sort rows based on a criterion

GE_sorted = GE[order(-Votes)]

order(…) returns a permutation of the rows

- before a column name indicates descending order
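Sorting in action on a toy table (made-up values):

```r
library(data.table)

GE <- data.table(Candidate = c("A", "B", "C"),
                 Votes     = c(100, 300, 200))

GE_sorted = GE[order(-Votes)]   # descending by Votes
GE_sorted$Candidate             # "B" "C" "A"
```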

Page 24

Useful functions

names(GE) # column names

str(GE) # types of columns

nrow(GE) or GE[,.N] # number of rows

length(GE) # number of columns

GE_double_row = rbind(GE,GE) # append rows

GE2_double_col = cbind(GE,GE) # append columns

Page 25

String operations

tolower()

toupper()

gsub("search_string", "replace_string", x)
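These are all base R; a quick sketch with made-up strings:

```r
tolower("NOTA")                    # "nota"
toupper("nota")                    # "NOTA"
gsub('_', ' ', "Uttar_Pradesh")    # "Uttar Pradesh"
gsub('[0-9]+', '', "Ward 12")      # regular expressions work too: "Ward "
```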

Page 26

gsub

Let’s replace _ with space in State names

GE$State_Name = gsub ('_', ' ', GE$State_Name)

View(GE)

Regular expressions supported

Page 27

Uniques and duplicates

anyDuplicated, duplicated, unique

winners16 = GE[Assembly_No == 16 & Position == 1]

winners16$duplicate = duplicated(winners16$Candidate)

winners16[duplicate == TRUE]

unique (winners16$ID)

ifelse(anyDuplicated(winners16$ID), "are dups", "no dups")
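How the three functions behave on a small vector (the names are made up):

```r
candidates <- c("A. Kumar", "B. Singh", "A. Kumar")

duplicated(candidates)     # FALSE FALSE TRUE: marks repeats after the first
unique(candidates)         # "A. Kumar" "B. Singh"
anyDuplicated(candidates)  # 3: index of the first repeat (0 if none)
```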

Page 28

Pivot tables with R

GE[<which rows>, <which columns>, <group by columns>]

Creates a row for every unique combination of group by column values

Group-by columns are usually categorical

Each row has group by columns along with <which columns>

<which columns> are usually aggregations of values in the rows with the specific value of the group-by columns
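The steps above, sketched on a toy winners table (made-up rows):

```r
library(data.table)

winners <- data.table(Party      = c("X", "X", "Y"),
                      State_Name = c("S1", "S2", "S1"))

# one row per unique Party value, with .N counting rows in each group
winners[, .(Seats = .N), by = .(Party)]
#    Party Seats
#        X     2
#        Y     1
```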

Page 29

R data tables syntax

GE[<which rows>, <which columns>, <group by columns>]

which columns: .(Col1=Value1, Col2=Value2, …)

Values are usually aggregation functions like sum(Votes)

or simply .N (the count of rows in the group)

Group-by columns: ColX if a single column, or .(ColX, ColY) if multiple columns

Page 30

winners16[, .(Seats=.N), by=.(Party)]

winners16[, .(Seats=.N), by=.(Party, State_Name)]

winners16[, .(Seats=.N,Candidate), by=.(Party, State_Name)]

(Note: Candidate is not an aggregation, so rows get repeated)

Page 31

Pivot tables with R

List total number of votes polled in each assembly:

votes_by_assembly = GE[ , .(total_votes = sum(Votes)), by=.(Assembly_No)]

List total number of votes and compare with Valid_Votes:

votes = GE[, .(total = sum(Votes),Valid_Votes), by=.(Constituency_No,Assembly_No,Poll_No)]

Page 32

Pivot tables with R

List difference between sum of votes and valid votes in each election

diffs = GE[, .(diff = sum(Votes)-Valid_Votes), by=.(Constituency_No, Assembly_No, Poll_No,State_Name)]

diffs[diff > 0]

Page 33

Pivot on state name; for each state compute # contestants (.N), uniques (length(unique(ID))), and their ratio

GE_by_contestants = GE[ ,
                        .(contestants = .N,
                          uniques = length(unique(ID)),
                          pct = length(unique(ID)) * 100 / .N),
                        by = c('State_Name')]

Page 34

Joins

Merge rows in 2 tables if they have the same values for (col1, col2):

Inner join: only rows with same value in (col1, col2) in both tables retained

merge(table1, table2, by=c("col1", "col2"))

Outer join: all rows in both tables retained. Columns merged if (col1, col2) value is the same; otherwise NA filled in cells where data is absent

merge(table1, table2, by=c("col1", "col2"), all=TRUE)

Page 35

Joins

Left join: all rows in first table retained, with NAs if needed in position of second table’s columns

merge(table1, table2, by=c("col1", "col2"), all.x=TRUE)

Right join: opposite of left join

merge(table1, table2, by=c("col1", "col2"), all.y=TRUE)
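All four joins on a pair of toy tables (made-up IDs and values):

```r
library(data.table)

t1 <- data.table(ID = c(1, 2), Candidate = c("A", "B"))
t2 <- data.table(ID = c(2, 3), Votes = c(100, 200))

merge(t1, t2, by = "ID")                # inner: only ID 2
merge(t1, t2, by = "ID", all = TRUE)    # outer: IDs 1, 2, 3, NAs filled in
merge(t1, t2, by = "ID", all.x = TRUE)  # left: IDs 1, 2
merge(t1, t2, by = "ID", all.y = TRUE)  # right: IDs 2, 3
```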

Page 36

Writing R scripts

Use plenty of whitespace (like copy-editing text)

Write parameters on different lines if complex

Make parenthesis matching obvious

Use intermediate variables instead of long/complex expressions

Use meaningful and long names for variables

Separate pieces of logic into paragraphs

(Otherwise, plan to spend many hours debugging!)

Page 37

Printing and debugging

print ("hello, world")

print (paste("good", "morning", "world"))

(paste concatenates the strings, separated by spaces)
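A few more paste variants worth knowing (base R):

```r
print("hello, world")
print(paste("good", "morning", "world"))  # "good morning world"
paste("good", "morning", sep = "-")       # "good-morning": custom separator
paste0("good", "morning")                 # "goodmorning": no separator at all
```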

Page 38

Set operations

intersect, union, setdiff for set operations, e.g.,

winner_ids = GE[Position == 1]$ID

straggler_ids = GE[Position > 10]$ID

intersect (winner_ids, straggler_ids)

(Might point out wild swings in fortunes… or incorrect ID merges)
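The three set operations on toy ID vectors (made-up IDs):

```r
winner_ids    <- c(10, 11, 12)
straggler_ids <- c(12, 13)

intersect(winner_ids, straggler_ids)  # 12: appears in both
union(winner_ids, straggler_ids)      # 10 11 12 13
setdiff(winner_ids, straggler_ids)    # 10 11: winners who never straggled
```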

Page 39

Most common first name

counts = GE[,.(count=.N), by=first.name]

counts = counts[order(-count)]

counts = GE[Sex=='F',.(count=.N), by=first.name]

Page 40

All assemblies for a person

Find all assemblies the winners of LS 13-16 are present in

winners13_to_16 = GE[Position == 1 & Assembly_No >= 13 & Assembly_No <= 16]

winners13_to_16[, .(Assemblies=paste(Assembly_No, collapse=',')), by=ID]

Page 41

Were there any ties?

firsts = GE[Position==1]

seconds = GE[Position ==2]

merged = merge(firsts, seconds, by=c('State_Name','Constituency_No','Poll_No', 'Assembly_No', 'Votes'))

Page 42

Effect of SP-BSP alliance

Page 43

Benefits of R

You are expected to use R going forward

(Consult us in case of any difficulties)

Submit your R scripts along with your analysis so we can replicate it
