week5-R - TCPD · 2019. 9. 8. · Scraping: Recap Copy-n-paste Browser based tools Fetch web pages...
Pol-201
Data-driven Electoral Analysis
Spring 2019 semester
Co-instructors: Gilles Verniers, Sudheendra Hangal
TA/Lab: Saloni Bhogale
Scraping: Recap
Copy-n-paste
Browser based tools
Fetch web pages (with curl) and extract data
Programmed web scrapers in R, Python, etc.
Automated browsers (Selenium)
Commercial web-hosted tools (e.g. import.io)
PDF table scraping with Tabula
Data cleaning tools
Find-replace
Flash Fill in Excel
Open Refine
Surf
Commercial tools (Trifacta, Clearstory, etc.)
Recap: Lab session
Regular expressions
Surf
Lab assignment due tomorrow (any questions?)
3 classes of tools for tables
Spreadsheets (Excel, Google sheets, etc.): easiest to use, lowest flexibility, limited speed
R (and similar): millions of rows, good programmability and libraries
SQL: Billions of rows, optimized performance, hardest to use
Introduction to R
Why R?
An easy language to manipulate data
A more powerful environment than Excel
Captures your actions in the form of a small program… saves repeated effort
Large library of packages for data manipulation, analysis and visualization
Why R?
“… a supercharged version of Microsoft’s Excel spreadsheet software that can help illuminate data trends more clearly than is possible by entering information in rows and columns.”
The New York Times, Jan 6, 2009
“Nothing particularly special about R, except for being the greatest software on earth.”
Amanda Cox, New York Times (podcast)
Installing R
Install R (3.5 or above) from https://cran.rstudio.com/
Install Rstudio (frontend to R) from
https://www.rstudio.com/products/rstudio/download/
Rstudio: some tips
Use auto-complete to the extent possible
help("function name") for documentation
Beware: R is case sensitive
You can save the session when quitting and restart where you left off
Use View(data) command and filters/search on top
Tabular data in R
3 ways:
data.frame
data.table (powerful extension of data.frame, we’ll use this)
dplyr library
Libraries in R
Libraries extend the functionality in “Base R”
Like a plugin
To install a library, type in console (only once)
install.packages("data.table")
To use a library (each run):
library ("data.table")
R Variables
Variables are named objects (like sheets)
GE = fread ("~/data/GE_mastersheet.csv")
GE_winners = GE[Position==1]
Remove a variable with rm(GE)
R: types of collections
Datacamp: Introduction to R
Lists and vectors
Create a vector with c(…)
mylist = c(1,2,3)
mylist = c('pol201','cs101',1,2,3)   # mixed types are coerced to character
A data frame is a list of columns; each column is a vector
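A small sketch of the difference between c() and list(), using made-up values. The names numbers, mixed, course_info and toy are illustrative and not from the slides.

```r
# c() builds an atomic vector: every element ends up with the same type
numbers = c(1, 2, 3)
mixed = c('pol201', 'cs101', 1, 2, 3)   # the numbers are coerced to character

class(numbers)   # "numeric"
class(mixed)     # "character"

# list() keeps each element's own type
course_info = list('pol201', 'cs101', 1, 2, 3)
class(course_info[[1]])   # "character"
class(course_info[[3]])   # "numeric"

# A data frame is a list of equal-length column vectors
toy = data.frame(course = c('pol201', 'cs101'), credits = c(4, 3))
is.list(toy)          # TRUE
class(toy$credits)    # "numeric"
```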
More tips
"Factors" = categorical values
Write your code in the script window and save it as a code.r file
Test NA values with is.na(…), numeric with is.numeric(…)
Execute selected code or current line in the script window with Ctrl-enter
Use long and clear names (separate words with _ or .)
Write comments in line starting with #
Use web search to look up syntax and information
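The tips on factors and NA tests can be tried on a few made-up values. This is only an illustration; the vectors votes and sex below are hypothetical.

```r
# Hypothetical vote counts; NA marks a missing entry
votes = c(1200, NA, 845)

is.na(votes)               # FALSE TRUE FALSE
is.numeric(votes)          # TRUE
sum(votes, na.rm = TRUE)   # 2045 (na.rm = TRUE skips the missing value)

# A factor stores categorical values together with their set of levels
sex = factor(c('M', 'F', 'F', 'M'))
levels(sex)                # "F" "M"
```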
Using data.table
Read a data.table using:
GE = fread("GE_mastersheet.csv")
View it using:
View(GE)
Data tables: cheat sheet
Formulas with data.table
A column in a data frame/data table is accessed with Table$Column
e.g.,
GE$Votes refers to Votes column of table
Sum all the votes in the Votes column:
sum(GE$Votes), max(GE$Votes), min(GE$Votes)
Compute a new derived column:
GE$vote_share = 100*GE$Votes / GE$Valid_Votes
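To try the formulas without the mastersheet, here is a sketch on a toy table with made-up values. The := form at the end is data.table's own way of adding a column in place; it is shown as an alternative, not as what the slides prescribe.

```r
library(data.table)

# Toy stand-in for the mastersheet; all values are made up
GE = data.table(
  Candidate   = c('A', 'B', 'C'),
  Votes       = c(600, 300, 100),
  Valid_Votes = c(1000, 1000, 1000)
)

sum(GE$Votes)   # 1000
max(GE$Votes)   # 600

# Derived column, written as on the slide
GE$vote_share = 100 * GE$Votes / GE$Valid_Votes

# Equivalent data.table idiom: := adds or updates the column by reference
GE[, vote_share := 100 * Votes / Valid_Votes]
GE$vote_share   # 60 30 10
```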
Selecting rows
Selection syntax:
GE[<which rows>]
GE[1], GE[1:3], GE[c(1,3,5)]
GE[Sex == 'O']                         # == for equals
GE[Sex != 'NOTA']                      # != for "not equals"
GE[Assembly_No==16 & Position == 1]    # & for AND
GE[Sex == 'M' | Sex == 'F']            # | (vertical bar) for OR
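The row selections above can be checked on a small toy table (values made up). The %in% form on the last line is a common shorthand for a chain of ORs and is an addition here, not something from the slides.

```r
library(data.table)

# Toy table with the column names used on the slides; values are made up
GE = data.table(
  Candidate   = c('A', 'B', 'C', 'D', 'E'),
  Sex         = c('M', 'F', 'O', 'F', 'M'),
  Assembly_No = c(16, 16, 16, 15, 15),
  Position    = c(1, 2, 3, 1, 2)
)

GE[c(1, 3, 5)]$Candidate                          # "A" "C" "E"
GE[Sex == 'O']$Candidate                          # "C"
GE[Assembly_No == 16 & Position == 1]$Candidate   # "A"
nrow(GE[Sex == 'M' | Sex == 'F'])                 # 4
nrow(GE[Sex %in% c('M', 'F')])                    # 4, same rows as the OR above
```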
Selecting columns
Selection syntax:
GE[<which rows>,<which columns>]
GE[ , .(Candidate,Votes)]    # all rows, 2 columns
GE[Position==1,Candidate]    # Candidate column for winners
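One detail worth trying out: a bare column name in the column slot returns a plain vector, while wrapping it in .( ) returns a data.table. A sketch on made-up data (the result names are illustrative):

```r
library(data.table)

# Toy table; values are made up
GE = data.table(
  Candidate = c('A', 'B', 'C', 'D'),
  Votes     = c(600, 400, 700, 300),
  Position  = c(1, 2, 1, 2)
)

two_cols    = GE[, .(Candidate, Votes)]          # data.table: all rows, 2 columns
winners_vec = GE[Position == 1, Candidate]       # character vector: "A" "C"
winners_tab = GE[Position == 1, .(Candidate)]    # one-column data.table

ncol(two_cols)               # 2
is.character(winners_vec)    # TRUE
is.data.table(winners_tab)   # TRUE
```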
More examples
winners = GE[Position == 1]
winners_f = GE[Position == 1 & Sex == "f"]
winners_f = GE[Position == 1 & tolower(Sex) == "f"]
first_winner = GE[1]
first_three = GE[1:3]
Rows start from 1, not 0
Sorting with data.table
Sort the rows by one or more columns
GE_sorted = GE[order(-Votes)]
order(…) returns a permutation of the row numbers
A minus sign (-) before a column name indicates descending order
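To see what "a permutation of the rows" means, here is a sketch on a toy table (values made up). Sorting on two columns at once is shown as a natural extension.

```r
library(data.table)

# Toy table; values are made up
GE = data.table(
  Candidate  = c('A', 'B', 'C', 'D'),
  State_Name = c('Goa', 'Goa', 'Bihar', 'Bihar'),
  Votes      = c(300, 900, 500, 700)
)

# order() returns row numbers in sorted order, not the sorted values
order(-GE$Votes)        # 2 4 3 1

GE_sorted = GE[order(-Votes)]
GE_sorted$Candidate     # "B" "D" "C" "A"

# Sort by state (ascending), then by votes (descending) within each state
by_state = GE[order(State_Name, -Votes)]
by_state$Candidate      # "D" "C" "B" "A"
```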
Useful functions
names(GE) # column names
str(GE) # types of columns
nrow(GE) or GE[,.N] # number of rows
length(GE) # number of columns
GE_double_row = rbind(GE,GE) # append rows
GE2_double_col = cbind(GE,GE) # append columns
String operations
tolower()
toupper()
gsub("search_string", "replace_string", x)
gsub
Let’s replace _ with space in State names
GE$State_Name = gsub ('_', ' ', GE$State_Name)
View(GE)
Regular expressions supported
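A few gsub calls on made-up strings, including two that use regular expressions. The vector state_names is hypothetical and not taken from the mastersheet.

```r
# Hypothetical state names in the underscore style
state_names = c('Uttar_Pradesh', 'Andaman_&_Nicobar_Islands', 'Goa')

# Literal replacement, applied to every element of the vector
gsub('_', ' ', state_names)
# "Uttar Pradesh" "Andaman & Nicobar Islands" "Goa"

# Regular expression: collapse any run of whitespace into one space
gsub('\\s+', ' ', 'Tamil   Nadu')        # "Tamil Nadu"

# Regular expression with anchors: strip leading and trailing spaces
gsub('^\\s+|\\s+$', '', '  Kerala ')     # "Kerala"

toupper('goa')                           # "GOA"
tolower('GOA')                           # "goa"
```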
Uniques and duplicates
anyDuplicated, duplicated, unique
winners16 = GE[Assembly_No == 16 & Position == 1]
winners16$duplicate = duplicated(winners16$Candidate)
winners16[duplicate == TRUE]
unique (winners16$ID)
ifelse(anyDuplicated(winners16$ID), "are dups", "no dups")
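The three functions return different things, which is easy to see on a short made-up vector of IDs:

```r
# Hypothetical candidate IDs; 'P1' appears twice
ids = c('P1', 'P2', 'P1', 'P3')

duplicated(ids)      # FALSE FALSE TRUE FALSE (marks the 2nd and later copies)
unique(ids)          # "P1" "P2" "P3"
anyDuplicated(ids)   # 3: index of the first duplicate, or 0 if there is none

# anyDuplicated() returns a number, so compare with 0 for a clear yes/no
has_dups = anyDuplicated(ids) > 0
ifelse(has_dups, "are dups", "no dups")   # "are dups"
```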
Pivot tables with R
GE[<which rows>,
<which columns>,
<group by columns>]
Creates a row for every unique combination of group by column values
Group-by columns are usually categorical
Each row has group by columns along with <which columns>
<which columns> are usually aggregations of values in the rows with the specific value of the group-by columns
R data tables syntax
GE[<which rows>, <which columns>, <group by columns>]
which columns: .(Col1=Value1, Col2=Value2,…)
Values are usually aggregation functions like sum(Votes)
or simply .N (the number of rows in each group)
Group-by columns: ColX if a single column, or .(ColX, ColY) if multiple columns
winners16[, .(Seats=.N), by=.(Party)]
winners16[, .(Seats=.N), by=.(Party, State_Name)]
winners16[, .(Seats=.N,Candidate), by=.(Party, State_Name)]
(Note: Candidate is not an aggregation, so rows get repeated)
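The three group-by calls above can be run on a toy winners16 table (values made up) to see the row counts, including the repetition caused by the non-aggregated Candidate column. The result names are illustrative.

```r
library(data.table)

# Toy winners table; values are made up
winners16 = data.table(
  Candidate  = c('A', 'B', 'C', 'D'),
  Party      = c('X', 'X', 'Y', 'X'),
  State_Name = c('Goa', 'Goa', 'Goa', 'Bihar')
)

# One row per party
seats_by_party = winners16[, .(Seats = .N), by = .(Party)]
seats_by_party[Party == 'X']$Seats    # 3

# One row per (party, state) combination
seats_by_party_state = winners16[, .(Seats = .N), by = .(Party, State_Name)]
nrow(seats_by_party_state)            # 3

# Candidate is not an aggregation: one row per candidate, Seats repeated
with_candidates = winners16[, .(Seats = .N, Candidate), by = .(Party, State_Name)]
nrow(with_candidates)                 # 4
```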
Pivot tables with R
List total number of votes polled in each assembly:
votes_by_assembly = GE[ , .(total_votes = sum(Votes)), by=.(Assembly_No)]
List total number of votes and compare with Valid_Votes:
votes = GE[, .(total = sum(Votes),Valid_Votes), by=.(Constituency_No,Assembly_No,Poll_No)]
Pivot tables with R
List difference between sum of votes and valid votes in each election
diffs = GE[, .(diff = sum(Votes)-Valid_Votes), by=.(Constituency_No, Assembly_No, Poll_No,State_Name)]
diffs[diff > 0]
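A sketch of the same check on a toy table (values made up). One variation is an assumption on my part: Valid_Votes[1] takes a single value per group, so each election yields exactly one row; with the bare Valid_Votes column the result has one row per candidate.

```r
library(data.table)

# Toy data: two constituencies; Valid_Votes is repeated on each candidate row
GE = data.table(
  State_Name      = 'Goa',
  Assembly_No     = 16,
  Poll_No         = 0,
  Constituency_No = c(1, 1, 1, 2, 2),
  Votes           = c(500, 300, 200, 700, 350),
  Valid_Votes     = c(1000, 1000, 1000, 1000, 1000)
)

# One row per election: summed candidate votes minus the reported valid votes
diffs = GE[, .(diff = sum(Votes) - Valid_Votes[1]),
           by = .(Constituency_No, Assembly_No, Poll_No, State_Name)]

nrow(diffs)        # 2
diffs[diff > 0]    # constituency 2, where the votes sum to 50 more than Valid_Votes
```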
Pivot on state name; for each state compute # contestants (.N), uniques (length(unique(ID))), and their ratio
GE_by_contestants = GE[ ,
.( contestants=.N,
uniques=length(unique(ID)),
pct= (length(unique(ID))*100/.N)
),
by=c('State_Name')]
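The same pivot on a toy table (values made up). It uses data.table's uniqueN(ID), which is shorthand for length(unique(ID)); that substitution is mine, not the slide's.

```r
library(data.table)

# Toy data: candidate P1 contested twice in Goa
GE = data.table(
  State_Name = c('Goa', 'Goa', 'Goa', 'Bihar', 'Bihar'),
  ID         = c('P1', 'P1', 'P2', 'P3', 'P4')
)

GE_by_contestants = GE[,
  .(contestants = .N,
    uniques     = uniqueN(ID),           # same as length(unique(ID))
    pct         = uniqueN(ID) * 100 / .N
  ),
  by = State_Name]

GE_by_contestants[State_Name == 'Goa']   # contestants 3, uniques 2, pct about 66.7
```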
Joins
Merge rows in 2 tables if they have the same values for (col1, col2):
Inner join: only rows with same value in (col1, col2) in both tables retained
merge(table1, table2, by=c("col1", "col2"))
Outer join: all rows in both tables retained. Columns merged if (col1, col2) value is the same; otherwise NA filled in cells where data is absent
merge(table1, table2, by=c("col1", "col2"), all=TRUE)
Joins
Left join: all rows in first table retained, with NAs if needed in position of second table’s columns
merge(table1, table2, by=c("col1", "col2"), all.x=TRUE)
Right join: opposite of left join
merge(table1, table2, by=c("col1", "col2"), all.y=TRUE)
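The four join types can be compared on two tiny tables (values made up; the columns x and y are hypothetical). Note that R's logical constants are written TRUE and FALSE in capitals.

```r
library(data.table)

# Two toy tables that share the keys (2, 'b') and (3, 'c')
table1 = data.table(col1 = c(1, 2, 3), col2 = c('a', 'b', 'c'), x = c(10, 20, 30))
table2 = data.table(col1 = c(2, 3, 4), col2 = c('b', 'c', 'd'), y = c(200, 300, 400))

inner_join = merge(table1, table2, by = c('col1', 'col2'))                # 2 rows
outer_join = merge(table1, table2, by = c('col1', 'col2'), all = TRUE)    # 4 rows
left_join  = merge(table1, table2, by = c('col1', 'col2'), all.x = TRUE)  # 3 rows
right_join = merge(table1, table2, by = c('col1', 'col2'), all.y = TRUE)  # 3 rows

# Key (1, 'a') exists only in table1, so its y value is NA in the left join
left_join[col1 == 1]$y    # NA
```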
Writing R scripts
Use plenty of whitespace (like copy-editing text)
Write parameters on different lines if complex
Make parentheses matching obvious
Use intermediate variables instead of long/complex expressions
Use meaningful and long names for variables
Separate pieces of logic into paragraphs
(Otherwise, plan to spend many hours debugging!)
Printing and debugging
print ("hello, world")
print (paste("good", "morning", "world"))
(paste joins the strings into one, separated by spaces)
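A few variations on paste that are handy when printing debug messages. The values and the variable n_rows are made up for illustration.

```r
paste("good", "morning", "world")        # "good morning world"
paste("Assembly", 16, sep = "_")         # "Assembly_16" (custom separator)
paste0("LS", 16)                         # "LS16" (no separator)
paste(c(13, 14, 16), collapse = ",")     # "13,14,16" (joins one vector's elements)

# Typical debugging print: label a value so the console output is readable
n_rows = 42                              # hypothetical row count
print(paste("Number of rows:", n_rows))
```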
Set operations
intersect, union, setdiff for set operations, e.g.,
winner_ids = GE[Position == 1]$ID
straggler_ids = GE[Position > 10]$ID
intersect (winner_ids, straggler_ids)
(Might point out wild swings in fortunes… or incorrect ID merges)
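The three set functions on made-up ID vectors, as a quick reference:

```r
# Hypothetical IDs; 'P3' appears in both groups
winner_ids    = c('P1', 'P2', 'P3')
straggler_ids = c('P3', 'P4')

intersect(winner_ids, straggler_ids)   # "P3": in both
union(winner_ids, straggler_ids)       # "P1" "P2" "P3" "P4": in either
setdiff(winner_ids, straggler_ids)     # "P1" "P2": winners who were never stragglers
```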
Most common first name
counts = GE[,.(count=.N), by=first.name]
counts = counts[order(-count)]
counts = GE[Sex=='F',.(count=.N), by=first.name]
All assemblies for a person
Find all assemblies the winners of LS 13-16 are present in
winners13_to_16 = GE[Position == 1 & Assembly_No >= 13 & Assembly_No <= 16]
winners13_to_16[, .(Assemblies=paste(Assembly_No, collapse=',')), by=ID]
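Here paste(…, collapse=',') turns each person's assembly numbers into a single string. A sketch on a toy table (values made up; the result name assemblies_by_id is illustrative):

```r
library(data.table)

# Toy winners table: P1 won in assemblies 13, 14 and 16; P2 only in 13
winners13_to_16 = data.table(
  ID          = c('P1', 'P2', 'P1', 'P1'),
  Assembly_No = c(13, 13, 14, 16)
)

assemblies_by_id = winners13_to_16[,
  .(Assemblies = paste(Assembly_No, collapse = ',')),
  by = ID]

assemblies_by_id[ID == 'P1']$Assemblies   # "13,14,16"
assemblies_by_id[ID == 'P2']$Assemblies   # "13"
```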
Were there any ties?
firsts = GE[Position==1]
seconds = GE[Position ==2]
merged = merge (firsts, seconds, by=c('State_Name','Constituency_No','Poll_No', 'Assembly_No', 'Votes'))
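Because Votes is one of the join columns, the merge keeps only elections where the winner and the runner-up polled the same number of votes. A sketch on toy data (values made up); the suffixes argument is an optional addition that labels the two Candidate columns.

```r
library(data.table)

# Toy data: constituency 1 is a tie at 500 votes, constituency 2 is not
GE = data.table(
  State_Name      = 'Goa',
  Constituency_No = c(1, 1, 2, 2),
  Poll_No         = 0,
  Assembly_No     = 16,
  Position        = c(1, 2, 1, 2),
  Candidate       = c('A', 'B', 'C', 'D'),
  Votes           = c(500, 500, 700, 650)
)

firsts  = GE[Position == 1]
seconds = GE[Position == 2]

# Inner join on the election keys plus Votes: only ties survive
ties = merge(firsts, seconds,
             by = c('State_Name', 'Constituency_No', 'Poll_No', 'Assembly_No', 'Votes'),
             suffixes = c('_first', '_second'))

nrow(ties)               # 1
ties$Candidate_first     # "A"
ties$Candidate_second    # "B"
```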
Effect of SP-BSP alliance
Benefits of R
You are expected to use R going forward
(Consult us in case of any difficulties)
Submit your R scripts along with your analysis so we can replicate it