2017 Contributing to Open Elections Data using R
Contributing to OpenElections (Open Data) Using R
Rupal Agrawal
BARUG Meetup
February 2017
Background - Election data in the US
• Election results are not reported by any single federal agency
• Instead, each state & county reports in a variety of formats (HTML, PDF, CSV), often with very different layouts and varying levels of granularity
• There are many elections besides the Presidential: primaries for each party, mid-term and special elections for various offices (US Senate, US House, State legislatures, Governor, etc.)
• There is no freely available, comprehensive source of official election results for people to use for analysis or for journalists to use for reporting
• Article: “Elections: The final frontier of open data?”
• https://sunlightfoundation.com/2015/02/27/elections-the-final-frontier-of-open-data/
About OpenElections
• Goal of this Open Data effort: “to create the first free, comprehensive, standardized, linked set of election data for the United States, including federal, statewide and state legislative offices”
• Website openelections.net (not current, need volunteers)
• docs.openelections.net (instructions to contribute)
• Github Page (updated regularly)
• Contains latest work in progress
• Separate repo for each state
• Processed data by year, election
• Instructions for contributors
• Contributed code/scripts mainly in Python
• Issue tracking
@openelex
Motivation
• I have been volunteering with OpenElections towards creating such a source
• I use R to automate some of these tasks: web-scraping, PDF conversion and data manipulation, to produce the desired outputs in a consistent format
• In this lightning talk, using real examples from multiple US states, I will highlight some of the challenges I faced
• I will also share some of the R packages I used (RSelenium, XML, pdftools, tabulizer, dplyr, tidyr, data.table) to help others wishing to volunteer with similar Open Data efforts
Desired output format (csv)
1. County
2. Precinct (if available)
3. Office (President, U.S. Senate, U.S. House, State Senate, State House, Attorney General, etc)
4. District (# for U.S. House district or State Senate or State House district)
5. Party (DEM, REP, LIB, OTH…)
6. Candidate (names of candidates)
7. Votes (# of votes received)
OpenElections specifies a standardized format for the desired output
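As a rough illustration of that layout, the sketch below builds a tiny data frame with the seven columns listed above and writes it to a csv file. All values, the lowercase column names and the output file name are made-up placeholders, not taken from OpenElections.

```r
# Toy rows in the column order listed above; every value is invented.
results <- data.frame(
  county    = c("Adair", "Adair"),
  precinct  = c("Precinct 1", "Precinct 1"),
  office    = c("U.S. House", "U.S. House"),
  district  = c("3", "3"),
  party     = c("DEM", "REP"),
  candidate = c("Candidate A", "Candidate B"),
  votes     = c(120L, 95L),
  stringsAsFactors = FALSE
)

# Write without row names so the file holds only the seven columns.
out_file <- file.path(tempdir(), "example__precinct.csv")  # placeholder name
write.csv(results, out_file, row.names = FALSE)

# Read it back to confirm the layout survived the round trip.
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```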
Let’s take a look at 4 US States
IOWA, ALASKA, TENNESSEE, MISSOURI
IOWA
Iowa 2016 General Election all Races at Precinct-level (txt file)
Let’s start with an easy case
Sample input data is available as a text file in wide format - data manipulation only (shown below in Excel for ease of reading)
length(unique(long_DF$RaceTitle))
[1] 197
5000+ columns
Column headers: county+precinct+votetype
Row labels: office+district
Convert the wide file to a long file using the tidyr package (gather command) so that each countyprecinct is in a separate row
long_DF <- df %>% gather(countyprecinct, Votes, c(4:5119))
Sample relevant commands (actual code is more elaborate)
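To make the reshape concrete, here is a minimal sketch of the same gather() call on a toy wide table. Only RaceTitle, countyprecinct and Votes come from the slides; the other column names, the precinct labels and the numbers are invented for illustration, and the real file has thousands of vote columns rather than three.

```r
library(dplyr)
library(tidyr)

# Toy wide table: 3 identifier columns, then one column per county-precinct.
# check.names = FALSE keeps the hyphens in the column names.
df <- data.frame(
  RaceTitle          = c("President", "U.S. House Dist. 1"),
  CandidateName      = c("Candidate A", "Candidate B"),   # hypothetical column
  PoliticalPartyName = c("DEM", "REP"),                   # hypothetical column
  `Adair-1 NW`       = c(10, 4),
  `Adair-2 NE`       = c(7, 3),
  `Polk-Ankeny-01`   = c(20, 15),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Columns 4 to 6 hold the votes; gather() stacks them into key/value pairs.
long_DF <- df %>% gather(countyprecinct, Votes, c(4:6))
print(long_DF)
```

Each input row now appears once per county-precinct column, which is the shape the later separate() steps expect.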
Challenges along the way
• Countyprecinct in input file was separated by “-” but precinct names also contained “-”
• Absentee, Polling & Total votes needed to be retained
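One way to handle the hyphen problem is sketched below in base R: replace only the first "-" with a sentinel string, then split on the sentinel. The "ZZZZ" sentinel mirrors the separator seen in the talk's separate() call, but whether the original code produced it this way is an assumption, as are the sample labels. The sketch also assumes county names themselves contain no hyphen.

```r
# Hypothetical county-precinct labels; two precinct names contain hyphens.
countyprecinct <- c("Adair-1 NW", "Polk-Ankeny-01", "Story-Ames 1-2")

# sub() replaces only the FIRST match, so later hyphens are left alone.
marked <- sub("-", "ZZZZ", countyprecinct, fixed = TRUE)

# Split on the sentinel; each element now has exactly two parts.
parts <- strsplit(marked, "ZZZZ", fixed = TRUE)

split_df <- data.frame(
  county   = vapply(parts, `[`, character(1), 1),
  precinct = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
print(split_df)
```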
Split combined columns like RaceTitle and countyprecinct into individual columns
separate_DF <- long_DF %>% separate(RaceTitle, c("office", "district"), sep = " Dist. ")
separate_DF %>% separate(countyprecinct, c("county", "precinct"), sep = "ZZZZ")
cbind(outputT, outputAbs$absentee_votes, outputP$polling_votes)
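Not every race has a district, so the split on " Dist. " leaves some rows with only one piece. The sketch below, on invented race titles, shows one possible way to deal with that using the fill argument of separate(); it is an illustration, not necessarily what the original script did.

```r
library(dplyr)
library(tidyr)

# Invented race titles: one statewide race and two district races.
long_DF <- data.frame(
  RaceTitle = c("President", "U.S. House Dist. 1", "State Senate Dist. 12"),
  Votes     = c(100, 40, 25),
  stringsAsFactors = FALSE
)

# The separator is a regular expression, so the dot is escaped here.
# fill = "right" puts NA in district when there is nothing after the office.
separate_DF <- long_DF %>%
  separate(RaceTitle, c("office", "district"), sep = " Dist\\. ", fill = "right")
print(separate_DF)
```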
ALASKA
Another sample file – Alaska 2016 General Election at Precinct-level (data manipulation only)
Like the IA file shown earlier, this is also a csv file, but the layout is different and new custom code is needed to process it
Even the same state changes the format and layout of results from one election to the next
Alaska 2012 General Election results in csv are only at District-level (and with a different layout/columns from 2016)
To get precinct-level results, need to process 40 PDFs, one for each county (district)
Used
• pdftools package - pdf_text
• tabulizer package - extract_tables, extract_text
Abandoned after trying a variety of ways to find a consistent pattern across multiple pages and files that would let me extract the data via a script
TENNESSEE
Tennessee 2004 General Elections
[Figure: annotated PDF page with labels for votes, office, candidate, party, county and precinct; one candidate shown with party=OTH]
Results for a single election are available in 4 distinct PDFs, each with dozens of pages
Multiple races in a single PDF
Varying number of candidates per race
Determining where a new race has started is not straightforward
[Figure labels: district, candidate]
TN election results website: http://sos.tn.gov/products/elections/election-results
Pseudo code for TN PDF
• Download file, read
• Convert PDF to free-form text
• Find separators for race, page, county
• Determine number of races, pages, counties per race
• Determine number of candidates per race
• Determine number of rows and columns taken up by candidate names
• Find number of precincts by race
• Tokenize and compute number of words in each precinct name
• Create list of candidates by district
• Merge main data frame with candidates df
• Remove unwanted rows
• Transform and standardize into desired format
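The steps above can be sketched in base R on simulated text. The page layout below (a "RACE:" line, a row of candidate numbers, a "COUNTY" header, then one line per precinct) is a simplified, hypothetical stand-in for what pdf_text() returns; real TN pages are messier, and the sketch assumes single-word county names.

```r
# Simulated pdf_text() output: one string per page (layout is hypothetical).
pages <- c(
  "RACE: State Senate District 1\n1 2\nCOUNTY PRECINCT\nAnderson Norris 120 80\nAnderson Oak Ridge 95 110",
  "RACE: State Senate District 2\n1 2 3\nCOUNTY PRECINCT\nBlount Alcoa 60 70 5"
)

# Convert to free-form lines and find the race separators.
lines   <- unlist(strsplit(pages, "\n", fixed = TRUE))
race_id <- cumsum(grepl("^RACE:", lines))   # increments at each new race
blocks  <- split(lines, race_id)            # one character vector per race

parse_block <- function(b) {
  header_row <- grep("COUNTY", b)[1]
  # The row before the COUNTY header lists candidate numbers; take the max.
  n_cand <- max(as.numeric(strsplit(trimws(b[header_row - 1]), "\\s+")[[1]]))
  rows <- b[(header_row + 1):length(b)]
  do.call(rbind, lapply(rows, function(r) {
    tok <- strsplit(trimws(r), "\\s+")[[1]]
    n <- length(tok)
    data.frame(
      race         = sub("^RACE:\\s*", "", b[1]),
      county       = tok[1],                                   # assumes one word
      precinct     = paste(tok[2:(n - n_cand)], collapse = " "), # may be several words
      candidate_no = seq_len(n_cand),
      votes        = as.numeric(tok[(n - n_cand + 1):n]),      # last n_cand tokens
      stringsAsFactors = FALSE
    )
  }))
}

result <- do.call(rbind, lapply(blocks, parse_block))
print(result)
```

The number of vote columns tells the parser how many trailing tokens are votes, so everything between the county and the votes is treated as the precinct name, whatever its word count.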
library(pdftools)
library(dplyr)
txt <- pdf_text(filename)
#' Store the whole pdf in one dataframe of 1 column
df <- read.csv(textConnection(txt), sep = "\n", header = F, stringsAsFactors = F)
## Find out how many candidates per Race & how many rows for candidate names
## logic for num_cand is based on number of columns for vote counts
## example, searching for row before "COUNTY" and see 1 2 3 4... and take max
## logic for numrows_col1 is based on count of rows between race name
## & vote count column headers
## (the Race column is presumably added to df in code not shown on the slide)
a <- df %>%
  group_by(Race) %>%
  mutate(key = grep("COUNTY", V1)[1] - 1, # row prior to first match
         # split on whitespace so two-digit candidate numbers stay intact
         num_cand = max(as.numeric(unlist(strsplit(trimws(V1[key]), "\\s+"))), na.rm = T),
         numrows_col1 = key - CANDIDATE_BLK_EXTRA_LINES
         # diff = (num_cand == numrows_col1) # catch where num of candidates
         # is diff from extra rows between race & vote headers
  ) %>%
  select(-key)
INITIAL DF
INTERMEDIATE DF
7 Candidates, listed in 2 columns, 5 rows
Candidate names in 4 rows, 3 columns
Party handled differently.
There is yet another example (not shown) with >10 candidates, where a single row (precinct) runs across multiple pages!
Wrote a bunch of helper functions like these below
Input parameters
Multiple lines for a candidate
One of the many interesting challenges along the way
# create new df with names of candidates by district
c2 <- candidate_list
candidate_list <- b %>%
  group_by(district) %>%
  slice(2:(numrows_col1 + 1)) %>%
  select(V1, district, num_cand, numrows_col1)
clean_cand <- create_list_candidates_and_numbers(candidate_list)
candidate_list1 <- clean_cand %>%
  separate(Candidate, c("Candidate", "party"), sep = " . ") %>%
  unite(dist_cand, district, Number, sep = "_Z_", remove = TRUE)
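To show what the separate() and unite() steps produce and how the resulting key could be used for the merge mentioned in the pseudo code, here is a small sketch on invented data. The " - " separator between name and party, the toy candidates and the votes_long table are all assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Invented cleaned candidate list: one row per candidate per district.
clean_cand <- data.frame(
  district  = c(1, 1, 2),
  Number    = c(1, 2, 1),
  Candidate = c("Jane Doe - DEM", "John Roe - REP", "Ann Poe - IND"),
  stringsAsFactors = FALSE
)

# Split name from party, then build a district + candidate-number key.
candidate_key <- clean_cand %>%
  separate(Candidate, c("Candidate", "party"), sep = " - ") %>%
  unite(dist_cand, district, Number, sep = "_Z_", remove = TRUE)

# Invented vote rows that carry the same key instead of candidate names.
votes_long <- data.frame(
  dist_cand = c("1_Z_1", "1_Z_2", "2_Z_1"),
  precinct  = c("Norris", "Norris", "Alcoa"),
  votes     = c(120, 80, 60),
  stringsAsFactors = FALSE
)

# Attach candidate name and party to every vote row via the shared key.
merged <- left_join(votes_long, candidate_key, by = "dist_cand")
print(merged)
```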
Input PDF
Appears as 2 candidates!
DF
Sample code
MISSOURI
Missouri 2016 Primary Elections (at county level) - HTML
100+ counties in dropdown
Note: URL doesn’t change with selections
county
office
candidate party votes
district
Convert and transform the raw table data into the desired format
After extracting and manipulating 100+ HTML pages, county-level (not precinct-level) data for 1 election is ready!
library(RSelenium)
remDrv <- remoteDriver(browserName = 'phantomjs') # instantiate new remoteDriver
remDrv$open() # open method in remoteDriver class
url <- 'http://enrarchives.sos.mo.gov/enrnet/CountyResults.aspx'
# Simulate browser session and fill out form
remDrv$navigate(url) # send headless browser to url
# Select the Election from DROPDOWN using id in xpath
# (selected_election is the position of the desired election in the dropdown)
elec_xp <- paste0('//*[@id="cboElectionNames"]/option[', selected_election, ']')
remDrv$findElement(using = "xpath", elec_xp)$clickElement()
# election is set
# ---- Click the button to select the Election
eBTN <- '//*[@id="MainContent_btnElectionType"]'
remDrv$findElement(using = 'xpath', eBTN)$clickElement()
Use RSelenium package to simulate headless browser
• Initialize browser session
• Go to URL
• Select Election name from Dropdown
• Click Choose Election button
• Select County name from Dropdown
• Click Submit button
• Get HTML Data for selected Election and County
• Process HTML and Extract Table
• Convert to Raw Data (readHTMLTable())
• Transform raw data into desired format
• Repeat for all counties for that Election
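The "repeat for all counties" step could be wrapped in a loop along the following lines. This is only a sketch: the function name, the fixed wait and the submit_xpath argument are hypothetical (the Submit button's XPath is not given in the slides, so the caller must supply it), and the driver is assumed to be an open RSelenium remoteDriver already showing the chosen election.

```r
library(XML)

# remDrv:       an open remoteDriver with the election already selected
# n_counties:   number of options in the county dropdown
# submit_xpath: XPath of the Submit button (placeholder; supply the real one)
scrape_all_counties <- function(remDrv, n_counties, submit_xpath, wait_secs = 2) {
  results <- vector("list", n_counties)
  for (i in seq_len(n_counties)) {
    # Pick the i-th county in the dropdown, then submit the form.
    county_xp <- paste0('//*[@id="cboCounty"]/option[', i, ']')
    remDrv$findElement(using = "xpath", county_xp)$clickElement()
    remDrv$findElement(using = "xpath", submit_xpath)$clickElement()
    Sys.sleep(wait_secs)  # crude pause so the page can refresh

    # Grab the rendered HTML and keep the first table on the page.
    raw <- remDrv$getPageSource()[[1]]
    tables <- readHTMLTable(htmlParse(raw, asText = TRUE), stringsAsFactors = FALSE)
    results[[i]] <- tables[[1]]
  }
  results
}
```

Because the URL does not change between selections, the loop has to drive the form each time rather than request one URL per county.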
## Get the HTML data from the page and process it using XML package
library(XML)
raw <- remDrv$getPageSource()[[1]]
counties_val <- xpathSApply(htmlParse(raw), '//*[@id="cboCounty"]/option', xmlAttrs)
chosen_county <- grep("selected", counties_val)
# Extract the Table (Election results)
resTable <- raw %>% readHTMLTable()
resDf <- resTable[[1]] # return desired data frame from list of tables
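Once a county's table is in a data frame, it still has to be reshaped into the standard columns. The base R sketch below assumes a hypothetical raw layout in which each office name sits on its own row (with empty party and vote cells) above its candidates; the actual Missouri table may be laid out differently, and the county name and all values are invented.

```r
# Hypothetical raw table as it might come back from readHTMLTable().
raw_tbl <- data.frame(
  V1 = c("U.S. Senator", "Candidate A", "Candidate B",
         "U.S. Representative - District 4", "Candidate C"),
  V2 = c("", "DEM", "REP", "", "DEM"),
  V3 = c("", "1,234", "987", "", "456"),
  stringsAsFactors = FALSE
)

# Office rows are the ones with no party and no votes.
is_office <- raw_tbl$V2 == "" & raw_tbl$V3 == ""

# Carry each office name down to the candidate rows beneath it.
office_full <- raw_tbl$V1[is_office][cumsum(is_office)]
has_dist <- grepl(" - District ", office_full, fixed = TRUE)

out <- data.frame(
  county    = "Adair",  # placeholder for the county selected in the dropdown
  office    = sub(" - District .*$", "", office_full),
  district  = ifelse(has_dist, sub("^.* - District ", "", office_full), NA_character_),
  party     = raw_tbl$V2,
  candidate = raw_tbl$V1,
  votes     = as.integer(gsub(",", "", raw_tbl$V3)),  # drop thousands separators
  stringsAsFactors = FALSE
)[!is_office, ]
print(out)
```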
Conclusions & Takeaways
• Great way to learn and contribute
• pdftools: good package for extracting text data from PDFs
• tabulizer: useful package for extracting tabular data from PDFs
• RSelenium, XML: great packages for web-scraping with (simulating) forms
• Lots of work still needs to be done for recent elections (2000-2016) across all states
• 50 states, 100s of input files in a variety of formats per state
• Meaningful analysis can be done by data scientists once data is available
• Presidential election results get a lot of attention, but other races are arguably as important
Questions?
Rupal Agrawal [email protected]
Info on OpenElections:
@openelex
docs.openelections.net
https://github.com/openelections