2017 Contributing to Open Elections Data using R

Contributing to OpenElections (Open Data) Using R

Rupal Agrawal

BARUG Meetup

February 2017

https://www.meetup.com/R-Users/events/236866221/

Background - Election data in the US

• Election results are not reported by any single federal agency

• Instead, each state and county reports in a variety of formats (HTML, PDF, CSV), often with very different layouts and varying levels of granularity

• There are many elections besides the Presidential: primaries for each party, mid-term and special elections for various offices (US Senate, US House, State legislatures, Governor, etc.)

• There is no freely available, comprehensive source of official election results for people to use for analysis or for journalists to use for reporting

• Article: “Elections: The final frontier of open data?”

• https://sunlightfoundation.com/2015/02/27/elections-the-final-frontier-of-open-data/


About OpenElections

• Goal of this Open Data effort: “to create the first free, comprehensive, standardized, linked set of election data for the United States, including federal, statewide and state legislative offices”

• Website openelections.net (not current, need volunteers)

• docs.openelections.net (instructions to contribute)

• GitHub page (updated regularly)

• Contains latest work in progress

• Separate repo for each state

• Processed data by year, election

• Instructions for contributors

• Contributed code/scripts mainly in Python

• Issue tracking


@openelex

http://docs.openelections.net/

https://github.com/openelections

Motivation

• I have been volunteering with OpenElections towards creating such a source

• I use R to automate some of these tasks (web scraping, PDF conversion and data manipulation) to produce the desired outputs in a consistent format

• In this lightning talk, using real examples from multiple US states, I will highlight some of the challenges I faced

• I will also share some of the R packages I used (RSelenium, XML, pdftools, tabulizer, dplyr, tidyr, data.table), with the aim of helping others who wish to volunteer with similar Open Data efforts


Desired output format (csv)

1. County
2. Precinct (if available)
3. Office (President, U.S. Senate, U.S. House, State Senate, State House, Attorney General, etc.)
4. District (# for U.S. House district or State Senate or State House district)
5. Party (DEM, REP, LIB, OTH…)
6. Candidate (names of candidates)
7. Votes (# of votes received)

OpenElections specifies a standardized format for the desired output
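As a rough illustration of this layout, the R sketch below builds a tiny data frame with the seven columns and writes it out as CSV. The lowercase column names, the sample rows and the vote counts are all invented for the example; the OpenElections contributor docs remain the authority on the exact headers.

```r
# Hypothetical rows in the standardized layout (all values are invented).
results <- data.frame(
  county    = c("Adair", "Adair"),
  precinct  = c("1 NW", "1 NW"),
  office    = c("President", "President"),
  district  = c(NA, NA),
  party     = c("DEM", "REP"),
  candidate = c("Candidate A", "Candidate B"),
  votes     = c(120L, 150L),
  stringsAsFactors = FALSE
)

# Write without row names so the file holds only the seven columns;
# missing districts are written as empty fields.
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE, na = "")

# Read the file back to confirm the header survived the round trip.
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(names(check))
```

Keeping one row per candidate per precinct is what makes files from different states stackable later.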


Output Format

Let’s take a look at 4 US States

IOWA, ALASKA, TENNESSEE, MISSOURI

Iowa 2016 General Election all Races at Precinct-level (txt file)

Let’s start with an easy case

Sample input data is available as a text file in wide format, so only data manipulation is needed (shown below in Excel for ease of reading)

length(unique(long_DF$RaceTitle))
[1] 197

Callouts on the screenshot: 5000+ columns; county + precinct + vote type; office + district


Convert the wide file to a long file using the tidyr package (gather command) so each countyprecinct is in a separate row

long_DF <- df %>% gather(countyprecinct, Votes, c(4:5119))

Sample relevant commands (actual code is more elaborate)

Challenges along the way

• Countyprecinct in the input file was separated by “-”, but precinct names also contained “-”

• Absentee, Polling & Total votes needed to be retained


Split combined columns like RaceTitle and countyprecinct into individual columns

separate_DF <- long_DF %>% separate(RaceTitle, c("office", "district"), sep = " Dist. ")

separate_DF %>% separate(countyprecinct, c("county", "precinct"), sep = "ZZZZ")

cbind(outputT, outputAbs$absentee_votes, outputP$polling_votes)
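The hyphen problem can be sketched on a toy table. The slide code splits on a "ZZZZ" placeholder but does not show how that placeholder got there; the version below assumes it was created by replacing only the first "-" in each column name, which is a guess. The data frame `wide` and its values are invented.

```r
library(tidyr)
library(dplyr)

# Tiny stand-in for the wide file: one row per race, one column per
# county-precinct (names and counts are invented).
wide <- data.frame(
  RaceTitle = c("President", "State Senate Dist. 12"),
  `Adair-1 NW` = c(10, 4),
  `Polk-Des Moines-03` = c(25, 9),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

long <- wide %>%
  # Wide to long: every county-precinct column becomes its own row.
  gather(countyprecinct, Votes, -RaceTitle) %>%
  # sub() replaces only the FIRST "-", so hyphens inside precinct names survive.
  mutate(countyprecinct = sub("-", "ZZZZ", countyprecinct, fixed = TRUE)) %>%
  separate(countyprecinct, c("county", "precinct"), sep = "ZZZZ")

print(long)
```

This only works if county names themselves never contain a hyphen, which would need checking for each state.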


ALASKA

Another sample file – Alaska 2016 General Election at Precinct-level (data manipulation only)

Like the IA file shown earlier, this is also a csv file, but the layout is different and new custom code is needed to process it


Even the same state changes the format and layout of its results from one election to the next

Alaska 2012 General Election results in csv are only at District-level (and have a different layout/columns from 2016)

To get precinct-level results, need to process 40 PDFs, one for each county (district)


Used

• pdftools package - pdf_text

• tabulizer package - extract_tables, extract_text

Abandoned after trying a variety of ways to find a consistent pattern across multiple pages and files that would let me extract the data via a script
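For readers who have not used pdftools, here is a minimal sketch of what pdf_text hands back. To stay self-contained it first draws a one-page PDF with base R graphics; the page content is invented, and real Alaska result PDFs are far less regular than this.

```r
library(pdftools)

# Make a one-page PDF on the fly so the example needs no external file.
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file)
plot.new()
text(0.5, 0.6, "Precinct    Candidate A    Candidate B")
text(0.5, 0.5, "01-446 Aurora    120    150")
invisible(dev.off())

# pdf_text() returns a character vector with one element per page.
pages <- pdf_text(pdf_file)

# Split the first page into lines and drop the empty ones.
page_lines <- trimws(unlist(strsplit(pages[1], "\n")))
page_lines <- page_lines[page_lines != ""]
print(page_lines)
```

Everything after this point is string handling, which is where inconsistent layouts across pages and files cause the trouble described above.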


TENNESSEE

Tennessee 2004 General Elections

Callouts on the PDF screenshot: votes, office, candidate, party, county, precinct, candidate party = OTH

Results for a single election are available in 4 distinct PDFs, each with dozens of pages


Multiple races in a single PDF

Varying number of candidates per race

Determining where a new race has started is not straightforward

Callouts on the PDF screenshot: district, candidate

TN election results website: http://sos.tn.gov/products/elections/election-results


Pseudo code for TN PDF

• Download file, read

• Convert PDF to free-form text

• Find separators for race, page, county

• Determine number of races, pages, counties per race

• Determine number of candidates per race

• Determine number of rows and columns taken up by candidate names

• Find number of precincts by race

• Tokenize and compute number of words in each precinct name

• Create list of candidates by district

• Merge main data frame with candidates df

• Remove unwanted rows

• Transform and standardize into desired format
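The "find separators" and "determine number of ..." steps above can be sketched in base R on a few invented lines. The key assumption is that a race title can be recognised by a pattern (here a hypothetical `race_pattern`); in the real PDFs, finding that marker is the hard part.

```r
# Hypothetical lines as they might come out of a PDF-to-text step.
lines <- c(
  "United States Senate",
  "1. Candidate A - DEM",
  "2. Candidate B - REP",
  "COUNTY PRECINCT 1 2",
  "Anderson Precinct-1 100 200",
  "State House District 5",
  "1. Candidate C - DEM",
  "COUNTY PRECINCT 1",
  "Bedford Precinct-2 50"
)

# Assumed marker for a race title; cumsum() turns each match into a new block id.
race_pattern <- "^(United States|State) "
race_id <- cumsum(grepl(race_pattern, lines))
df <- data.frame(V1 = lines, Race = race_id, stringsAsFactors = FALSE)

# Per race: where the "COUNTY" header row sits and how many rows follow it.
summary_by_race <- do.call(rbind, lapply(split(df, df$Race), function(block) {
  header_row <- grep("^COUNTY", block$V1)[1]
  data.frame(
    Race = block$Race[1],
    title = block$V1[1],
    header_row = header_row,
    num_precinct_rows = nrow(block) - header_row,
    stringsAsFactors = FALSE
  )
}))
print(summary_by_race)
```

Once every line carries a race id, the later steps (candidates per race, precincts per race) become grouped operations.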


library(pdftools)
library(dplyr)

txt <- pdf_text(filename)

#' Store the whole pdf in one data frame of 1 column
df <- read.csv(textConnection(txt), sep = "\n", header = FALSE,
               stringsAsFactors = FALSE)

## Find out how many candidates per Race & how many rows for candidate names
## logic for num_cand is based on number of columns for vote counts
## example, searching for row before "COUNTY" and see 1 2 3 4... and take max
## logic for numrows_col1 is based on count of rows between race name
## & vote count column headers

a <- df %>%
  group_by(Race) %>%
  mutate(key = grep("COUNTY", V1)[1] - 1,  # row prior to first match
         num_cand = as.numeric(max(unlist(strsplit(V1[key], split = "")), na.rm = TRUE)),
         numrows_col1 = key - CANDIDATE_BLK_EXTRA_LINES
         # diff = (num_cand == numrows_col1)  # catch where num of candidates
         #                                    # is diff from extra rows between race & vote headers
  ) %>%
  select(-key)
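The num_cand logic above takes the maximum over single characters, which appears to work only while candidate numbers stay below 10. A small sketch of a variant that reads whole integers is shown below; the helper name `count_candidates` and the sample header rows are hypothetical.

```r
# Return the largest candidate number found in a header row such as
# "   1   2   3   4", or NA when the row contains no digits.
count_candidates <- function(header_line) {
  # Pull out every run of digits as one token, so "11" is not split into "1", "1".
  tokens <- regmatches(header_line, gregexpr("[0-9]+", header_line))[[1]]
  if (length(tokens) == 0) return(NA_integer_)
  max(as.integer(tokens))
}

count_candidates("          1     2     3     4")          # 4
count_candidates("   1  2  3  4  5  6  7  8  9  10  11")   # 11
```

The function takes a single string, so inside a grouped mutate it would be called on V1[key] for each race.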

INITIAL DF

INTERMEDIATE DF


7 Candidates, listed in 2 columns, 5 rows


Candidate names in 4 rows, 3 columns

Party handled differently.

There is yet another example (not shown) with more than 10 candidates, where a single row (precinct) spans multiple pages!

Wrote a bunch of helper functions like these below

Input parameters


Multiple lines for a candidate

One of the many interesting challenges along the way

# create new df with names of candidates by district
c2 <- candidate_list
candidate_list <- b %>%
  group_by(district) %>%
  slice(2:(numrows_col1[1] + 1)) %>%  # numrows_col1 is constant within a district
  select(V1, district, num_cand, numrows_col1)

clean_cand <- create_list_candidates_and_numbers(candidate_list)

candidate_list1 <- clean_cand %>%
  separate(Candidate, c("Candidate", "party"), sep = " . ") %>%
  unite(dist_cand, district, Number, sep = "_Z_", remove = TRUE)
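One way to handle a candidate whose name wraps onto a second line is to glue continuation lines back onto the entry they belong to. The sketch below assumes each real entry starts with a number followed by a period, which is a guess at the layout; the sample lines are invented and this is not the helper used in the talk.

```r
# Hypothetical candidate block; the second entry wraps onto a second line.
cand_lines <- c(
  "1. Candidate A - DEM",
  "2. Candidate With A Very Long",
  "Name - REP",
  "3. Candidate C - OTH"
)

# Assumption: a real entry starts with "<number>. "; anything else is a continuation.
starts_entry <- grepl("^[0-9]+\\. ", cand_lines)
entry_id <- cumsum(starts_entry)

# Paste each continuation line onto the entry it belongs to.
merged <- as.vector(tapply(cand_lines, entry_id, paste, collapse = " "))
print(merged)
```

After merging, the block has one line per candidate again, so a split into name and party no longer sees an extra candidate.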

Input PDF

Appears as 2 candidates!

DF

Sample code


MISSOURI

Missouri 2016 Primary Elections (at county level) - HTML

100+ counties in dropdown

Note: URL doesn’t change with selections

Callout on the screenshot: county

Callouts on the results table: office, district, candidate, party, votes

Convert and Transform table raw data into desired format

After extracting and manipulating 100+ HTML pages, county-level (not precinct-level) data for 1 election is ready!


library(RSelenium)

remDrv <- remoteDriver(browserName = 'phantomjs')  # instantiate new remoteDriver
remDrv$open()  # open method in remoteDriver class

url <- 'http://enrarchives.sos.mo.gov/enrnet/CountyResults.aspx'

# Simulate browser session and fill out form
remDrv$navigate(url)  # send headless browser to url

# Select the Election from DROPDOWN using id in xpath
# (selected_election is the position of the desired option, set beforehand)
elec_xp <- paste0('//*[@id="cboElectionNames"]/option[', selected_election, ']')
remDrv$findElement(using = "xpath", elec_xp)$clickElement()  # election is set

# ---- Click the button to select the Election
eBTN <- '//*[@id="MainContent_btnElectionType"]'
remDrv$findElement(using = 'xpath', eBTN)$clickElement()

Use RSelenium package to simulate headless browser

• Initialize browser session

• Go to URL

• Select Election name from Dropdown

• Click Choose Election button

• Select County name from Dropdown

• Click Submit button

• Get HTML Data for selected Election and County

• Process HTML and Extract Table

• Convert to Raw Data (readHTMLTable())

• Transform raw data into desired format

• Repeat for all counties for that Election

library(XML)
library(dplyr)  # for %>%

## Get the HTML data from the page and process it using XML package
raw <- remDrv$getPageSource()[[1]]

counties_val <- xpathSApply(htmlParse(raw), '//*[@id="cboCounty"]/option', xmlAttrs)
chosen_county <- grep("selected", counties_val)

# Extract the Table (Election results)
resTable <- raw %>% readHTMLTable()
resDf <- resTable[[1]]  # return desired data frame from list of tables
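The parsing half of this workflow can be tried without a browser. The sketch below runs the XML package on a small hand-written page; only the cboCounty id comes from the slide code, while the HTML, county names and vote counts are invented.

```r
library(XML)

# Hypothetical page source standing in for remDrv$getPageSource()[[1]].
raw <- paste0(
  '<html><body>',
  '<select id="cboCounty">',
  '<option value="1">Adair</option>',
  '<option value="2" selected="selected">Andrew</option>',
  '</select>',
  '<table>',
  '<tr><th>Candidate</th><th>Party</th><th>Votes</th></tr>',
  '<tr><td>Candidate A</td><td>DEM</td><td>1,200</td></tr>',
  '<tr><td>Candidate B</td><td>REP</td><td>1,350</td></tr>',
  '</table></body></html>'
)

doc <- htmlParse(raw, asText = TRUE)

# Name of the county whose <option> carries the "selected" attribute.
county <- xpathSApply(doc, '//*[@id="cboCounty"]/option[@selected]', xmlValue)

# First table on the page, kept as character columns.
res <- readHTMLTable(doc, which = 1, header = TRUE, stringsAsFactors = FALSE)
res$county <- county
res$Votes <- as.numeric(gsub(",", "", res$Votes))  # "1,200" -> 1200
print(res)
```

Tagging each extracted table with its county before moving to the next dropdown entry makes it straightforward to stack the 100+ pages afterwards.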


Conclusions & Takeaways

• Great way to learn and contribute

• pdftools: good package for extracting text data from PDFs

• tabulizer: useful package for extracting tabular data from PDFs

• RSelenium, XML: great packages for web scraping with (simulated) forms

• Lots of work still needs to be done for recent elections (2000-2016) across all states

• 50 states, 100s of input files in a variety of formats per state

• Meaningful analysis can be done by data scientists once data is available

• Presidential election results get a lot of attention, but other races are arguably as important


Questions?

Rupal Agrawal

Info on OpenElections:

@openelex

docs.openelections.net

https://github.com/openelections
