Pipelines for data analysis in R
files.meetup.com/1406240/pipelines.pdf


Data analysis is the process by which data becomes understanding, knowledge and insight

[Diagram: Import → Tidy → Transform ⇄ Visualise ⇄ Model]

Visualise: surprises, but doesn't scale.
Model: scales, but doesn't (fundamentally) surprise.
Tidy: consistent way of storing data.
Transform: create new variables & new summaries.

[Diagram: Import (readr, readxl, haven, DBI, httr) → Tidy (tidyr) → Transform (dplyr) ⇄ Visualise (ggplot2, ggvis) ⇄ Model (broom)]

Pipelines

[Diagram: Think it (cognitive) → Describe it (precisely) → Do it (computational)]

Cognition time ≫ Computation time

Photo: http://www.flickr.com/photos/mutsmuts/4695658106

%>% (inspirations: unix pipes, F#, Haskell, Clojure, method chaining)

magrittr::

foo_foo <- little_bunny()

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs

foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)

x %>% f(y) # f(x, y)

x %>% f(z, .) # f(z, x)

x %>% f(y) %>% g(z) # g(f(x, y), z)

# Turns function composition (hard to read)
# into sequence (easy to read)

# Any function can use it. Only needs a simple
# property: the type of the first argument
# needs to be the same as the type of the result.
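A minimal runnable sketch of these rules with base R functions (the particular functions are illustrative choices, not from the slides):

library(magrittr)

x <- c(4, 1, 9)

x %>% sqrt()             # sqrt(x)
x %>% round(1)           # round(x, 1)
x %>% sort() %>% rev()   # rev(sort(x))
10 %>% setdiff(x, .)     # setdiff(x, 10): `.` slots the LHS into another position

Each step returns a vector, so the first-argument property holds throughout the chain.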

# tidyr:   pipelines for messy -> tidy data
# dplyr:   pipelines for data manipulation
# ggvis:   pipelines for visualisations
# rvest:   pipelines for html
# purrr:   pipelines for lists
# xml2:    pipelines for xml
# stringr: pipelines for strings

Tidy

[Pipeline diagram again, highlighting Tidy (tidyr)]

Storage        Meaning
Table / File   Data set
Rows           Observations
Columns        Variables

Tidy data = data that makes data analysis easy

Source: local data frame [5,769 x 22]

    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514
   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
1     AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
2     AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
3     AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
4     AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
5     AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
6     AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
7     AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA
8     AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA
9     AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
Variables not shown: f014 (int), f1524 (int), f2534 (int), f3544 (int),
  f4554 (int), f5564 (int), f65 (int), fu (int)

What are the variables in this dataset? (Hint: f = female, u = unknown, 1524 = 15-24)

# To convert this messy data into tidy data
# we need two verbs. First we need to gather
# together all the columns that aren't variables
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2

# Then separate the demographic variable into
# sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)
tb3

# Many tidyr verbs come in pairs:
#   spread vs. gather
#   extract/separate vs. unite
#   nest vs. unnest
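Since tb itself isn't included here, a self-contained sketch of the same gather-then-separate pipeline on a tiny made-up table (column names mimic the WHO data, values are invented):

library(tidyr)
library(dplyr)

messy <- data.frame(
  iso2  = c("AD", "AE"),
  year  = c(2000, 2000),
  m1524 = c(0, 2),   # males, 15-24
  f1524 = c(1, 3)    # females, 15-24
)

messy %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE) %>%  # columns -> rows
  separate(demo, c("sex", "age"), 1)               # "m1524" -> "m" + "1524"
#   iso2 year sex  age n
# 1   AD 2000   m 1524 0
# 2   AE 2000   m 1524 2
# 3   AD 2000   f 1524 1
# 4   AE 2000   f 1524 3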

Google for “tidyr” & “tidy data”

Transform

[Pipeline diagram again, highlighting Transform (dplyr)]

[Diagram again: Think it (cognitive) → Describe it (precisely) → Do it (computational)]

One table verbs

• select: subset variables by name
• filter: subset observations by value
• mutate: add new variables
• summarise: reduce to a single obs
• arrange: re-order the observations

+ group by
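A small runnable sketch of the verbs on a built-in dataset (mtcars is just a convenient stand-in; the slides demo a different dataset):

library(dplyr)

mtcars %>%
  select(mpg, wt, cyl) %>%             # subset variables by name
  filter(wt < 5) %>%                   # subset observations by value
  mutate(kpl = mpg * 0.425) %>%        # add a new variable (km per litre)
  group_by(cyl) %>%                    # + group by
  summarise(mean_kpl = mean(kpl)) %>%  # reduce to a single obs per group
  arrange(desc(mean_kpl))              # re-order the observations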

Demo

Mutating joins:
• left_join()
• right_join()
• inner_join()
• full_join()

Filtering joins:
• semi_join()
• anti_join()

Set operations:
• intersect()
• union()
• setdiff()
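A toy illustration of the mutating vs. filtering distinction (both data frames are made up for this sketch):

library(dplyr)

members <- data.frame(name = c("Mick", "John", "Paul"),
                      band = c("Stones", "Beatles", "Beatles"))
plays   <- data.frame(name = c("John", "Paul", "Keith"),
                      instrument = c("guitar", "bass", "guitar"))

members %>% left_join(plays, by = "name")  # mutating: adds an instrument column
members %>% semi_join(plays, by = "name")  # filtering: keeps matching rows, adds nothing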

dplyr sources
• Local data frame (C++)
• Local data table
• Local data cube (experimental)
• RDBMS: Postgres, MySQL, SQLite, Oracle, MS SQL, JDBC, Impala
• MonetDB, BigQuery
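A sketch of pointing the same verbs at a database source, using the dplyr API of the time (src_sqlite; the file and table names here are hypothetical):

library(dplyr)

db    <- src_sqlite("sales.sqlite3")  # hypothetical database file
sales <- tbl(db, "sales")             # remote table; nothing loaded yet

sales %>%
  group_by(city) %>%
  summarise(n = n())                  # dplyr translates the pipeline to SQL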

Google for “dplyr”

Visualise

[Pipeline diagram again, highlighting Visualise (ggplot2, ggvis)]

What is ggvis?
• A grammar of graphics (like ggplot2)
• Reactive (interactive & dynamic) (like shiny)
• A pipeline (a la dplyr)
• Of the web (drawn with vega)
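A minimal ggvis pipeline (a sketch on mtcars; the real demo lives in 4-ggvis.R / 4-ggvis.Rmd). The smooth's span is wired to a slider, which is what "reactive" means here:

library(ggvis)

mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points() %>%
  layer_smooths(span = input_slider(0.2, 1))  # interactive control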

Demo: 4-ggvis.R, 4-ggvis.Rmd

Google for “ggvis”

Model (with broom, by David Robinson)

[Pipeline diagram again, highlighting Model (broom)]

[Figure: log(sales) vs. date, 1990-2015. 46 TX cities, ~25 years of data]

What makes it hard to see the long-term trend?

# Models are useful as a tool for removing
# known patterns
tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>%
      resid()
  )

[Figure: resid vs. date, 1990-2015, after removing the monthly seasonal pattern]

# Models are also useful in their own right
models <- tx %>%
  group_by(city) %>%
  do(mod = lm(
    log(sales) ~ factor(month),
    data = .,
    na.action = na.exclude
  ))

Model summaries
• Model level: one row per model
• Coefficient level: one row per coefficient (per model)
• Observation level: one row per observation (per model)
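Each summary level maps to a broom verb. A sketch with a single stand-in model (mtcars instead of the Texas housing data):

library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

glance(fit)    # model level: one row per model (R^2, AIC, ...)
tidy(fit)      # coefficient level: one row per coefficient
augment(fit)   # observation level: one row per observation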

Demo: 5-broom.R

Google for “broom r”

Big data and R

Big      Can’t fit in memory on one computer: >5 TB
Medium   Fits in memory on a server: 10 GB - 5 TB
Small    Fits in memory on a laptop: <10 GB (R is great at this!)

R
• R provides an excellent environment for rapid interactive exploration of small data.
• There is no technical reason why it can't also work well with medium-size data. (But the work mostly hasn't been done.)
• What about big data?

1. Can be reduced to a small data problem with subsetting/sampling/summarising (90%)

2. Can be reduced to a very large number of small data problems (9%)

3. Is irreducibly big (1%)

The right small data

• Rapid iteration essential
• dplyr supports this activity by avoiding the cognitive costs of switching between languages.

Lots of small problems

• Embarrassingly parallel (e.g. Hadoop)
• R wrappers like foreach, rhipe, rhadoop
• The challenge is matching the architecture of the computation to the data storage (see the sketch below).
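A minimal foreach sketch of the "many small problems" pattern (mtcars split by cyl stands in for sharded data; the core count is arbitrary):

library(foreach)
library(doParallel)

registerDoParallel(cores = 2)

# Fit one small, independent model per shard, in parallel
fits <- foreach(d = split(mtcars, mtcars$cyl)) %dopar% {
  lm(mpg ~ wt, data = d)
}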

Irreducibly big

• Computation must be performed by a specialised system.
• Typically C/C++, Fortran, Scala.
• R needs to be able to talk to those systems.

Future work

End game: Provide a fluent interface where you spend your mental energy on the specific data problem, not the general data analysis process. The best tools become invisible with time! There is still a lot of work to do, especially on the connection between modelling and visualisation.

[Pipeline diagram once more: Import (readr, readxl, haven, DBI, httr) → Tidy (tidyr) → Transform (dplyr) ⇄ Visualise (ggplot2, ggvis) ⇄ Model (broom)]