Pipelines for data analysis in R
files.meetup.com/1406240/pipelines.pdf

Hadley Wickham @hadleywickham Chief Scientist, RStudio Pipelines for data analysis in R October 2015


Page 2

Data analysis is the process by which data becomes understanding, knowledge and insight


Page 4

Import → Tidy → Transform / Visualise / Model

Tidy: Consistent way of storing data
Transform: Create new variables & new summaries
Visualise: Surprises, but doesn't scale
Model: Scales, but doesn't (fundamentally) surprise

Page 5

Import: readr, readxl, haven, DBI, httr
Tidy: tidyr
Transform: dplyr
Visualise: ggplot2, ggvis
Model: broom

Page 6

Pipelines

Page 7

Think it → Describe it (precisely) → Do it

Cognitive: Think it, Describe it
Computational: Do it

Page 8

Cognition time ≫ Computation time

http://www.flickr.com/photos/mutsmuts/4695658106

Page 9

magrittr::%>%

Inspirations: unix, F#, haskell, clojure, method chaining

Page 10

foo_foo <- little_bunny()

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs
foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)

Page 11

x %>% f(y)           # f(x, y)
x %>% f(z, .)        # f(z, x)
x %>% f(y) %>% g(z)  # g(f(x, y), z)

# Turns function composition (hard to read)
# into sequence (easy to read)
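The three rewrites above can be checked directly. This is a small sketch that assumes the magrittr package is installed; `f` and `g` are hypothetical two-argument functions invented here so that argument order is visible in the result.

```r
library(magrittr)

# Hypothetical functions (not from the talk); subtraction makes argument order matter
f <- function(a, b) a - b
g <- function(a, b) a * b

x <- 10
y <- 3
z <- 2

r1 <- x %>% f(y)            # piped value becomes the first argument: f(x, y)
r2 <- x %>% f(z, .)         # the dot places the piped value explicitly: f(z, x)
r3 <- x %>% f(y) %>% g(z)   # steps read left to right: g(f(x, y), z)

print(c(r1, r2, r3))
```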

Page 12

# Any function can use it. Only needs a simple
# property: the type of the first argument
# needs to be the same as the type of the result.

# tidyr: pipelines for messy -> tidy data
# dplyr: pipelines for data manipulation
# ggvis: pipelines for visualisations
# rvest: pipelines for html
# purrr: pipelines for lists
# xml2: pipelines for xml
# stringr: pipelines for strings
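To illustrate the "first argument has the same type as the result" property, here is a minimal sketch with three hypothetical helpers (`drop_missing`, `rescale01` and `clamp` are invented for this example, not taken from any package). Each takes a numeric vector first and returns a numeric vector, so they chain freely.

```r
library(magrittr)

# Hypothetical helpers: numeric vector in (first argument), numeric vector out
drop_missing <- function(x) x[!is.na(x)]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
clamp <- function(x, lower, upper) pmin(pmax(x, lower), upper)

scores <- c(5, NA, 15, 10, 25)

result <- scores %>%
  drop_missing() %>%   # remove the NA
  rescale01() %>%      # map onto [0, 1]
  clamp(0.1, 0.9)      # pull the extremes inward

print(result)
```

Because every step hands back the same kind of object it received, the steps can be reordered, removed or extended without touching the others.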

Page 13

Tidy


Page 15

Storage        Meaning
Table / File   Data set
Rows           Observations
Columns        Variables

Tidy data = data that makes data analysis easy

Page 16

Source: local data frame [5,769 x 22]

    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514
   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
 1    AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2    AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3    AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4    AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5    AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6    AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7    AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA
 8    AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA
 9    AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
Variables not shown: f014 (int), f1524 (int), f2534 (int), f3544 (int), f4554 (int), f5564 (int), f65 (int), fu (int)

What are the variables in this dataset? (Hint: f = female, u = unknown, 1524 = 15-24)

Page 17

# To convert this messy data into tidy data
# we need two verbs. First we need to gather
# together all the columns that aren't variables

tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2

Page 18

# Then separate the demographic variable into
# sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)
tb3

# Many tidyr verbs come in pairs:
# spread vs. gather
# extract/separate vs. unite
# nest vs. unnest
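The two steps can be tried end to end on a toy table. The sketch below assumes tidyr and magrittr are installed; the two-row `tb` is made-up data with the same column naming scheme as the slide, not the real tuberculosis data set.

```r
library(tidyr)
library(magrittr)

# Toy stand-in for tb: made-up counts, same "sex + age range" column names
tb <- data.frame(
  iso2  = c("AD", "AD"),
  year  = c(1996L, 2005L),
  m014  = c(0L, 0L),
  m1524 = c(0L, 0L),
  f014  = c(NA, 0L),
  f1524 = c(NA, 1L),
  stringsAsFactors = FALSE
)

# Step 1: gather every column that is not a variable into key/value pairs
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)

# Step 2: split the key after its first character into sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)

print(tb3)
```

Each row of `tb3` is now one observation (country, year, sex, age group) with its count in `n`.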

Page 19

Google for “tidyr” & “tidy data”

Page 20

Transform


Page 23

One table verbs

• select: subset variables by name
• filter: subset observations by value
• mutate: add new variables
• summarise: reduce to a single obs
• arrange: re-order the observations

+ group by
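As a rough illustration of how the five verbs and group by combine, here is a sketch on R's built-in mtcars data, assuming dplyr is installed. The choice of columns, the weight cut-off and the approximate miles-per-gallon to km-per-litre factor are arbitrary choices for this example.

```r
library(dplyr)

by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%                    # subset variables by name
  filter(wt < 4) %>%                          # subset observations by value
  mutate(kpl = mpg * 0.425) %>%               # add a new variable (approximate km per litre)
  group_by(cyl) %>%                           # later verbs now work per group
  summarise(n = n(), mean_kpl = mean(kpl)) %>%  # reduce each group to a single row
  arrange(desc(mean_kpl))                     # re-order the rows

print(by_cyl)
```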

Page 24

Demo

Page 25

Mutating: inner_join(), left_join(), right_join(), full_join()
Filtering: semi_join(), anti_join()
Set: intersect(), union(), setdiff()
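The difference between the three families is easiest to see on two tiny tables. This sketch assumes dplyr is installed; `x` and `y` are made-up tables that share a `key` column.

```r
library(dplyr)

# Made-up tables: keys 2 and 3 appear in both
x <- data.frame(key = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- data.frame(key = c(2, 3, 4), y_val = c("B", "C", "D"))

# Mutating joins add columns from y to x
inner <- inner_join(x, y, by = "key")   # keys present in both
left  <- left_join(x, y, by = "key")    # every row of x
full  <- full_join(x, y, by = "key")    # every key from either table

# Filtering joins keep or drop rows of x, without adding columns
semi <- semi_join(x, y, by = "key")     # rows of x with a match in y
anti <- anti_join(x, y, by = "key")     # rows of x without a match in y

# Set operations compare whole rows, so both inputs need the same columns
common <- intersect(x["key"], y["key"])
either <- union(x["key"], y["key"])
only_x <- setdiff(x["key"], y["key"])
```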

Page 26

dplyr sources

• Local data frame (C++)
• Local data table
• Local data cube (experimental)
• RDBMS: Postgres, MySQL, SQLite, Oracle, MS SQL, JDBC, Impala
• MonetDB, BigQuery
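A minimal sketch of pointing the same verbs at a database source. It assumes the dplyr, dbplyr, DBI and RSQLite packages are installed, uses an in-memory SQLite database with the built-in mtcars data purely for illustration, and reflects the present-day interface rather than anything shown in the talk.

```r
library(dplyr)

# In-memory SQLite database, used here only as a stand-in for a real server
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")

# The pipeline is translated to SQL and run by the database;
# collect() brings the (small) result back into R
by_cyl <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

DBI::dbDisconnect(con)

print(by_cyl)
```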

Page 27

Google for “dplyr”

Page 28

Visualise


Page 30

What is ggvis?

• A grammar of graphics (like ggplot2)
• Reactive (interactive & dynamic) (like shiny)
• A pipeline (a la dplyr)
• Of the web (drawn with vega)
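A small sketch of what "a pipeline" means for graphics, assuming the ggvis package is installed. It is not the talk's demo script; it only builds a plot object from the built-in mtcars data.

```r
library(ggvis)

# The data flows into ggvis(), and each layer is added with the pipe
p <- mtcars %>%
  ggvis(x = ~wt, y = ~mpg) %>%
  layer_points() %>%
  layer_smooths()

class(p)
```

Printing `p` at an interactive console renders the plot in a browser or viewer pane.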

Page 31

Demo: 4-ggvis.R, 4-ggvis.Rmd

Page 32

Google for “ggvis”

Page 33

Model

with broom, by David Robinson


Page 35

[Figure: log(sales) against date, 1990 to 2015]

46 TX cities, ~25 years of data

What makes it hard to see the long term trend?

Page 36

# Models are useful as a tool for removing
# known patterns

tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>% resid()
  )
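The same idea can be tried on simulated data. The sketch below assumes dplyr is installed; the two "cities", the seasonal pattern and the trend are all invented for this example and stand in for the housing sales data used in the talk.

```r
library(dplyr)

set.seed(1)

# Simulated monthly sales for two made-up cities: seasonal cycle + slow trend + noise
tx <- expand.grid(
  city = c("A", "B"), year = 2000:2004, month = 1:12,
  stringsAsFactors = FALSE
) %>%
  mutate(
    sales = exp(5 + 0.3 * sin(2 * pi * month / 12) +
                0.05 * (year - 2000) + rnorm(n(), sd = 0.1))
  )

# Fit the month effect separately for each city and keep what is left over
tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = resid(lm(log(sales) ~ factor(month), na.action = na.exclude))
  ) %>%
  ungroup()

head(tx)
```

With the monthly pattern removed, the residuals carry the remaining signal (here, the simulated yearly trend plus noise).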

Page 37

[Figure: resid against date, 1990 to 2015]

Page 38

# Models are also useful in their own right

models <- tx %>%
  group_by(city) %>%
  do(mod = lm(
    log(sales) ~ factor(month),
    data = .,
    na.action = na.exclude
  ))

Page 39

Model summaries

• Model level: one row per model
• Coefficient level: one row per coefficient (per model)
• Observation level: one row per observation (per model)
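The three levels correspond to three broom functions: `glance()`, `tidy()` and `augment()`. This is a sketch on simulated data, assuming dplyr and broom are installed; the data and the helper `fit_city` are made up for illustration, and this is not the talk's demo script.

```r
library(dplyr)
library(broom)

set.seed(1)

# Simulated monthly sales for two made-up cities
tx <- expand.grid(
  city = c("A", "B"), year = 2000:2004, month = 1:12,
  stringsAsFactors = FALSE
)
tx$sales <- exp(5 + 0.3 * sin(2 * pi * tx$month / 12) + rnorm(nrow(tx), sd = 0.1))

# Hypothetical helper: the per-city model from the previous slides
fit_city <- function(df) lm(log(sales) ~ factor(month), data = df)

model_level <- tx %>% group_by(city) %>% do(glance(fit_city(.)))   # one row per model
coef_level  <- tx %>% group_by(city) %>% do(tidy(fit_city(.)))     # one row per coefficient
obs_level   <- tx %>% group_by(city) %>% do(augment(fit_city(.)))  # one row per observation

c(nrow(model_level), nrow(coef_level), nrow(obs_level))
```

Each result is an ordinary data frame, so it can go straight back into dplyr or a plotting pipeline.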

Page 40

Demo: 5-broom.R

Page 41

Google for “broom r”

Page 42

Big data and R

Page 43

Big Can’t fit in memory on one computer: >5 TB

Medium Fits in memory on a server: 10 GB-5 TB

Small Fits in memory on a laptop: <10 GB

R is great at this!

Page 44

R

• R provides an excellent environment for rapid interactive exploration of small data.
• There is no technical reason why it can’t also work well with medium size data. (But the work mostly hasn’t been done)
• What about big data?

Page 45

1. Can be reduced to a small data problem with subsetting/sampling/summarising (90%)

2. Can be reduced to a very large number of small data problems (9%)

3. Is irreducibly big (1%)

Page 46

The right small data

• Rapid iteration essential
• dplyr supports this activity by avoiding cognitive costs of switching between languages.

Page 47

Lots of small problems

• Embarrassingly parallel (e.g. Hadoop)
• R wrappers like foreach, rhipe, rhadoop
• The challenge is matching the architecture of computing to data storage
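On a single machine, the "many small problems" pattern can be sketched with base R's parallel package instead of the Hadoop wrappers named above (a swapped-in stand-in, chosen so the example runs anywhere). The data and the `slope` helper are made up; `mc.cores = 1` keeps the example portable, and it can be raised on Unix-alike systems.

```r
library(parallel)

set.seed(1)

# Made-up data: four groups, each an independent small problem
df <- data.frame(group = rep(letters[1:4], each = 25), x = rnorm(100))
df$y <- 2 * df$x + rnorm(100, sd = 0.1)

# Split into pieces, then solve each piece on its own
pieces <- split(df, df$group)
slope <- function(d) unname(coef(lm(y ~ x, data = d))["x"])

serial_res   <- lapply(pieces, slope)
parallel_res <- mclapply(pieces, slope, mc.cores = 1)  # assumption: 1 core, for portability

unlist(parallel_res)
```

Because no piece depends on another, the serial and parallel versions give the same answer; only the scheduling changes.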

Page 48

Irreducibly big

• Computation must be performed by a specialised system.
• Typically C/C++, Fortran, Scala.
• R needs to be able to talk to those systems.

Page 49

Future work

Page 50

End game

Provide a fluent interface where you spend your mental energy on the specific data problem, not the general data analysis process.

The best tools become invisible with time!

Still a lot of work to do, especially on the connection between modelling and visualisation.
