[系列活動] Data exploration with modern R

Exploring data with modern R. Winston Chang, RStudio. 2016-12-21

Page 1: [系列活動] Data exploration with modern R

Exploring data with modern R

Winston Chang RStudio

2016-12-21

Page 2: [系列活動] Data exploration with modern R

https://hea-www.harvard.edu/~fine/Observatory/women.html

Page 3: [系列活動] Data exploration with modern R

Modern R

Page 4: [系列活動] Data exploration with modern R

A brief history of R

• In the beginning, there was S, developed at Bell Labs in the 1970s.

• S was owned and licensed by AT&T.

• In the 1990s, two professors from New Zealand created a free, open source reimplementation of S, called R.

• Many of the unusual features of R exist because they came from S.

• R itself is somewhat different from S and has a very flexible syntax.

Page 5: [系列活動] Data exploration with modern R

The tidyverse

install.packages("tidyverse")
# Automatically installs ggplot2, dplyr, tidyr, and others.

library(tidyverse)
tidyverse_update()
# Update all tidyverse packages to the latest version.

Page 6: [系列活動] Data exploration with modern R

Getting started

Page 7: [系列活動] Data exploration with modern R

Looking at data with R

faithful

head(faithful)

str(faithful)

View(faithful) # In RStudio

Page 8: [系列活動] Data exploration with modern R

library(ggplot2)

ggplot(data=faithful, mapping=aes(x=eruptions, y=waiting)) +
  geom_point()

# More concisely:
ggplot(faithful, aes(eruptions, waiting)) +
  geom_point()

[Figure: scatter plot of waiting (y axis) against eruptions (x axis)]

Page 9: [系列活動] Data exploration with modern R

ggplot(faithful, aes(x=eruptions)) + geom_histogram()

[Figure: histogram of eruptions with the default bins; x axis eruptions, y axis count]

ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=.25)

[Figure: histogram of eruptions with binwidth 0.25; x axis eruptions, y axis count]

Page 10: [系列活動] Data exploration with modern R

Your turn

Inspect the diamonds data set. With diamonds, make a histogram of the carat variable. Experiment with different bin sizes. What patterns do you see?

Inspect the mpg data set. With mpg, make a scatter plot showing the relationship between displ and hwy.

Page 11: [系列活動] Data exploration with modern R

ggplot(diamonds, aes(x=carat)) + geom_histogram()

[Figure: histogram of carat with the default bins; x axis carat, y axis count]

Page 12: [系列活動] Data exploration with modern R

ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.3)

[Figure: histogram of carat with binwidth 0.3; x axis carat, y axis count]

Page 13: [系列活動] Data exploration with modern R

ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.25)

[Figure: histogram of carat with binwidth 0.25; x axis carat, y axis count]

Page 14: [系列活動] Data exploration with modern R

ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.01)

[Figure: histogram of carat with binwidth 0.01; x axis carat, y axis count]

Page 15: [系列活動] Data exploration with modern R

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()

[Figure: scatter plot of hwy (y axis) against displ (x axis)]

Page 16: [系列活動] Data exploration with modern R

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(method=lm)

[Figure: scatter plot of hwy (y axis) against displ (x axis) with a fitted linear model line]

Page 17: [系列活動] Data exploration with modern R

head(mpg)

str(mpg)

View(mpg)

Page 18: [系列活動] Data exploration with modern R

ggplot(mpg, aes(x=displ, y=hwy, color=drv)) + geom_point()

[Figure: scatter plot of hwy against displ, points colored by drv (legend: 4, f, r)]

Page 19: [系列活動] Data exploration with modern R

ggplot(mpg, aes(x=displ, y=hwy, color=class)) + geom_point()

[Figure: scatter plot of hwy against displ, points colored by class (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]

Page 20: [系列活動] Data exploration with modern R

Your turn

What happens if you use shape instead of color?

Run ?geom_smooth to see the documentation. Then remove the confidence region from the model line. What happens if you add a model line and map a variable to color?
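One possible set of answers is sketched below. It is not from the slides: it assumes the ggplot2 package with its bundled mpg data, and it picks drv as the mapped variable purely for illustration. se=FALSE is the geom_smooth() argument that switches off the confidence region.

```r
library(ggplot2)

# Shape instead of color: each drv value gets its own point shape.
p_shape <- ggplot(mpg, aes(x=displ, y=hwy, shape=drv)) +
  geom_point()

# se=FALSE removes the confidence region around the model line.
p_line <- ggplot(mpg, aes(x=displ, y=hwy)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE)

# With color mapped in ggplot(), geom_smooth() inherits the mapping
# and fits one line per drv group.
p_groups <- ggplot(mpg, aes(x=displ, y=hwy, color=drv)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE)

p_shape
p_line
p_groups
```

Mapping the variable inside geom_point(aes(color=drv)) instead of in ggplot() would keep the colored points but leave a single model line, because the smoother would no longer see the grouping.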

Page 21: [系列活動] Data exploration with modern R

Faceting

Page 22: [系列活動] Data exploration with modern R

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_wrap(~class)

[Figure: scatter plots of hwy against displ, one panel per class: 2seater, compact, midsize, minivan, pickup, subcompact, suv]

Page 23: [系列活動] Data exploration with modern R

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(. ~ cyl)

[Figure: scatter plots of hwy against displ, one column per cyl value (4, 5, 6, 8)]

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(drv ~ .)

[Figure: scatter plots of hwy against displ, one row per drv value (4, f, r)]

Page 24: [系列活動] Data exploration with modern R

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(drv ~ cyl)

[Figure: scatter plots of hwy against displ in a grid, with one row per drv value (4, f, r) and one column per cyl value (4, 5, 6, 8)]

Page 25: [系列活動] Data exploration with modern R

Your turn

Try faceting with a histogram
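A minimal sketch of one way to approach this exercise, assuming ggplot2 and its bundled diamonds data. The choice of cut as the faceting variable and of 0.25 as the bin width is an illustration, not something the slides specify.

```r
library(ggplot2)

# One histogram of carat per level of cut, all panels sharing the same axes.
p <- ggplot(diamonds, aes(x=carat)) +
  geom_histogram(binwidth=0.25) +
  facet_wrap(~cut)

p
```

Because the panels share a y axis, cuts with few diamonds look flat; facet_wrap(~cut, scales="free_y") is one option if the shape of each distribution matters more than the counts.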

Page 26: [系列活動] Data exploration with modern R

ggplot2 concepts

Page 27: [系列活動] Data exploration with modern R

Geoms

Points

Lines

Bars

Error bars

Box plot

Page 28: [系列活動] Data exploration with modern R

Aesthetics

Y position

X position

Color

Size

Page 29: [系列活動] Data exploration with modern R

Aesthetics

Page 30: [系列活動] Data exploration with modern R

Mapping data values to aesthetics

var1 var2 var3
2    2    5
3    4    0
5    8    4
7    5    1

ggplot(dat, aes(x=var1, y=var2)) + geom_point()

[Figure: scatter plot of var2 against var1]

ggplot(dat, aes(x=var1, y=var2, color=var3)) + geom_point()

[Figure: scatter plot of var2 against var1, points colored by var3 (legend from 0 to 5)]
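The slides never show how dat is created. A minimal sketch, assuming it is an ordinary data frame holding the four rows from the table on this slide:

```r
library(ggplot2)

# The four rows shown in the table on this slide.
dat <- data.frame(
  var1 = c(2, 3, 5, 7),
  var2 = c(2, 4, 8, 5),
  var3 = c(5, 0, 4, 1)
)

# Position only: every point gets the same default color.
p_xy <- ggplot(dat, aes(x=var1, y=var2)) + geom_point()

# var3 is numeric, so color is drawn from a continuous gradient.
p_color <- ggplot(dat, aes(x=var1, y=var2, color=var3)) + geom_point()

p_xy
p_color
```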

Page 31: [系列活動] Data exploration with modern R

Setting data values to aesthetics

var1 var2 var3
2    2    5
3    4    0
5    8    4
7    5    1

ggplot(dat, aes(x=var1, y=var2)) + geom_point(color="red")

[Figure: scatter plot of var2 against var1 with red points]

ggplot(dat, aes(x=var1, y=var2)) + geom_point(color="red", size=6)

[Figure: scatter plot of var2 against var1 with large red points]

Page 32: [系列活動] Data exploration with modern R

Different geoms

ggplot(dat, aes(x=var1, y=var2)) + geom_point()

[Figure: var2 against var1 drawn as points]

ggplot(dat, aes(x=var1, y=var2)) + geom_line()

[Figure: var2 against var1 drawn as a line]

ggplot(dat, aes(x=var1, y=var2)) + geom_bar(stat="identity")

[Figure: var2 against var1 drawn as bars, y axis starting at 0]

Page 33: [系列活動] Data exploration with modern R

Using multiple geoms

ggplot(dat, aes(x=var1, y=var2)) +
  geom_point() +
  geom_line()

[Figure: var2 against var1 drawn as points joined by a line]

# Equivalent to
ggplot(dat) +
  geom_point(aes(x=var1, y=var2)) +
  geom_line(aes(x=var1, y=var2))

ggplot() +
  geom_point(aes(x=var1, y=var2), data=dat) +
  geom_line(aes(x=var1, y=var2), data=dat)

Default data: the first argument to ggplot() (dat)

Default mapping: the aes() given to ggplot()

Override defaults in each geom

Page 34: [系列活動] Data exploration with modern R

        Discrete                   Continuous
Color   Rainbow of colors          Gradient from light blue to dark blue
Size    Discrete size steps        Linear mapping between radius and value
Shape   Different shape for each   Shouldn't work
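The table can be tried out on mpg. This is a rough sketch, assuming ggplot2 and its bundled mpg data: cyl is stored as a number there, so wrapping it in factor() is used here to switch between the continuous and the discrete behaviour.

```r
library(ggplot2)

# cyl is numeric in mpg: continuous color scale (a blue gradient).
p_cont <- ggplot(mpg, aes(x=displ, y=hwy, color=cyl)) + geom_point()

# factor(cyl) is discrete: one distinct hue per level.
p_disc <- ggplot(mpg, aes(x=displ, y=hwy, color=factor(cyl))) + geom_point()

# A discrete variable mapped to shape gets one shape per level.
p_shape <- ggplot(mpg, aes(x=displ, y=hwy, shape=drv)) + geom_point()

# A continuous variable mapped to shape fails when the plot is built,
# so the error is captured here instead of stopping the script.
p_bad <- ggplot(mpg, aes(x=displ, y=hwy, shape=cyl)) + geom_point()
result <- tryCatch(ggplot_build(p_bad), error = function(e) e)
inherits(result, "error")
```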

Page 35: [系列活動] Data exploration with modern R

Mapping discrete variables

var1 var2 var3
A    G1   5
B    G0   0
A    G2   4
B    G1   1
A    G0   6
B    G2   3

ggplot(dat2, aes(x=var1, y=var3)) + geom_point()

[Figure: var3 against var1 (A, B) drawn as points]

ggplot(dat2, aes(x=var1, y=var3, color=var2)) + geom_point()

[Figure: var3 against var1 (A, B), points colored by var2 (legend: G0, G1, G2)]
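As with dat, the slides do not show how dat2 is built. A minimal sketch, assuming a plain data frame with the six rows from the table on this slide:

```r
library(ggplot2)

# The six rows shown in the table on this slide.
dat2 <- data.frame(
  var1 = c("A", "B", "A", "B", "A", "B"),
  var2 = c("G1", "G0", "G2", "G1", "G0", "G2"),
  var3 = c(5, 0, 4, 1, 6, 3)
)

# var1 and var2 are character columns, so ggplot2 treats them as discrete:
# var1 becomes a categorical x axis and var2 gets one color per value.
p <- ggplot(dat2, aes(x=var1, y=var3, color=var2)) + geom_point()

p
```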

Page 36: [系列活動] Data exploration with modern R

Data wrangling with modern R

Page 37: [系列活動] Data exploration with modern R

Tidyverse = Tidy + universe

Source: https://www.flickr.com/photos/rubbermaid/7203340384

Source: http://hubblesite.org/newscenter/archive/releases/2014/27/image/a/

Page 38: [系列活動] Data exploration with modern R

Tibbles

faithful

as.tbl(faithful)
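In dplyr releases newer than this talk, as.tbl() appears to be deprecated in favour of as_tibble(). A short sketch of the same conversion, assuming the tibble package is installed:

```r
library(tibble)

faithful_tbl <- as_tibble(faithful)

# A tibble prints only the first rows, with the column types under the names.
faithful_tbl

# It is still a data frame, so existing code keeps working.
is.data.frame(faithful_tbl)
```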

Page 39: [系列活動] Data exploration with modern R

Tidy data

Each variable is in a column

Each observation is in a row

Page 40: [系列活動] Data exploration with modern R

Example of non-tidy data

subject sex cond1 cond2 cond3

1 M 7.9 12.3 10.7

2 F 6.3 10.6 11.1

3 F 9.5 13.1 13.8

4 M 11.5 13.4 12.9

Each row has 3 observations

Not Tidy

Page 41: [系列活動] Data exploration with modern R

Converting to tidy data

subject sex cond1 cond2 cond3

1 M 7.9 12.3 10.7

2 F 6.3 10.6 11.1

3 F 9.5 13.1 13.8

4 M 11.5 13.4 12.9

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

Not Tidy

Tidy

Page 42: [系列活動] Data exploration with modern R

• filter: Keep rows

• select: Keep columns

• mutate: Add new columns

• arrange: Sort rows

• summarise: Reduce variables
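The five verbs can be tried one after another on mpg. This is only a sketch: the intermediate names (compact, narrow, with_avg, sorted, overall) are made up for illustration, and ggplot2 is loaded just to get the mpg data.

```r
library(dplyr)
library(ggplot2)  # assumed here only as the source of the mpg data set

# filter: keep rows
compact <- filter(mpg, class == "compact")

# select: keep columns
narrow <- select(compact, manufacturer, model, cty, hwy)

# mutate: add a new column
with_avg <- mutate(narrow, avg = (cty + hwy) / 2)

# arrange: sort rows, highest avg first
sorted <- arrange(with_avg, desc(avg))

# summarise: reduce the table to a single row of summary values
overall <- summarise(sorted, n = n(), avg_m = mean(avg))

overall
```

Each verb takes a data frame as its first argument and returns a new data frame, which is why the output of one step can be handed straight to the next.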

Page 43: [系列活動] Data exploration with modern R

Filter: get a subset of rows

# Traditional R
mpg[mpg$hwy > 30, ]

# dplyr
filter(mpg, hwy > 30)

Page 44: [系列活動] Data exploration with modern R

Filter: get a subset of rows

# AND
filter(mpg, hwy > 30, class == "compact")
filter(mpg, hwy > 30 & class == "compact")

# OR
filter(mpg, hwy > 30 | class == "compact")

Page 45: [系列活動] Data exploration with modern R

%>%

Page 46: [系列活動] Data exploration with modern R

Piping with %>%

filter(mpg, hwy > 30)

mpg %>% filter(hwy > 30)

select(filter(mpg, hwy > 30), model, hwy, class)

mpg %>%
  filter(hwy > 30) %>%
  select(model, hwy, class)

mpg %>%
  filter(hwy > 30) %>%
  select(model, hwy, class) %>%
  View()
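The pipe passes its left-hand side as the first argument of the call on its right, so x %>% f(y) is another way of writing f(x, y). A small check of that equivalence, assuming dplyr and the mpg data from ggplot2; the names nested and piped are illustrative.

```r
library(dplyr)
library(ggplot2)  # assumed here only as the source of the mpg data set

# Nested calls read from the inside out.
nested <- select(filter(mpg, hwy > 30), model, hwy, class)

# The piped version reads left to right, one step per verb.
piped <- mpg %>%
  filter(hwy > 30) %>%
  select(model, hwy, class)

# Both forms should produce the same table.
all.equal(nested, piped)
```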

Page 47: [系列活動] Data exploration with modern R

Select: get a subset of columns

# Traditional R
mpg[, c("model", "displ", "cyl", "drv", "class", "hwy")]

# dplyr
select(mpg, model, displ, cyl, drv, class, hwy)

select(mpg, -manufacturer, -fl)

Page 48: [系列活動] Data exploration with modern R

Mutate: add new columns

# Traditional R
mpg$avg <- (mpg$cty + mpg$hwy)/2

# dplyr
mpg %>% mutate(avg = (cty+hwy)/2)

mpg %>%
  mutate(
    avg = (cty+hwy)/2,
    ratio = hwy/cty
  )

Page 49: [系列活動] Data exploration with modern R

Arrange: sort rows

# Traditional R
mpg[order(mpg$hwy), ]

# dplyr
arrange(mpg, hwy)

Page 50: [系列活動] Data exploration with modern R

Summarise: reduce variables

# Traditional R
mean(mpg$hwy)
sd(mpg$hwy)

# dplyr
summarise(mpg, hwy_m = mean(hwy))

summarise(mpg,
  hwy_m = mean(hwy),
  hwy_sd = sd(hwy),
  cty_m = mean(cty),
  cty_sd = sd(cty)
)

In dplyr, summarise() and summarize() are the same function (British and American spellings).

Page 51: [系列活動] Data exploration with modern R

Group operations

Page 52: [系列活動] Data exploration with modern R

Why is this important?

Page 53: [系列活動] Data exploration with modern R

Summarise

subject sex condition value
1       M   cond1     7.9
1       M   cond2     12.3
1       M   cond3     10.7
2       F   cond1     6.3
2       F   cond2     10.6
2       F   cond3     11.1
3       F   cond1     9.5
3       F   cond2     13.1
3       F   cond3     13.8
4       M   cond1     11.5
4       M   cond2     13.4
4       M   cond3     12.9

data %>% summarise(value = mean(value))

value
11.1

Page 54: [系列活動] Data exploration with modern R

Group-wise summarise

subject sex condition value
1       M   cond1     7.9
1       M   cond2     12.3
1       M   cond3     10.7
2       F   cond1     6.3
2       F   cond2     10.6
2       F   cond3     11.1
3       F   cond1     9.5
3       F   cond2     13.1
3       F   cond3     13.8
4       M   cond1     11.5
4       M   cond2     13.4
4       M   cond3     12.9

data %>% group_by(subject) %>% summarise(value = mean(value))

subject value
1       10.3
2       9.3
3       12.1
4       12.6

Page 55: [系列活動] Data exploration with modern R

Group-wise summarise

subject sex condition value
1       M   cond1     7.9
1       M   cond2     12.3
1       M   cond3     10.7
2       F   cond1     6.3
2       F   cond2     10.6
2       F   cond3     11.1
3       F   cond1     9.5
3       F   cond2     13.1
3       F   cond3     13.8
4       M   cond1     11.5
4       M   cond2     13.4
4       M   cond3     12.9

data %>% group_by(sex, condition) %>% summarise(value = mean(value))

sex condition value
F   cond1     7.9
F   cond2     11.9
F   cond3     12.5
M   cond1     9.7
M   cond2     12.9
M   cond3     11.8
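The slides use data without showing how it is built. The sketch below assumes it is a plain data frame with the twelve rows shown above and reruns both grouped summaries. The .groups = "drop" argument is an addition that assumes dplyr 1.0.0 or later; it only controls whether the result stays grouped.

```r
library(dplyr)

# The twelve rows from the slide, entered subject by subject.
data <- data.frame(
  subject   = rep(1:4, each = 3),
  sex       = rep(c("M", "F", "F", "M"), each = 3),
  condition = rep(c("cond1", "cond2", "cond3"), times = 4),
  value     = c(7.9, 12.3, 10.7,  6.3, 10.6, 11.1,
                9.5, 13.1, 13.8, 11.5, 13.4, 12.9)
)

# One mean per subject (each mean covers three conditions).
by_subject <- data %>%
  group_by(subject) %>%
  summarise(value = mean(value))

# One mean per sex and condition (each mean covers two subjects).
by_sex_condition <- data %>%
  group_by(sex, condition) %>%
  summarise(value = mean(value), .groups = "drop")

by_subject
by_sex_condition
```

For instance, the F, cond1 cell is the mean of 6.3 and 9.5 (subjects 2 and 3), which is 7.9.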

Page 56: [系列活動] Data exploration with modern R

Mutate

subject sex condition value
1       M   cond1     7.9
1       M   cond2     12.3
1       M   cond3     10.7
2       F   cond1     6.3
2       F   cond2     10.6
2       F   cond3     11.1
3       F   cond1     9.5
3       F   cond2     13.1
3       F   cond3     13.8
4       M   cond1     11.5
4       M   cond2     13.4
4       M   cond3     12.9

data %>% mutate(norm = value - mean(value))

subject sex condition value norm
1       M   cond1     7.9   -3.2
1       M   cond2     12.3  1.2
1       M   cond3     10.7  -0.4
2       F   cond1     6.3   -4.8
2       F   cond2     10.6  -0.5
2       F   cond3     11.1  0
3       F   cond1     9.5   -1.6
3       F   cond2     13.1  2
3       F   cond3     13.8  2.7
4       M   cond1     11.5  0.4
4       M   cond2     13.4  2.3
4       M   cond3     12.9  1.8

Page 57: [系列活動] Data exploration with modern R

Group-wise mutate

subject sex condition value
1       M   cond1     7.9
1       M   cond2     12.3
1       M   cond3     10.7
2       F   cond1     6.3
2       F   cond2     10.6
2       F   cond3     11.1
3       F   cond1     9.5
3       F   cond2     13.1
3       F   cond3     13.8
4       M   cond1     11.5
4       M   cond2     13.4
4       M   cond3     12.9

data %>% group_by(subject) %>% mutate(norm = value - mean(value))

subject sex condition value norm
1       M   cond1     7.9   -2.4
1       M   cond2     12.3  2
1       M   cond3     10.7  0.4
2       F   cond1     6.3   -3
2       F   cond2     10.6  1.3
2       F   cond3     11.1  1.8
3       F   cond1     9.5   -2.6
3       F   cond2     13.1  1
3       F   cond3     13.8  1.7
4       M   cond1     11.5  -1.1
4       M   cond2     13.4  0.8
4       M   cond3     12.9  0.3
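The difference between the plain and the group-wise mutate is which mean gets subtracted: the overall mean, or each subject's own mean. A sketch of the grouped version, assuming data is a plain data frame with the twelve rows shown above; the trailing ungroup() is an addition so that later steps are not grouped by accident.

```r
library(dplyr)

# The twelve rows from the slide, entered subject by subject.
data <- data.frame(
  subject   = rep(1:4, each = 3),
  sex       = rep(c("M", "F", "F", "M"), each = 3),
  condition = rep(c("cond1", "cond2", "cond3"), times = 4),
  value     = c(7.9, 12.3, 10.7,  6.3, 10.6, 11.1,
                9.5, 13.1, 13.8, 11.5, 13.4, 12.9)
)

# Inside group_by(), mean(value) is computed separately for each subject,
# so norm is each value's distance from that subject's own mean.
normed <- data %>%
  group_by(subject) %>%
  mutate(norm = value - mean(value)) %>%
  ungroup()

# Within each subject the norm values sum to zero (up to rounding error).
normed %>%
  group_by(subject) %>%
  summarise(total = sum(norm))
```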

Page 58: [系列活動] Data exploration with modern R

Tidying data with tidyr

Page 59: [系列活動] Data exploration with modern R

Converting to tidy data

subject sex cond1 cond2 cond3

1 M 7.9 12.3 10.7

2 F 6.3 10.6 11.1

3 F 9.5 13.1 13.8

4 M 11.5 13.4 12.9

subject sex condition value

1 M cond1 7.9

1 M cond2 12.3

1 M cond3 10.7

2 F cond1 6.3

2 F cond2 10.6

2 F cond3 11.1

3 F cond1 9.5

3 F cond2 13.1

3 F cond3 13.8

4 M cond1 11.5

4 M cond2 13.4

4 M cond3 12.9

Not Tidy

Tidy

Page 60: [系列活動] Data exploration with modern R

Converting to tidy data

data:

subject sex cond1 cond2 cond3
1       M   7.9   12.3  10.7
2       F   6.3   10.6  11.1
3       F   9.5   13.1  13.8
4       M   11.5  13.4  12.9

gather(data, condition, value, cond1:cond3)

subject sex condition value
1       M   cond1     7.9
1       M   cond2     12.3
1       M   cond3     10.7
2       F   cond1     6.3
2       F   cond2     10.6
2       F   cond3     11.1
3       F   cond1     9.5
3       F   cond2     13.1
3       F   cond3     13.8
4       M   cond1     11.5
4       M   cond2     13.4
4       M   cond3     12.9
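A runnable sketch of this reshaping, assuming data is a plain data frame with the four wide rows shown above. Two things here go beyond the slide: the arrange() call, added because gather() stacks the columns one after another rather than keeping each subject's rows together, and pivot_longer(), which newer tidyr releases (1.0.0 and later, as far as this sketch assumes) offer for the same job.

```r
library(tidyr)
library(dplyr)

# The wide, non-tidy table: one row per subject, one column per condition.
data <- data.frame(
  subject = 1:4,
  sex     = c("M", "F", "F", "M"),
  cond1   = c(7.9, 6.3, 9.5, 11.5),
  cond2   = c(12.3, 10.6, 13.1, 13.4),
  cond3   = c(10.7, 11.1, 13.8, 12.9)
)

# gather() puts all cond1 rows first, then cond2, then cond3,
# so sort by subject to match the layout shown on the slide.
long_gather <- gather(data, condition, value, cond1:cond3) %>%
  arrange(subject, condition)

# pivot_longer() does the same reshaping with named arguments.
long_pivot <- pivot_longer(data, cond1:cond3,
                           names_to = "condition", values_to = "value")

long_gather
long_pivot
```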

Page 61: [系列活動] Data exploration with modern R

Thank you!