[Event series] Data exploration with modern R
Date posted: 21-Apr-2017
Exploring data with modern R
Winston Chang, RStudio
2016-12-21
https://hea-www.harvard.edu/~fine/Observatory/women.html
Modern R
A brief history of R
• In the beginning, there was S, developed at Bell Labs in the 1970s.
• S was owned and licensed by AT&T.
• In the 1990s, two professors from New Zealand created a free, open source reimplementation of S, called R.
• Many of the unusual features of R exist because they came from S.
• R itself is somewhat different from S and has a very flexible syntax.
install.packages("tidyverse")
# Automatically installs ggplot2, dplyr, tidyr, and others.
library(tidyverse)
tidyverse_update()
# Update all tidyverse packages to the latest version.
The tidyverse
Getting started
faithful
head(faithful)
str(faithful)
View(faithful) # In RStudio
Looking at data with R
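Beyond head() and str(), a few base R calls give a quick numeric overview before any plotting. This is a minimal sketch that uses only base R; the variable name n_missing is just for illustration.

```r
# faithful ships with base R (datasets package), so no library() call is needed.
dim(faithful)      # number of rows and columns
names(faithful)    # column names
summary(faithful)  # min, quartiles, mean, and max of each column

# Count missing values per column before plotting.
n_missing <- colSums(is.na(faithful))
n_missing
```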
[Figure: scatter plot of the faithful data; x-axis: eruptions, y-axis: waiting]
library(ggplot2)
ggplot(data=faithful, mapping=aes(x=eruptions, y=waiting)) + geom_point()
# More concisely:
ggplot(faithful, aes(eruptions, waiting)) + geom_point()
[Figures: two histograms of the faithful data; x-axis: eruptions, y-axis: count]
ggplot(faithful, aes(x=eruptions)) + geom_histogram()
ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=.25)
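The two calls above differ only in how the bins are chosen. As a rough sketch of the options, the block below builds the default histogram, one with a fixed bin width, and one with a fixed number of bins, then uses layer_data() to look at the computed bins. The plot variable names and the choice of 15 bins are arbitrary.

```r
library(ggplot2)

# Default: 30 bins (ggplot2 prints a message suggesting you pick a binwidth).
p_default <- ggplot(faithful, aes(x = eruptions)) + geom_histogram()

# Set the width of each bin in data units (minutes of eruption time).
p_width <- ggplot(faithful, aes(x = eruptions)) + geom_histogram(binwidth = 0.25)

# Or set the number of bins instead of their width.
p_bins <- ggplot(faithful, aes(x = eruptions)) + geom_histogram(bins = 15)

# layer_data() returns the computed bins, which helps when comparing settings.
bins_width <- layer_data(p_width)
nrow(bins_width)        # how many bins were drawn
sum(bins_width$count)   # every observation lands in exactly one bin
```

Call print(p_width), or type the variable name at the console, to draw a plot.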
Your turn
Inspect the diamonds data set. With diamonds, make a histogram of the carat variable. Experiment with different bin sizes. What patterns do you see?
Inspect the mpg data set. With mpg, make a scatter plot showing the relationship between displ and hwy.
[Figure: histogram of diamonds; x-axis: carat, y-axis: count]
ggplot(diamonds, aes(x=carat)) + geom_histogram()
ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.3)
[Figures: two histograms of diamonds; x-axis: carat, y-axis: count]
ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.25)
ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.01)
[Figure: histogram of diamonds with narrow bins; x-axis: carat, y-axis: count]
ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()
[Figure: scatter plot of mpg; x-axis: displ, y-axis: hwy]
ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(method=lm)
[Figure: scatter plot of mpg with a fitted line; x-axis: displ, y-axis: hwy]
head(mpg)
str(mpg)
View(mpg)
ggplot(mpg, aes(x=displ, y=hwy, color=drv)) + geom_point()
[Figure: scatter plot of hwy vs. displ, colored by drv (legend: 4, f, r)]
[Figure: scatter plot of hwy vs. displ, colored by class (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]
ggplot(mpg, aes(x=displ, y=hwy, color=class)) + geom_point()
Your turn
What happens if you use shape instead of color?
Run ?geom_smooth to see the documentation. Then remove the confidence region from the model line. What happens if you add a model line and map a variable to color?
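One possible way to approach the second exercise is sketched below. It relies on the se argument of geom_smooth() and on the fact that aesthetics given in ggplot() are inherited by every layer. The plot variable names are made up for this example.

```r
library(ggplot2)

# se = FALSE drops the shaded confidence band around the fitted line.
p_line <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Mapping color in ggplot() applies to every layer, so geom_smooth()
# fits one line per drv group.
p_by_drv <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Mapping color only inside geom_point() keeps a single overall line.
p_single <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", se = FALSE)
```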
Faceting
ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_wrap(~class)
[Figure: scatter plots of hwy vs. displ, one panel per class (2seater, compact, midsize, minivan, pickup, subcompact, suv)]
[Figure: scatter plots of hwy vs. displ, one column per cyl (4, 5, 6, 8)]
ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(. ~ cyl)
[Figure: scatter plots of hwy vs. displ, one row per drv (4, f, r)]
ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(drv ~ .)
[Figure: grid of scatter plots of hwy vs. displ; rows by drv (4, f, r), columns by cyl (4, 5, 6, 8)]
ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(drv ~ cyl)
Rows: drv
Columns: cyl
Your turn
Try faceting with a histogram
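A possible starting point for this exercise, assuming the mpg data and an arbitrary bin width of 2: facet a histogram of hwy by drv. The scales argument of facet_wrap() is optional and shown only as one variation.

```r
library(ggplot2)

# One histogram of highway mileage per drive type, on shared axes.
p_facet <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~drv)

# scales = "free_y" lets each panel use its own count axis, which helps
# when the groups differ a lot in size.
p_free <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~drv, scales = "free_y")
```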
ggplot2 concepts
Geoms
Points
Lines
Bars
Error bars
Box plot
Aesthetics
Y position
X position
Color
Size
Aesthetics
Mapping data values to aesthetics
[Figure: scatter plot; x-axis: var1, y-axis: var2]
[Figure: scatter plot with points colored by var3 (legend: 0 to 5); x-axis: var1, y-axis: var2]
var1 var2 var3
2 2 5
3 4 0
5 8 4
7 5 1
ggplot(dat, aes(x=var1, y=var2)) + geom_point()
ggplot(dat, aes(x=var1, y=var2, color=var3)) + geom_point()
[Figure: scatter plot; x-axis: var1, y-axis: var2]
Setting data values to aesthetics
var1 var2 var3
2 2 5
3 4 0
5 8 4
7 5 1
ggplot(dat, aes(x=var1, y=var2)) + geom_point(color="red")
ggplot(dat, aes(x=var1, y=var2)) + geom_point(color="red", size=6)
[Figure: scatter plot; x-axis: var1, y-axis: var2]
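The difference between mapping and setting comes down to whether the value sits inside aes(). The sketch below rebuilds the four-row dat table from the slides and shows both forms, plus a common mistake. The plot variable names are invented for illustration.

```r
library(ggplot2)

# The four-row example table from the slides.
dat <- data.frame(
  var1 = c(2, 3, 5, 7),
  var2 = c(2, 4, 8, 5),
  var3 = c(5, 0, 4, 1)
)

# Mapping: color depends on the data, so it goes inside aes().
p_map <- ggplot(dat, aes(x = var1, y = var2, color = var3)) + geom_point()

# Setting: color is a constant, so it goes outside aes().
p_set <- ggplot(dat, aes(x = var1, y = var2)) + geom_point(color = "red")

# Common mistake: a constant inside aes() is treated as data. The points get
# a default discrete color and a legend entry labeled "red".
p_oops <- ggplot(dat, aes(x = var1, y = var2)) + geom_point(aes(color = "red"))
```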
Different geoms
[Figure: scatter plot; x-axis: var1, y-axis: var2]
ggplot(dat, aes(x=var1, y=var2)) + geom_point()
[Figure: line plot; x-axis: var1, y-axis: var2]
ggplot(dat, aes(x=var1, y=var2)) + geom_line()
[Figure: bar chart; x-axis: var1, y-axis: var2]
ggplot(dat, aes(x=var1, y=var2)) + geom_bar(stat="identity")
Using multiple geoms
[Figure: points connected by a line; x-axis: var1, y-axis: var2]
ggplot(dat, aes(x=var1, y=var2)) + geom_point() + geom_line()
# Equivalent to:
ggplot(dat) + geom_point(aes(x=var1, y=var2)) + geom_line(aes(x=var1, y=var2))
ggplot() + geom_point(aes(x=var1, y=var2), data=dat) + geom_line(aes(x=var1, y=var2), data=dat)
Default data
Default mapping
Override defaults in each geom
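To make the idea of overriding defaults concrete, here is a small sketch in which one layer inherits the default data and mapping while another layer supplies its own. The subset condition var2 > 4 and the point size are arbitrary choices for the example.

```r
library(ggplot2)

# The four-row example table from the slides.
dat <- data.frame(
  var1 = c(2, 3, 5, 7),
  var2 = c(2, 4, 8, 5),
  var3 = c(5, 0, 4, 1)
)

p <- ggplot(dat, aes(x = var1, y = var2)) +   # default data and mapping
  geom_line() +                               # inherits both defaults
  geom_point(
    data = dat[dat$var2 > 4, ],               # this layer sees only a subset
    aes(color = var3),                        # and adds its own mapping
    size = 4
  )
```

The line is drawn through all four rows, while the points appear only for the two rows in the subset.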
        Discrete                    Continuous
Color   Rainbow of colors           Gradient from light blue to dark blue
Size    Discrete size steps         Linear mapping between radius and value
Shape   Different shape for each    Shouldn't work
[Figure: scatter plot with points colored by var2 (legend: G0, G1, G2); x-axis: var1 (A, B), y-axis: var3]
[Figure: scatter plot; x-axis: var1 (A, B), y-axis: var3]
ggplot(dat2, aes(x=var1, y=var3)) + geom_point()
Mapping discrete variables
ggplot(dat2, aes(x=var1, y=var3, color=var2)) + geom_point()
var1 var2 var3
A G1 5
B G0 0
A G2 4
B G1 1
A G0 6
B G2 3
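Whether an aesthetic is treated as discrete or continuous depends on the type of the variable mapped to it. The sketch below rebuilds dat2 from the table and maps color three ways. The plot variable names are hypothetical, and the factor() call is just one way to force discrete treatment.

```r
library(ggplot2)

# The six-row example table from the slides.
dat2 <- data.frame(
  var1 = c("A", "B", "A", "B", "A", "B"),
  var2 = c("G1", "G0", "G2", "G1", "G0", "G2"),
  var3 = c(5, 0, 4, 1, 6, 3)
)

# var2 is character, so ggplot2 treats it as discrete: one color per group.
p_discrete <- ggplot(dat2, aes(x = var1, y = var3, color = var2)) + geom_point()

# var3 is numeric, so the same aesthetic becomes a continuous gradient.
p_continuous <- ggplot(dat2, aes(x = var1, y = var3, color = var3)) + geom_point()

# Wrapping a numeric column in factor() forces the discrete treatment.
p_forced <- ggplot(dat2, aes(x = var1, y = var3, color = factor(var3))) + geom_point()
```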
Data wrangling with modern R
Tidyverse = Tidy + universe
Source: https://www.flickr.com/photos/rubbermaid/7203340384
Source: http://hubblesite.org/newscenter/archive/releases/2014/27/image/a/
faithful
as.tbl(faithful) # as.tbl() is deprecated in recent dplyr releases; as_tibble(faithful) is the current equivalent
Tibbles
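As a brief illustration of what a tibble changes, the sketch below converts faithful with as_tibble() from the tibble package and compares it with the original data frame. The variable name tb is arbitrary.

```r
library(tibble)

tb <- as_tibble(faithful)

class(faithful)  # "data.frame"
class(tb)        # "tbl_df" "tbl" "data.frame"

# Printing a tibble shows only the first 10 rows plus each column's type,
# so a large table does not flood the console.
tb

# A tibble is still a data frame, so existing functions keep working.
is.data.frame(tb)
mean(tb$waiting)
```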
Tidy data
Each variable is in a column
Each observation is in a row
Example of non-tidy data
subject sex cond1 cond2 cond3
1 M 7.9 12.3 10.7
2 F 6.3 10.6 11.1
3 F 9.5 13.1 13.8
4 M 11.5 13.4 12.9
Each row has 3 observations
Not Tidy
Converting to tidy data
subject sex cond1 cond2 cond3
1 M 7.9 12.3 10.7
2 F 6.3 10.6 11.1
3 F 9.5 13.1 13.8
4 M 11.5 13.4 12.9
subject sex condition value
1 M cond1 7.9
1 M cond2 12.3
1 M cond3 10.7
2 F cond1 6.3
2 F cond2 10.6
2 F cond3 11.1
3 F cond1 9.5
3 F cond2 13.1
3 F cond3 13.8
4 M cond1 11.5
4 M cond2 13.4
4 M cond3 12.9
Not Tidy
Tidy
• filter: Keep rows
• select: Keep columns
• mutate: Add new columns
• arrange: Sort rows
• summarise: Reduce variables
# Traditional R
mpg[mpg$hwy > 30, ]
# dplyr
filter(mpg, hwy > 30)
Filter: get a subset of rows
# AND
filter(mpg, hwy > 30, class == "compact")
filter(mpg, hwy > 30 & class == "compact")
# OR
filter(mpg, hwy > 30 | class == "compact")
Filter: get a subset of rows
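One practical difference between the traditional and dplyr forms shows up with missing values. The sketch below uses a small made-up data frame (df, with columns x and y, not from the slides) to show that base subsetting keeps a row of NA where the condition is NA, while filter() drops it.

```r
library(dplyr)

# Small hypothetical data frame with one missing value.
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

# Base R keeps the row where the condition is NA, filled with NA values.
base_result <- df[df$x > 1, ]

# filter() keeps only rows where the condition is TRUE.
dplyr_result <- filter(df, x > 1)

nrow(base_result)   # 2
nrow(dplyr_result)  # 1

# Comma-separated conditions and & select the same rows.
same_rows <- identical(
  filter(df, x > 0, y != "a"),
  filter(df, x > 0 & y != "a")
)
```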
%>%
filter(mpg, hwy > 30)
mpg %>% filter(hwy > 30)
select(filter(mpg, hwy > 30), model, hwy, class)
mpg %>% filter(hwy > 30) %>% select(model, hwy, class)
mpg %>% filter(hwy > 30) %>% select(model, hwy, class) %>% View()
Piping with %>%
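To check that the nested and piped forms really are the same computation, the following sketch runs both on mpg and compares the results. The variable names nested and piped are only for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Nested calls read inside-out.
nested <- select(filter(mpg, hwy > 30), model, hwy, class)

# The pipe passes the left-hand result as the first argument of the next call,
# so the same steps read top to bottom.
piped <- mpg %>%
  filter(hwy > 30) %>%
  select(model, hwy, class)

all.equal(nested, piped)
```

R 4.1 and later also include a native pipe, |>, which can be used the same way in simple chains like this one.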
# Traditional R
mpg[, c("model", "displ", "cyl", "drv", "class", "hwy")]
# dplyr
select(mpg, model, displ, cyl, drv, class, hwy)
select(mpg, -manufacturer, -fl)
Select: get a subset of columns
# Traditional R
mpg$avg <- (mpg$cty + mpg$hwy)/2
# dplyr
mpg %>% mutate(avg = (cty+hwy)/2)
mpg %>% mutate(
  avg = (cty+hwy)/2,
  ratio = hwy/cty
)
Mutate: add new columns
# Traditional R
mpg[order(mpg$hwy), ]
# dplyr
arrange(mpg, hwy)
Arrange: sort rows
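arrange() sorts in ascending order by default. As a small sketch of two common variations, the block below sorts in descending order with desc() and sorts by more than one column. The result names are invented for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# desc() reverses the order: highest highway mileage first.
by_hwy_desc <- arrange(mpg, desc(hwy))

# Several keys: sort by class, then by descending hwy within each class.
by_class_hwy <- arrange(mpg, class, desc(hwy))

# Base R equivalent of the first call.
base_desc <- mpg[order(mpg$hwy, decreasing = TRUE), ]
```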
# Traditional R
mean(mpg$hwy)
sd(mpg$hwy)
# dplyr
summarise(mpg, hwy_m = mean(hwy))
summarise(mpg,
  hwy_m = mean(hwy),
  hwy_sd = sd(hwy),
  cty_m = mean(cty),
  cty_sd = sd(cty)
)
Summarise: reduce variables
summarise = summarize (dplyr accepts both spellings)
Group operations
Why is this important?
Summarise
subject sex condition value
1 M cond1 7.9
1 M cond2 12.3
1 M cond3 10.7
2 F cond1 6.3
2 F cond2 10.6
2 F cond3 11.1
3 F cond1 9.5
3 F cond2 13.1
3 F cond3 13.8
4 M cond1 11.5
4 M cond2 13.4
4 M cond3 12.9
value
11.1
data %>% summarise(value = mean(value))
Group-wise summarise
subject sex condition value
1 M cond1 7.9
1 M cond2 12.3
1 M cond3 10.7
2 F cond1 6.3
2 F cond2 10.6
2 F cond3 11.1
3 F cond1 9.5
3 F cond2 13.1
3 F cond3 13.8
4 M cond1 11.5
4 M cond2 13.4
4 M cond3 12.9
subject value
1 10.3
2 9.3
3 12.1
4 12.6
data %>% group_by(subject) %>% summarise(value = mean(value))
Group-wise summarise
subject sex condition value
1 M cond1 7.9
1 M cond2 12.3
1 M cond3 10.7
2 F cond1 6.3
2 F cond2 10.6
2 F cond3 11.1
3 F cond1 9.5
3 F cond2 13.1
3 F cond3 13.8
4 M cond1 11.5
4 M cond2 13.4
4 M cond3 12.9
sex condition value
F cond1 7.9
F cond2 11.9
F cond3 12.5
M cond1 9.7
M cond2 12.9
M cond3 11.8
data %>% group_by(sex, condition) %>% summarise(value = mean(value))
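The three summaries can be reproduced with a short script. This sketch rebuilds the tidy example table as a data frame named data (matching the slides) and adds an n() count, which is not on the slides, to show how many rows went into each group mean.

```r
library(dplyr)

# The tidy example table from the slides.
data <- data.frame(
  subject   = rep(1:4, each = 3),
  sex       = rep(c("M", "F", "F", "M"), each = 3),
  condition = rep(c("cond1", "cond2", "cond3"), times = 4),
  value     = c(7.9, 12.3, 10.7, 6.3, 10.6, 11.1,
                9.5, 13.1, 13.8, 11.5, 13.4, 12.9)
)

# No grouping: one row for the whole table.
overall <- data %>% summarise(value = mean(value))

# One row per subject.
by_subject <- data %>% group_by(subject) %>% summarise(value = mean(value))

# One row per sex/condition combination; n() counts the rows in each group.
by_sex_cond <- data %>%
  group_by(sex, condition) %>%
  summarise(value = mean(value), n = n())
```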
Mutate
subject sex condition value
1 M cond1 7.9
1 M cond2 12.3
1 M cond3 10.7
2 F cond1 6.3
2 F cond2 10.6
2 F cond3 11.1
3 F cond1 9.5
3 F cond2 13.1
3 F cond3 13.8
4 M cond1 11.5
4 M cond2 13.4
4 M cond3 12.9
data %>% mutate(norm = value - mean(value))
subject sex condition value norm
1 M cond1 7.9 -3.2
1 M cond2 12.3 1.2
1 M cond3 10.7 -0.4
2 F cond1 6.3 -4.8
2 F cond2 10.6 -0.5
2 F cond3 11.1 0
3 F cond1 9.5 -1.6
3 F cond2 13.1 2
3 F cond3 13.8 2.7
4 M cond1 11.5 0.4
4 M cond2 13.4 2.3
4 M cond3 12.9 1.8
Group-wise mutate
subject sex condition value
1 M cond1 7.9
1 M cond2 12.3
1 M cond3 10.7
2 F cond1 6.3
2 F cond2 10.6
2 F cond3 11.1
3 F cond1 9.5
3 F cond2 13.1
3 F cond3 13.8
4 M cond1 11.5
4 M cond2 13.4
4 M cond3 12.9
data %>% group_by(subject) %>% mutate(norm = value - mean(value))
subject sex condition value norm
1 M cond1 7.9 -2.4
1 M cond2 12.3 2
1 M cond3 10.7 0.4
2 F cond1 6.3 -3
2 F cond2 10.6 1.3
2 F cond3 11.1 1.8
3 F cond1 9.5 -2.6
3 F cond2 13.1 1
3 F cond3 13.8 1.7
4 M cond1 11.5 -1.1
4 M cond2 13.4 0.8
4 M cond3 12.9 0.3
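Group-wise mutate keeps every row and computes mean(value) separately within each group. The sketch below rebuilds the example table and normalizes within subject. The ungroup() step and the check at the end are additions for illustration, not part of the slides.

```r
library(dplyr)

# The tidy example table from the slides.
data <- data.frame(
  subject   = rep(1:4, each = 3),
  sex       = rep(c("M", "F", "F", "M"), each = 3),
  condition = rep(c("cond1", "cond2", "cond3"), times = 4),
  value     = c(7.9, 12.3, 10.7, 6.3, 10.6, 11.1,
                9.5, 13.1, 13.8, 11.5, 13.4, 12.9)
)

normed <- data %>%
  group_by(subject) %>%
  mutate(norm = value - mean(value)) %>%
  ungroup()   # drop the grouping so later steps are not grouped by accident

# Within each subject the normalized values sum to (almost exactly) zero.
normed %>% group_by(subject) %>% summarise(total = sum(norm))
```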
Tidying data with tidyr
Converting to tidy data
subject sex cond1 cond2 cond3
1 M 7.9 12.3 10.7
2 F 6.3 10.6 11.1
3 F 9.5 13.1 13.8
4 M 11.5 13.4 12.9
gather(data, condition, value, cond1:cond3)
subject sex condition value
1 M cond1 7.9
1 M cond2 12.3
1 M cond3 10.7
2 F cond1 6.3
2 F cond2 10.6
2 F cond3 11.1
3 F cond1 9.5
3 F cond2 13.1
3 F cond3 13.8
4 M cond1 11.5
4 M cond2 13.4
4 M cond3 12.9
data
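The gather() call can be tried on the wide table directly. The sketch below rebuilds that table, runs gather() as on the slide, and also shows pivot_longer(), which newer tidyr versions (1.0.0 and later) offer for the same reshaping. Note that gather() stacks the columns one after another, so its rows come out ordered by condition rather than by subject as drawn on the slide.

```r
library(tidyr)

# The wide (non-tidy) table from the slides.
data <- data.frame(
  subject = 1:4,
  sex     = c("M", "F", "F", "M"),
  cond1   = c(7.9, 6.3, 9.5, 11.5),
  cond2   = c(12.3, 10.6, 13.1, 13.4),
  cond3   = c(10.7, 11.1, 13.8, 12.9)
)

# gather() as on the slide: stacks cond1, then cond2, then cond3.
long_gather <- gather(data, condition, value, cond1:cond3)

# pivot_longer() (tidyr >= 1.0.0): same reshaping, rows ordered by subject.
long_pivot <- pivot_longer(data, cond1:cond3,
                           names_to = "condition", values_to = "value")

# spread() (or pivot_wider()) goes back to the wide layout.
wide_again <- spread(long_gather, condition, value)
```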
Thank you!