Data Wrangling and Visualisation in R · ggplot2 R package ggplot2 is a powerful data...

Data Wrangling and Visualisation in R STAT3022 Applied Linear Models Lecture 2 2020/02/14 Today 1. Using ggplot2 for data visualisation. 2. Using dplyr and tidyr for data wrangling.


Data Wrangling and Visualisation in R

STAT3022 Applied Linear Models, Lecture 2

2020/02/14

Today

1. Using ggplot2 for data visualisation.

2. Using dplyr and tidyr for data wrangling.


Data Visualisation

Why make your graphs in R?

The graphs are easily reproducible.

You can make publication-quality graphs.

How to make your graphs in R?

R has many contributed packages that extend the standard base installation.

Today we will learn about the ggplot2 R package.

What is base?

Can you name some functions in base that generate a graph?
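As one possible answer to the question above, here is a minimal sketch of three plotting functions that ship with a standard R installation. Strictly speaking they live in the graphics package, which is loaded by default alongside base. The variables plotted are an assumption chosen only for illustration, using the built-in iris data.

```r
# Plotting functions available without installing any contributed package.
# The choice of iris variables here is purely illustrative.

# Scatter plot of two numeric variables.
plot(iris$Sepal.Length, iris$Sepal.Width)

# Histogram; the binned counts are returned invisibly.
h <- hist(iris$Sepal.Width)

# Box plot by group; the summary statistics are returned invisibly.
b <- boxplot(Sepal.Width ~ Species, data = iris)
```

The returned objects h and b hold the numbers behind the pictures, which is handy when checking what was drawn.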


ggplot2 R package

ggplot2 is a powerful data visualisation R package with a large community following. It is built on the layered grammar of graphics by Wickham (2008).

One reason it is powerful is its ease of extensibility, which has resulted in many extension packages.

ggplot2 uses qplot or ggplot to make graphics.

qplot is useful for making quick graphs (especially when the data is not in a data.frame), but ggplot is advisable for most occasions.

We will only cover ggplot.

To get started, load the package:

    library(ggplot2) # or library(tidyverse)

Wickham (2008) Practical tools for exploring data and models. PhD Thesis.


Layered Grammar of Graphics

Every ggplot2 object has three key components:

1. Data.

2. A set of aesthetic mappings between variables in the data and visual properties (e.g. color, size).

3. At least one layer describing how to render each observation, usually created with a geom function.

    str(iris)

    'data.frame': 150 obs. of 5 variables:
     $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

    ggplot(data=iris) +
      aes(x=Sepal.Length, y=Sepal.Width) +
      geom_point()

[Figure: scatter plot of Sepal.Width (y-axis) against Sepal.Length (x-axis).]


Every layer has:

1. geom - the geometric object used to display the data, and stat - the statistical transformation to apply to the data for this layer.

2. data and mapping (aesthetics), which are usually inherited from the ggplot() object.

3. position - position in the coordinate system.

    p <- ggplot(iris, aes(Sepal.Length, Sepal.Width))
    p + geom_point() # blank + geom layer

which is shorthand for:

    p + layer(geom="point", stat="identity", position="identity")

Every ggplot object has:

1. Data

2. Aesthetic mapping

3. Layer(s)

The purpose of a layer is to display:

the raw data,

a statistical summary, or

additional metadata such as context, annotations, and references.


Some geom objects

    p <- ggplot(iris, aes(Species, Sepal.Width))
    class(p)

    [1] "gg" "ggplot"

Image source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/

    p + geom_blank()
    p + geom_point()
    p + geom_boxplot()
    p + geom_violin()


Drawing lines

    p <- ggplot(iris, aes(Petal.Length, Petal.Width)) +
      geom_point(colour="gray")

    p + geom_abline(intercept=-0.4, slope=0.4)
    p + geom_smooth(method="lm")
    p + geom_hline(yintercept=0)
    p + geom_vline(xintercept=0)


Distribution by group

    p <- ggplot(iris, aes(Petal.Width, fill=Species))

    p + geom_dotplot()
    p + geom_histogram()
    p + geom_density()
    p + geom_freqpoly(aes(color=Species))


geom

geom              Description
geom_abline       Reference lines: horizontal, vertical, and diagonal
geom_bar          Bar charts
geom_bin2d        Heatmap of 2d bin counts
geom_blank        Draw nothing
geom_boxplot      A box and whiskers plot (in the style of Tukey)
geom_contour      2d contours of a 3d surface
geom_count        Count overlapping points
geom_density      Smoothed density estimates
geom_density_2d   Contours of a 2d density estimate
geom_dotplot      Dot plot
geom_errorbarh    Horizontal error bars
geom_hex          Hexagonal heatmap of 2d bin counts
geom_freqpoly     Histograms and frequency polygons
geom_jitter       Jittered points
geom_crossbar     Vertical intervals: lines, crossbars & errorbars
geom_map          Polygons from a reference map
geom_path         Connect observations
geom_point        Points


Statistical Transformation

    head(iris[, c("Petal.Width", "Species")]) # raw data

      Petal.Width Species
    1         0.2  setosa
    2         0.2  setosa
    3         0.2  setosa
    4         0.2  setosa
    5         0.2  setosa
    6         0.4  setosa

    stat_bin(bins=7, mapping=aes(Petal.Width, fill=Species))

Under the hood, the raw data is transformed into statistics, which are then passed on to the geom; here geom="bar" is the default.

         fill  y count   x xmin xmax density    ncount
    1 #619CFF  0     0 0.0 -0.2  0.2     0.0 0.0000000
    2 #00BA38  0     0 0.0 -0.2  0.2     0.0 0.0000000
    3 #F8766D 34    34 0.0 -0.2  0.2     1.7 1.0000000
    4 #619CFF  0     0 0.4  0.2  0.6     0.0 0.0000000
    5 #00BA38  0     0 0.4  0.2  0.6     0.0 0.0000000
    6 #F8766D 16    16 0.4  0.2  0.6     0.8 0.4705882
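The slide does not show how the table of computed statistics was obtained. One way to inspect it, sketched here under the assumption that the stat_bin() call is attached to a plot of iris, is ggplot2's layer_data() function, which returns the data frame computed for a given layer. Column order and exact values may differ between ggplot2 versions.

```r
library(ggplot2)

# The same stat_bin() call as on the slide, attached to a ggplot object.
p <- ggplot(iris) +
  stat_bin(bins = 7, mapping = aes(Petal.Width, fill = Species))

# layer_data() builds the plot and returns the data computed for layer 1,
# i.e. the binned statistics that are handed to the geom.
binned <- layer_data(p, 1)

head(binned[, c("fill", "count", "xmin", "xmax", "density", "ncount")])
```

Each row is one bin for one species, so the counts across all rows add up to the number of observations in iris.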


Using stat with different geom objects

    p <- ggplot(iris, aes(Petal.Width, fill=Species))

    p + stat_bin()
    p + stat_bin(geom="bar")
    p + stat_bin(geom="point")
    p + stat_bin(geom="line")


stat

stat              Description
stat_count        Bar charts
stat_bin_2d       Heatmap of 2d bin counts
stat_boxplot      A box and whiskers plot (in the style of Tukey)
stat_contour      2d contours of a 3d surface
stat_sum          Count overlapping points
stat_density      Smoothed density estimates
stat_density_2d   Contours of a 2d density estimate
stat_bin_hex      Hexagonal heatmap of 2d bin counts
stat_bin          Histograms and frequency polygons
stat_qq_line      A quantile-quantile plot
stat_quantile     Quantile regression
stat_smooth       Smoothed conditional means
stat_spoke        Line segments parameterised by location, direction and distance
stat_ydensity     Violin plot
stat_sf           Visualise sf objects
stat_ecdf         Compute empirical cumulative distribution
stat_ellipse      Compute normal confidence ellipses
stat_function     Compute function for each x value


Customisation

There are so many ways to customise a ggplot.


Changing Color

There are many color palettes available, e.g.

    library(RColorBrewer)
    ggplot(iris, aes(Petal.Width, fill=Species)) +
      geom_dotplot() +
      scale_fill_brewer(palette="Set3")


Grey-scale

    ggplot(iris, aes(Petal.Width, fill=Species)) +
      geom_dotplot() +
      scale_fill_grey()

Manual scale

    ggplot(iris, aes(Petal.Width, fill=Species)) +
      geom_dotplot() +
      scale_fill_manual(
        values=c("red", "blue", "green"),
        labels=c("setosa", "versicolor", "virginica"))


Color variable is a factor

    ggplot(iris, aes(Petal.Width, Petal.Length, color=Species)) +
      geom_point(size=2) +
      scale_color_brewer(palette="Set1")

Color variable is continuous

    ggplot(iris, aes(Petal.Width, Petal.Length, color=Sepal.Length)) +
      geom_point(size=2) +
      scale_color_distiller(palette="YlGnBu")


Data Wrangling

You may need to wrangle the data to get it in the right form for ggplot (or other purposes).


    library(agridat) # data is inside here
    library(dplyr)   # for data wrangling; loaded together with library(tidyverse)
    str(pearl.kernels) # or glimpse(pearl.kernels)

    'data.frame': 59 obs. of 6 variables:
     $ ear: Factor w/ 4 levels "Ear08","Ear09",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ obs: Factor w/ 15 levels "Obs01","Obs02",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ ys : int 352 322 298 332 305 313 308 311 327 308 ...
     $ yt : int 102 49 75 101 101 100 86 101 101 92 ...
     $ ws : int 52 82 108 71 86 90 95 92 78 95 ...
     $ wt : int 26 79 51 28 40 29 43 28 26 37 ...

We are using the data pearl.kernels loaded from library(agridat).

The data contains the counts of yellow/white and sweet/starchy kernels on each of 4 maize ears by 15 observers.

I want to get the counts for the 8th maize ear by observer 1 (plant pathologist).

    (ear8obs1 <- pearl.kernels %>%
      filter(ear=="Ear08" & obs=="Obs01"))

        ear   obs  ys  yt ws wt
    1 Ear08 Obs01 352 102 52 26

Image source: http://corncommentary.com/2012/05/22/using-the-kfc-kernel-for-cellulosic/

Pearl, Raymond (1911) The Personal Equation in Breeding Experiments Involving Certain Characters of Maize. Biological Bulletin 21, 339-366.


Help!

The data:

    ear8obs1

        ear   obs  ys  yt ws wt
    1 Ear08 Obs01 352 102 52 26

How do I make the graph below in ggplot?

    ggplot(ear8obs1, aes(x=..., y=...)) + geom_bar()

What if the data was shaped as below?

      Type Count  Color  Kernel
    1   ys   352 Yellow Starchy
    2   yt   102 Yellow   Sweet
    3   ws    52  White Starchy
    4   wt    26  White   Sweet

How do I get the data in this shape easily?

    ear8obs1 %>% tidyr::gather("Type", "Count", ys:wt)

        ear   obs Type Count
    1 Ear08 Obs01   ys   352
    2 Ear08 Obs01   yt   102
    3 Ear08 Obs01   ws    52
    4 Ear08 Obs01   wt    26
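Since tidyr 1.0.0, gather() is superseded by pivot_longer(). It still works, but new code is usually written with the newer function. The sketch below reshapes the same single row with pivot_longer(). It is not part of the original slides, and the ear8obs1 data frame is rebuilt by hand from the values shown above so that the example runs without agridat.

```r
library(tidyr)

# One-row data frame rebuilt by hand from the output shown on the slide.
ear8obs1 <- data.frame(
  ear = "Ear08", obs = "Obs01",
  ys = 352L, yt = 102L, ws = 52L, wt = 26L
)

# Wide to long: the four count columns become (Type, Count) pairs.
long <- pivot_longer(
  ear8obs1,
  cols      = ys:wt,
  names_to  = "Type",
  values_to = "Count"
)

long
```

The result is a tibble rather than a plain data.frame, but it holds the same four rows as the gather() output.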


Data Wrangling

Get the counts for the 8th maize ear by observer 1 (plant pathologist):

    maize <- pearl.kernels %>%
      filter(ear=="Ear08" & obs=="Obs01") %>%
      select(ys, yt, ws, wt) %>%
      tidyr::gather("Type", "Count", ys:wt) %>%
      mutate(Color=case_when(
               Type %in% c("ys", "yt") ~ "Yellow",
               Type %in% c("ws", "wt") ~ "White"),
             Kernel=case_when(
               Type %in% c("ys", "ws") ~ "Starchy",
               Type %in% c("yt", "wt") ~ "Sweet"))
    maize

      Type Count  Color  Kernel
    1   ys   352 Yellow Starchy
    2   yt   102 Yellow   Sweet
    3   ws    52  White Starchy
    4   wt    26  White   Sweet


Example: Observer 1 for Maize Ear 8

    ggplot(maize, aes(Kernel, Count, fill=Color)) +
      geom_bar(stat="identity", color="black") +
      scale_fill_manual(values=c("white", "yellow"),
                        labels=c("White", "Yellow")) +
      guides(fill="none") + # hide the fill legend
      theme_minimal(base_size = 20)

Image source: https://agrifarmingtips.com/maize-cultivation-process/


Position for geom_bar

    p2 <- ggplot(maize, aes("", Count, fill=Type))

    # stat="identity" is needed because Count is already a summary
    p2 + geom_bar(stat="identity")
    p2 + geom_bar(stat="identity", position="stack")
    p2 + geom_bar(stat="identity", position="dodge")
    p2 + geom_bar(stat="identity", position="fill")
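To see numerically what the position adjustments do, the sketch below compares "stack" and "fill" using layer_data(). It uses geom_col(), which is ggplot2's shorthand for geom_bar(stat="identity"). The maize data frame is rebuilt by hand from the four counts shown earlier, so this is an illustration under that assumption rather than the lecture's own code.

```r
library(ggplot2)

# The four-row maize summary from the earlier slides, rebuilt by hand.
maize <- data.frame(
  Type  = c("ys", "yt", "ws", "wt"),
  Count = c(352, 102, 52, 26)
)

p2 <- ggplot(maize, aes("", Count, fill = Type))

# geom_col() behaves like geom_bar(stat = "identity").
stacked <- layer_data(p2 + geom_col(position = "stack"))
filled  <- layer_data(p2 + geom_col(position = "fill"))

# With "stack" the bar tops accumulate to the total count;
# with "fill" the same bars are rescaled to proportions of that total.
max(stacked$ymax)
max(filled$ymax)
```

Position "dodge" would instead place the four bars side by side, leaving each bar's height equal to its own count.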


Coordinate system

    p + geom_bar()
    p + geom_bar() + coord_polar(theta="y")
    p + geom_bar() + coord_flip()
    p + geom_bar() + coord_polar(theta="y", direction=-1)

All geom_bar calls include the arguments stat="identity" and color="black".


Overplotting

    g <- ggplot(pearl.kernels, aes(ear, ys, color=ear, size=1)) +
      xlab(NULL) +
      guides(color="none", size="none") + # hide both legends
      ylab("No. of Yellow\n Starchy Kernel")

    g + geom_point()
    g + geom_point(position="jitter")
    g + geom_point(alpha=1 / 3)
    g + geom_point(alpha=1 / 6)


Massaging data to tidy form

    maize_all <- pearl.kernels %>%
      tidyr::gather("Type", "Count", ys:wt) %>%
      mutate(Color=ifelse(substr(Type, 1, 1)=="y", "Yellow", "White"),
             Kernel=ifelse(substr(Type, 2, 2)=="s", "Starchy", "Sweet"),
             obs=factor(as.integer(substring(obs, 4, 5))))

    head(pearl.kernels)

        ear   obs  ys  yt  ws wt
    1 Ear08 Obs01 352 102  52 26
    2 Ear08 Obs02 322  49  82 79
    3 Ear08 Obs03 298  75 108 51
    4 Ear08 Obs04 332 101  71 28
    5 Ear08 Obs05 305 101  86 40
    6 Ear08 Obs06 313 100  90 29

    head(maize_all)

        ear obs Type Count  Color  Kernel
    1 Ear08   1   ys   352 Yellow Starchy
    2 Ear08   2   ys   322 Yellow Starchy
    3 Ear08   3   ys   298 Yellow Starchy
    4 Ear08   4   ys   332 Yellow Starchy
    5 Ear08   5   ys   305 Yellow Starchy
    6 Ear08   6   ys   313 Yellow Starchy


Faceting

    ggplot(maize_all, aes(obs, Count, fill=Type)) +
      geom_bar(stat="identity") +
      xlab("Observer") +
      facet_wrap(~ear)

    ear8 <- maize_all %>%
      filter(ear=="Ear08") %>%
      ggplot(aes(obs, Count, fill=Type)) +
      geom_bar(stat="identity", show.legend=FALSE) +
      labs(tag="(A)", title="Ear 8", x="Observer") +
      facet_grid(Color ~ Kernel)


Patching Plots Together

    library(patchwork)
    ear8 + ear9 + ear10 + ear11 + plot_layout(ncol = 2)
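The slides define ear8 but do not show how ear9, ear10 and ear11 are built. One way to avoid repeating the plotting code is sketched below. plot_ear() is a hypothetical helper introduced only for this illustration, and the tags "(A)" to "(D)" are an assumption extrapolated from the "(A)" tag used for ear 8. The ear names are read from the data rather than typed in.

```r
library(agridat)    # provides pearl.kernels
library(dplyr)
library(tidyr)
library(ggplot2)
library(patchwork)

# Long-form data, built the same way as maize_all in the earlier slide.
maize_all <- pearl.kernels %>%
  gather("Type", "Count", ys:wt) %>%
  mutate(Color  = ifelse(substr(Type, 1, 1) == "y", "Yellow", "White"),
         Kernel = ifelse(substr(Type, 2, 2) == "s", "Starchy", "Sweet"),
         obs    = factor(as.integer(substring(obs, 4, 5))))

# Hypothetical helper (not from any package): one faceted bar chart per ear.
plot_ear <- function(ear_id, tag) {
  maize_all %>%
    filter(ear == ear_id) %>%
    ggplot(aes(obs, Count, fill = Type)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    labs(tag = tag, title = ear_id, x = "Observer") +
    facet_grid(Color ~ Kernel)
}

ears <- levels(pearl.kernels$ear)                   # the four ears in the data
tags <- paste0("(", LETTERS[seq_along(ears)], ")")  # assumed tags "(A)", "(B)", ...

plots <- lapply(seq_along(ears), function(i) plot_ear(ears[i], tags[i]))

# wrap_plots() combines a list of plots, like chaining them with `+`.
combined <- wrap_plots(plots, ncol = 2)
combined
```

wrap_plots() is used instead of `+` because the plots are held in a list; the layout is the same two-column grid as plot_layout(ncol = 2).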


Changing Labels

    g <- ggplot(vargas.wheat1.traits, aes(NGS, yield)) +
      geom_point(size=3) +
      geom_point(aes(colour=gen)) +
      geom_smooth(se=FALSE, method="lm") +
      facet_wrap(~year) +
      labs(colour="Genotype") + # changes the label name for color legend
      labs(x="Number of grains per spikelet") + # same as xlab(..)
      labs(y="Yield (kg/ha)") + # same as ylab(..)
      labs(title="Durum Wheat at Ciudad Obregon, Mexico 1990-1995") + # same as ggtitle(..)
      labs(subtitle="Source: Vargas et al. (1998) Interpreting Genotyp...") # same as ggtitle(subtitle=..)


Theme - customise the look

    g + theme(legend.position="bottom",
              plot.title=element_text(face="bold", size=15),
              plot.subtitle=element_text(face="italic", size=8),
              panel.background=element_rect(fill="white"),
              panel.border=element_rect(colour="grey20", fill=NA),
              panel.grid=element_line(colour="grey92"),
              panel.grid.minor=element_line(size=rel(0.5)),
              strip.background=element_rect(fill="grey85", colour="grey20"),
              legend.key=element_rect(fill="white"))


or use a pre-defined theme:

    g + theme_bw() +
      theme(legend.position="bottom",
            plot.title=element_text(face="bold", size=14),
            plot.subtitle=element_text(face="italic", size=8))


More Pre-Defined Themes

    g + theme_gray()
    g + theme_classic()
    g + theme_minimal()
    g + theme_dark()


Summary

Using functions such as filter and mutate from dplyr to wrangle data.

Using the function gather from tidyr to change the data from wide to long form.

Using ggplot from ggplot2 to make many sorts of plots.

Next lesson

Revisiting simple linear regression.

Maximum likelihood estimation.
