Intro to GGPLOT2 - University of Alberta · 2019-03-20

Intro to GGPLOT2

March 19, 2019

In this lab you will learn to visualize raw data by plotting exploratory graphics with the ggplot2 package. Unlike final graphs for a publication or thesis, exploratory graphics are usually made very quickly while checking for errors, outliers, distributions, and correlations among variables. The goal is usually to develop a personal understanding of the data and to prioritize tasks for follow-up analysis.

Grammar of ggplot2

ggplot2 is designed for visualizing data stored in data frames. The ‘gg’ in the name stands for ‘grammar of graphics’, and ggplot2 is recognized as one of the three main graphics systems in R.

The most important thing to get used to with ggplot2 is the logical structure of plots. The code you write specifies the connections between the variables in your data and the x and y locations, colors, sizes, shapes, etc. that you see on the screen. In ggplot2, these logical connections between your data and the plot elements are called aesthetic mappings, or just aesthetics.

You begin every plot by telling the ggplot() function what your data is, and then how the variables in this data logically map onto the plot’s aesthetics. Then you take the result and say what general sort of plot you want, such as a scatterplot, a boxplot, or a bar chart. In ggplot2, the overall type of plot is called a geom. Each geom has a function that creates it, and the function’s name follows the pattern “geom_…()”. For example, geom_point() makes scatterplots, geom_bar() makes bar plots, geom_boxplot() makes boxplots, and so on. You combine these two pieces, the ggplot(data, mapping) object and the geom_…(), by literally adding them together in an expression, using the + symbol.

Data, mapping (or aesthetics), and geometry (geom) are the three mandatory components of a ggplot2 plot. As with other functions, the output of ggplot2 can be assigned to an object for further editing. Other optional ggplot2 grammar components can be found in ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham (2016).
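As a minimal sketch of this grammar, the example below builds a plot in two steps and stores each step in an object. The data frame `toy` and its values are made up for illustration and are not part of the lab data.

```r
library(ggplot2)

# Made-up data for illustration only (not the lab data set)
toy <- data.frame(
  MAP = c(420, 450, 480, 510, 550, 600),
  MSP = c(255, 270, 290, 300, 320, 345)
)

# Step 1: data + aesthetic mapping (no geom yet, so no points are drawn)
p <- ggplot(data = toy, mapping = aes(x = MAP, y = MSP))

# Step 2: add a geom to say what kind of plot to draw
p_points <- p + geom_point()

# Printing the object draws the scatter plot
p_points
```

Because `p` is an ordinary R object, the same base can be reused with a different geom, for example `p + geom_line()`.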

A little too complex? Don’t worry; you will get familiar with the grammar system very soon. In this lab, we will use this ggplot2 syntax to plot the following exploratory graphics: histogram (density plot), boxplot, scatterplot (including scatterplot matrix), and barplot.

Data preparation

• For this exercise, use a weather station dataset “AB_Stations.csv” that you can download from the course website. The first three columns specify the weather station ID, as well as the ecosystem and the biome of Alberta in which the weather station is located. This is followed by a number of climate variables that you can use for exploration (MAT = mean annual temperature, MWMT = mean warmest month temperature, MCMT = mean coldest month temperature, MAP = mean annual precipitation, MSP = mean summer precipitation, DRYNESS = an index).

• Load required packages.

# install.packages('ggplot2') # run this first if ggplot2 is not installed
library(ggplot2)

• Import the dataset with the code below, and use the head(), tail(), str() or View() functions to check the imported data table.

dat1 <- read.csv("AB_Stations.csv")
head(dat1, 10)
##    STATION ECOSYS     BIOME  MAT MWMT  MCMT MAP MSP DRYNESS
## 1   300114   G-NF Grassland  2.4 17.5 -24.9 443 287      17
## 2   301449  G-DMG Grassland  4.5 18.8 -23.1 415 257       3
## 3   302343   G-MG Grassland  4.9 18.4 -23.3 429 258      12
## 4   302369  G-DMG Grassland  5.1 18.3 -24.3 405 254       1
## 5   302789   G-NF Grassland  2.8 17.4 -22.2 431 291      14
## 6   304155   B-AP    Boreal -0.9 17.4 -30.4 480 292      45
## 7   304642  B-UBH    Boreal -0.5 16.3 -29.0 511 330      52
## 8   305076    M-M   Montane  2.8 15.1 -21.3 550 319      42
## 9   305773   B-CP    Boreal  3.3 17.3 -24.5 488 309      22
## 10  306733   B-KU    Boreal -3.5 16.5 -32.3 431 260      60

5.1. Histograms

One useful plot type for exploring raw data is the histogram. Histograms are commonly used to visually check the distribution of continuous variables. The geom for a histogram is geom_histogram(). By default the y axis shows the number of observations in each bin, but y can also be set to density.

• According to the ggplot2 syntax, we can execute the following command to get a histogram for a variable, in this case the variable “DRYNESS”:

hist_a <- ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(color = 'gray90')
hist_a

ggplot2 picks a default binning (30 bins) when generating histograms, but chances are that this default is not the most appropriate one for the histogram you want to make. It is therefore critical to vary the bins to verify that the resulting histogram reflects the data accurately. Too many bins make the histogram overly peaky and lose the overall picture of the distribution, while too few bins hide the details of the distribution. There are two ways to change the bins:


1. First method: set the number of bins you want for the histogram;

ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(bins = 5, color = 'gray90')  # 5 bins
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(bins = 20, color = 'gray90') # 20 bins
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(bins = 50, color = 'gray90') # 50 bins

[Figure: three histograms of DRYNESS with 5, 20, and 50 bins; x axis DRYNESS, y axis count]

2. Second method: set the width of bins:

ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(binwidth = 1, color = 'gray90')
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(binwidth = 5, color = 'gray90')
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(binwidth = 10, color = 'gray90')

[Figure: three histograms of DRYNESS with bin widths of 1, 5, and 10; x axis DRYNESS, y axis count]
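If you want a starting point rather than pure trial and error, one common rule of thumb (not part of this lab) is the Freedman-Diaconis rule, which sets the bin width to 2 × IQR / n^(1/3). The sketch below applies it to made-up, right-skewed values standing in for DRYNESS; the object names are hypothetical.

```r
library(ggplot2)

# Made-up, right-skewed values standing in for DRYNESS
set.seed(42)
toy <- data.frame(DRYNESS = rgamma(120, shape = 2, scale = 15))

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(toy$DRYNESS) / nrow(toy)^(1/3)

# Approximate number of bins implied by that width
n_bins <- ceiling(diff(range(toy$DRYNESS)) / fd_width)

hist_fd <- ggplot(toy, aes(x = DRYNESS)) +
  geom_histogram(binwidth = fd_width, color = 'gray90')
hist_fd
```

Treat the result as a first guess only; it is still worth trying a few narrower and wider bins as described above.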

• Histograms are also a great way to visually check the effect of data transformations. In this case, the square-root transformation yields an approximately normal distribution.

hist_b <- ggplot(dat1, aes(x = sqrt(DRYNESS))) + geom_histogram(color = 'gray90')
hist_b
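A normal Q-Q plot is a complementary check to the histogram: if the transformed values are roughly normal, the points fall close to the reference line. This is a sketch on made-up skewed data, using ggplot2's stat_qq() and stat_qq_line(); it is an addition to the lab, not one of its steps.

```r
library(ggplot2)

# Made-up, right-skewed values standing in for DRYNESS
set.seed(1)
toy <- data.frame(DRYNESS = rgamma(120, shape = 2, scale = 15))

# Q-Q plot of the raw values: skewed data curve away from the line
qq_raw <- ggplot(toy, aes(sample = DRYNESS)) +
  stat_qq() + stat_qq_line()

# Q-Q plot after the square-root transformation
qq_sqrt <- ggplot(toy, aes(sample = sqrt(DRYNESS))) +
  stat_qq() + stat_qq_line()

qq_raw
qq_sqrt
```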

• You can also fill the bins with colors by group/class. In many scenarios we have multiple distributions that we would like to visualize simultaneously. For example, do the biomes have similar dryness? One commonly used visualization strategy is to stack the bars on top of each other and fill the histogram with a different color for each group:

hist_c <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram()

• Although counts are used for the y axis by default, you can change the y axis to density. When sample sizes differ among groups/classes, density histograms may show a different pattern from frequency histograms.


hist_d <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram(aes(y = after_stat(density))) # specify y as density (ggplot2 >= 3.3.0; older versions use ..density..)

[Figure: stacked histograms of DRYNESS filled by BIOME (Boreal, Grassland, Montane). Panel hist_c: y axis count. Panel hist_d: y axis density.]

• The biggest disadvantage of the stacked histogram is that it is hard to quantify each group. For example, how many Boreal samples have DRYNESS values around 40? About 15, or 8? It is also hard to compare distributions among groups. One solution is to change the position of the bars: dodging preserves the vertical position of a geom while adjusting its horizontal position.

• Histograms have been a popular visualization option since at least the 18th century, in part because they are easily generated by hand. More recently, as extensive computing power has become available in everyday devices such as laptops and cell phones, we see them increasingly being replaced by density plots. In a density plot, we attempt to visualize the underlying probability distribution of the data by drawing an appropriate continuous curve.

[Figure: two panels of DRYNESS by BIOME (Boreal, Grassland, Montane). First panel: dodged histogram, y axis count. Second panel: density plot, y axis density.]
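The handout shows these two plots only as figures, so the code below is a sketch of how they could be produced, not the original commands. It uses made-up data; the object names `hist_e` and `hist_f` are assumptions (the later code refers to `hist_f` without showing its definition), and `binwidth = 10` is an arbitrary choice.

```r
library(ggplot2)

# Made-up dryness values for three groups of unequal size
set.seed(7)
toy <- data.frame(
  BIOME = rep(c('Boreal', 'Grassland', 'Montane'), times = c(60, 30, 20)),
  DRYNESS = c(rnorm(60, 45, 15), rnorm(30, 15, 8), rnorm(20, 40, 12))
)

# Dodged histogram: bars of each group sit side by side within a bin
hist_e <- ggplot(toy, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram(binwidth = 10, position = 'dodge')

# Density plot: one smoothed, semi-transparent curve per group
hist_f <- ggplot(toy, aes(x = DRYNESS, fill = BIOME)) +
  geom_density(alpha = 0.4)

hist_e
hist_f
```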

Similarly, we can fill the density curves with different colors. The alpha argument controls the transparency of the color, with values ranging from 0 (fully transparent) to 1 (fully opaque). You can also combine multiple geoms in one plot:

# hist_f is the filled density plot object from the step above
hist_f + geom_histogram(aes(y = after_stat(density)), alpha = 0.6, position = 'dodge')

5.2. Scatter plots

With scatter plots you can visually check the relationships among variables. Are they linear or curvilinear? Outliers are also easily visible.

• Now, try to use a scatter plot to visually check the relationships among variables and to identify outliers. To check the relationship between Mean Annual Precipitation (MAP) and Mean Summer Precipitation (MSP), we can use them as x and y in the plot, respectively (normally the y axis is for the dependent variable and the x axis for the independent variable, but in this case it’s fine to swap the axes). The geom function for scatter plots is geom_point().

ggplot(dat1, aes(x = MAP, y = MSP)) + geom_point()

• Cool! There seems to be a positive relationship between these two variables. However, overlapping points can make the plot harder to interpret. One easy fix is to make the points semi-transparent.

plt <- ggplot(dat1, aes(x = MAP, y = MSP)) + geom_point(alpha = .3)
plt

[Figure: scatter plot of MSP (y axis) against MAP (x axis) with semi-transparent points]

• Besides changing the transparency, shifting the point positions by adding a little random noise (counterintuitive as that sounds) can help you see each point.

plt_jittered <- ggplot(dat1, aes(x = MAP, y = MSP)) +
  geom_point(position = "jitter")

plt_jittered
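With position = "jitter" the amount of noise is chosen automatically and changes on every run. As a sketch on made-up data, position_jitter() lets you set the horizontal and vertical spread (in data units, applied in both directions) and a seed so that the plot is reproducible. The specific width, height, and seed values here are arbitrary.

```r
library(ggplot2)

# Made-up data with exact duplicates that would overplot
toy <- data.frame(
  MAP = c(450, 450, 450, 500, 500, 550),
  MSP = c(280, 280, 280, 300, 300, 320)
)

# Control the jitter explicitly and fix the random seed
plt_jit <- ggplot(toy, aes(x = MAP, y = MSP)) +
  geom_point(position = position_jitter(width = 5, height = 3, seed = 123),
             alpha = 0.5)
plt_jit
```

Keep the jitter small relative to the spacing of the data, since it deliberately moves points away from their true values.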

• You can also add labels to your plot with the geom function geom_text(). In this case, we want to label the points with their STATION ID. hjust and vjust are used to control the placement of the labels.

plt_label <- ggplot(dat1, aes(x = MAP, y = MSP, label = STATION)) +
  geom_point() +
  geom_text(hjust = 0, vjust = 0, size = 2.2, color = 'gray40')

plt_label

[Figure: scatter plot of MSP (y axis) against MAP (x axis), each point labelled with its STATION ID]

Could you tell the STATION ID of the two outliers around the lower right corner of the plot?

• However, do all BIOME types follow the same relationship between MAP and MSP? To figure it out, we need to add a visual aid that separates these types (e.g., color or shape).

[Figure: two scatter plots of MSP (y axis) against MAP (x axis); the first has a BIOME legend (Boreal, Grassland, Montane)]
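The handout shows this step only as figures. A sketch of how the groups could be separated, on made-up data: map BIOME to color and shape inside aes(), or, as an alternative that is an assumption on my part, give each biome its own panel with facet_wrap().

```r
library(ggplot2)

# Made-up precipitation values for three biomes
toy <- data.frame(
  BIOME = rep(c('Boreal', 'Grassland', 'Montane'), each = 4),
  MAP = c(430, 480, 500, 520, 400, 415, 430, 445, 550, 600, 650, 700),
  MSP = c(260, 295, 310, 330, 250, 257, 260, 290, 320, 335, 350, 370)
)

# Option 1: one panel, groups distinguished by color and point shape
plt_color <- ggplot(toy, aes(x = MAP, y = MSP, color = BIOME, shape = BIOME)) +
  geom_point(size = 2)

# Option 2: one panel per biome
plt_facet <- ggplot(toy, aes(x = MAP, y = MSP)) +
  geom_point() +
  facet_wrap(~ BIOME)

plt_color
plt_facet
```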

5.3. Box plots

• Just like scatter plots, boxplots are a good way to visually check the relationship between two variables. If one variable is continuous (as y) and the other is categorical (as x), then a boxplot is a good option, for instance to understand the general distribution of mean annual temperature (MAT) within each biome type (BIOME). The geom for boxplots is geom_boxplot().

ggplot(dat1, aes(x = BIOME, y = MAT)) + geom_boxplot()

If you still have time, you can add the following arguments within the parentheses of geom_boxplot() and see what they do: varwidth = T, notch = T

ggplot(dat1, aes(x = BIOME, y = MAT)) + geom_boxplot(varwidth = T) # the width of each box reflects the sample size
ggplot(dat1, aes(x = BIOME, y = MAT)) + geom_boxplot(notch = T)    # add a notch to the box(es)

[Figure: boxplots of MAT (y axis) by BIOME (x axis: Boreal, Grassland, Montane)]

Similarly, we can make a boxplot of mean annual temperature (MAT) by ecosystem (ECOSYS).

ggplot(dat1, aes(x = ECOSYS, y = MAT)) + geom_boxplot()

[Figure: boxplots of MAT (y axis) by ECOSYS (x axis), with overlapping ecosystem labels]

Since the ecosystem names take up space and can easily overlap, we prefer to put ECOSYS on the y axis and draw horizontal boxplots:

ggplot(dat1, aes(x = ECOSYS, y = MAT)) +
  geom_boxplot() + coord_flip() # horizontal: flip the x, y axes

# Great! Now you can color the boxplots based on their BIOME types:
ggplot(dat1, aes(x = ECOSYS, y = MAT, fill = BIOME)) +
  geom_boxplot(varwidth = T) + coord_flip() # colored by BIOME groups

[Figure: horizontal boxplots of MAT (x axis) by ECOSYS (y axis). Panel a: plain boxplots. Panel b: boxplots filled by BIOME (Boreal, Grassland, Montane).]

• Boxplots are generally useful, but they only summarize five numbers of the sample (minimum, maximum, and the 25th, 50th, and 75th percentiles). To show more detail about the distribution, we can add (jittered) points or use a violin plot as an alternative.

ggplot(dat1, aes(x = BIOME, y = MAT)) +
  geom_violin() + geom_boxplot(width = .1)

ggplot(dat1, aes(x = BIOME, y = MAT)) +
  geom_boxplot() + geom_point(position = 'jitter', alpha = .2, size = 2)

[Figure: two panels of MAT (y axis) by BIOME (x axis). First panel: violin plots with narrow boxplots inside. Second panel: boxplots with jittered points.]

The first command narrows the boxplots and overlays them on the violin plots, and the second one overlays the individual points on the boxplots.
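To see the five numbers behind a boxplot directly, they can be computed per group with base R. This sketch uses made-up MAT values; note that geom_boxplot() by default draws whiskers only to the most extreme points within 1.5 × IQR of the box and plots anything beyond as individual points, so the whisker ends match the minimum and maximum only when there are no such outlying points.

```r
# Made-up MAT values for two biomes
toy <- data.frame(
  BIOME = rep(c('Boreal', 'Grassland'), each = 5),
  MAT = c(-3, -1, 0, 1, 3, 2, 3, 4, 5, 6)
)

# Minimum, quartiles, and maximum for each biome
five_num <- tapply(toy$MAT, toy$BIOME, quantile,
                   probs = c(0, 0.25, 0.5, 0.75, 1))

five_num[['Boreal']]
five_num[['Grassland']]
```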


• Well done! So far we have analyzed only one continuous variable at a time. Can we visualize multiple variables in one plot?

First, we apply the gather() function (from the tidyr package) to transform the data frame from wide to long format. In ggplot2, x or y must each be determined by a single variable. Therefore, we first gather the variables of interest into one column, and then use the new data table for plotting. For example, let’s make a boxplot to compare the distributions of mean annual temperature (MAT), mean warmest month temperature (MWMT) and mean coldest month temperature (MCMT) across the three BIOME types.

# install.packages('tidyr')
library(tidyr)
dat2 <- gather(dat1, key = 'temp', value = 'value', MAT, MCMT, MWMT)
head(dat2, 3) # quick check of the new data table
##   STATION ECOSYS     BIOME MAP MSP DRYNESS temp value
## 1  300114   G-NF Grassland 443 287      17  MAT   2.4
## 2  301449  G-DMG Grassland 415 257       3  MAT   4.5
## 3  302343   G-MG Grassland 429 258      12  MAT   4.9
ggplot(dat2, aes(x = temp, y = value, fill = BIOME)) +
  geom_boxplot() # using different colors for BIOME types

[Figure: boxplots of value (y axis) by temp (x axis: MAT, MCMT, MWMT), filled by BIOME (Boreal, Grassland, Montane)]
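In recent versions of tidyr (1.0.0 and later), gather() still works but is superseded by pivot_longer(). The sketch below shows the equivalent reshaping on a small made-up table; the station IDs are invented. The rows come out in a different order than with gather() (all variables of one station first), which does not affect the plot.

```r
library(tidyr)

# Small made-up wide table with three temperature columns
toy <- data.frame(
  STATION = c(1, 2),
  BIOME = c('Grassland', 'Boreal'),
  MAT = c(2.4, -0.9),
  MWMT = c(17.5, 17.4),
  MCMT = c(-24.9, -30.4)
)

# Wide to long: one row per station and temperature variable
toy_long <- pivot_longer(toy,
                         cols = c(MAT, MCMT, MWMT),
                         names_to = 'temp',
                         values_to = 'value')
toy_long
```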

Looks nice! If the temperature variables were treatments instead, responses that differ among and within groups would be a strong hint of an interaction.

5.4. Multi-panel scatter plots in R

So far in this lab we have learned 1-dimensional (histogram, density plot) and 2-dimensional (scatter plot, boxplot) exploratory graphics, which can normally analyze only one variable or one pair of variables at a time. If you have a data table with 10 potential independent variables, plotting them one by one is not efficient. To get a general idea of the relationships among variables in a very short time:

# requires the ggpairs() function of the GGally package
# install.packages('GGally')
library(GGally)
ggpairs(dat1[, c('MAT', 'MAP', 'MSP', 'DRYNESS', 'BIOME')],
        aes(color = BIOME))

[Figure: ggpairs plot matrix of MAT, MAP, MSP, DRYNESS and BIOME, colored by BIOME (Boreal, Grassland, Montane), with overall and per-biome correlation coefficients in the upper panels]

Voilà. Now you can see the plot matrix for MAT, MAP, MSP, DRYNESS and BIOME, with different colors distinguishing the BIOME types.
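If you want the correlation coefficients as numbers rather than reading them off the plot, they can be computed with base R. This is a sketch on made-up data; it uses Pearson correlation, the default of cor(), which I assume matches the coefficients shown in the plot matrix.

```r
# Made-up precipitation values for two biomes
toy <- data.frame(
  BIOME = rep(c('Boreal', 'Grassland'), each = 4),
  MAP = c(430, 480, 500, 520, 400, 415, 430, 445),
  MSP = c(260, 295, 310, 330, 250, 257, 260, 290)
)

# Overall correlation between MAP and MSP
cor_all <- cor(toy$MAP, toy$MSP)

# Correlation within each biome
cor_by_biome <- sapply(split(toy, toy$BIOME),
                       function(d) cor(d$MAP, d$MSP))

cor_all
cor_by_biome
```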

5.5. Barplots

Bar charts and dot plots are used with summary statistics to compare class means, e.g. from experimental treatments or from sampling sites. In either case, you don’t have a sensible scale on your x-axis, i.e. the variable is a factor, such as “Control”, “Nitrogen”, “Phosphorus”, “Nitrogen & Phosphorus”. Because the bars/dots can be grouped, these graph types work well for factorial experiments with multiple treatments (or for hierarchical sampling designs).

Let’s use the other dataset (‘lentils_summary.csv’) contained in the zip file for barplots. The data set holds lentil yield measurements of three varieties from two farms; the average yield and its standard deviation can be found in the columns ‘mYIELD’ and ‘sdYIELD’, respectively.

Once you load the data set, try out the following code for bar charts:

dat3 <- read.csv('lentils_summary.csv', header = T)

ggplot(dat3, aes(x = FARM, y = mYIELD, fill = VARIETY)) +
  geom_col()
ggplot(dat3, aes(x = FARM, y = mYIELD, fill = VARIETY)) +
  geom_col(position = 'dodge')

You can add an outline to enhance the contrast:

a <- ggplot(dat3, aes(x = FARM, y = mYIELD, fill = VARIETY)) +
  geom_col(position = 'dodge', color = 'black') +
  scale_fill_brewer(palette = 'Pastel2')

a

Next, let’s add error bars to represent uncertainty. As an example, we draw error bars for the standard deviation of yield. Note that we now need a slightly different position indicator. The default dodge amount from the main plot is 0.9. You can play with the number to see what it does.

a + geom_errorbar(aes(ymin = mYIELD - sdYIELD, ymax = mYIELD + sdYIELD),
                  width = 0.3, position = position_dodge(width = 0.9))

## You could also do the error bars just on one side. We’ll save that plot to object b
b <- a + geom_errorbar(aes(ymin = mYIELD, ymax = mYIELD + sdYIELD),
                       width = 0.3, position = position_dodge(width = 0.9))
b

[Figure: two dodged bar charts of mYIELD (y axis) by FARM (x axis: Farm1, Farm2), filled by VARIETY (A, B, C), with two-sided and one-sided error bars]
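The lab provides the means and standard deviations ready-made. If you start from plot-level measurements instead, a summary table with the same column layout could be computed as in this sketch. The raw data frame and its YIELD values are made up, and the object names are hypothetical.

```r
# Made-up raw yields: two plots per variety on each farm
raw <- data.frame(
  FARM = rep(c('Farm1', 'Farm2'), each = 6),
  VARIETY = rep(rep(c('A', 'B', 'C'), each = 2), times = 2),
  YIELD = c(400, 420, 600, 640, 580, 620, 300, 340, 500, 520, 480, 540)
)

# Mean and standard deviation for each FARM x VARIETY combination
m <- aggregate(YIELD ~ FARM + VARIETY, data = raw, FUN = mean)
s <- aggregate(YIELD ~ FARM + VARIETY, data = raw, FUN = sd)
names(m)[3] <- 'mYIELD'
names(s)[3] <- 'sdYIELD'

# Combine into one table with columns FARM, VARIETY, mYIELD, sdYIELD
lentil_sum <- merge(m, s, by = c('FARM', 'VARIETY'))
lentil_sum
```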

Then, you can add letters that indicate significance. Different letters indicate a significant difference between treatments. If they have the same letter, the difference is not statistically significant. We place the labels on top of the error bars:

c <- b + geom_text(aes(x = FARM, y = mYIELD + sdYIELD + 30, group = VARIETY,
                       label = c('a', 'b', 'b', 'a', 'b', 'b')),
                   position = position_dodge(width = 0.9))

c

Another common customization is to change the order of your treatments. For example, you may want to start with Farm2 rather than Farm1:

# easy to change the order of class levels:
c + scale_x_discrete(limits = c('Farm2', 'Farm1'))

# You can customize your labels, as usual, and if you have long labels,
# it’s sometimes handy to put your labels on an angle
c + labs(y = "Yield (kg/ha)", x = "", title = "Lentil Plots") +
  theme(axis.text.x = element_text(angle = 45),
        axis.text.y = element_text(angle = 0))

# You can flip the chart and apply themes
c + coord_flip() + theme_minimal()

[Figure: three versions of the lentil bar chart with significance letters, filled by VARIETY (A, B, C). First panel: farms reordered (Farm2, Farm1). Second panel: title “Lentil Plots”, y axis “Yield (kg/ha)”, angled x-axis labels. Third panel: flipped axes with the minimal theme.]
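scale_x_discrete(limits = ...) reorders the axis for one plot only. An alternative, sketched here on made-up data, is to set the factor levels in the data frame itself so that every later plot uses the same order. The data frame `dat` and its values are invented for illustration.

```r
library(ggplot2)

# Made-up summary values
dat <- data.frame(
  FARM = c('Farm1', 'Farm2', 'Farm1', 'Farm2'),
  VARIETY = c('A', 'A', 'B', 'B'),
  mYIELD = c(410, 320, 620, 510)
)

# Put Farm2 first by fixing the order of the factor levels
dat$FARM <- factor(dat$FARM, levels = c('Farm2', 'Farm1'))
levels(dat$FARM)

# Any plot made from dat now shows Farm2 before Farm1
p_ordered <- ggplot(dat, aes(x = FARM, y = mYIELD, fill = VARIETY)) +
  geom_col(position = 'dodge')
p_ordered
```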

5.6. Dot plots for factorial designs

Dot charts can be used as an alternative to bar charts. They are generally easier to read when you have many treatments, due to their high data-to-ink ratio. Also, they do not need to start at 0 to convey the correct sense of the treatment effect (a problem with bar charts). Dot plots are also more compact, which can be useful if you have large experiments with many treatments or treatment levels. For the small factorial lentil experiment, it does not really matter much, but we’ll use it anyway as an illustration.

Many of the options that you have learned above can be used here as well. Change the numbers in the code below to see what happens. Try additional commands from the section above. You can also save whatever you like into objects (like a, b, c, d above), so that you don’t have to re-run the code and can keep it simple to look at.

ggplot(data = dat3, aes(x = FARM, y = mYIELD, color = VARIETY)) +
  geom_errorbar(aes(ymin = mYIELD - 1.96 * sdYIELD, ymax = mYIELD + 1.96 * sdYIELD,
                    color = VARIETY), width = 0.1) +
  geom_vline(xintercept = c(1, 2), linetype = 'dashed',
             color = 'gray60', alpha = 0.8) +
  scale_color_brewer(palette = 'Set2') +
  geom_point(size = 4) + coord_flip() + theme_bw()

[Figure: dot plot of mYIELD (x axis) by FARM (y axis: Farm1, Farm2) with error bars, colored by VARIETY (A, B, C)]
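In the plot above, the three varieties of a farm share the same position on the FARM axis, so their error bars can overlap. As a sketch on made-up summary values, passing the same position_dodge() to both the error bars and the points spreads the varieties apart. The dodge width of 0.4 and the ±1 standard deviation bars are arbitrary choices for this illustration.

```r
library(ggplot2)

# Made-up summary values in the same layout as the lentil summary table
toy <- data.frame(
  FARM = rep(c('Farm1', 'Farm2'), each = 3),
  VARIETY = rep(c('A', 'B', 'C'), times = 2),
  mYIELD = c(410, 620, 600, 320, 510, 510),
  sdYIELD = c(30, 45, 40, 35, 25, 50)
)

# Use one shared dodge so points and error bars stay aligned
pd <- position_dodge(width = 0.4)

dot_dodged <- ggplot(toy, aes(x = FARM, y = mYIELD, color = VARIETY)) +
  geom_errorbar(aes(ymin = mYIELD - sdYIELD, ymax = mYIELD + sdYIELD),
                width = 0.1, position = pd) +
  geom_point(size = 4, position = pd) +
  coord_flip() + theme_bw()
dot_dodged
```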
