Generalized Pairs Plot

15
 The Generalized Pairs Plot John W. Emerson, Walton A. Green, Barret Schloerke, Dianne Cook, Heike Hofmann, and Hadley Wickham Department of Statistics Yale University, New Haven, CT 06520 Department of Organismic and Evolutionary Biology Harvard University, Cambridge, MA 02138 3 * Department of Statistics Iowa State, Ames, IA 50011 Department of Statistics Rice University, Houston, TX 77251 February 6, 2011 1

description

Generalized Pairs Plot

Transcript of Generalized Pairs Plot

  • The Generalized Pairs Plot

    John W. Emerson, Walton A. Green, Barret Schloerke,Dianne Cook, Heike Hofmann, and Hadley Wickham

    Department of StatisticsYale University, New Haven, CT 06520

    Department of Organismic and Evolutionary BiologyHarvard University, Cambridge, MA 02138

    3 * Department of StatisticsIowa State, Ames, IA 50011

    Department of StatisticsRice University, Houston, TX 77251

    February 6, 2011

    1

  • Authors Footnote:

    John W. Emerson ([email protected]) is Associate Professor, Depart-ment of Statistics, Yale University; P.O. Box 208290, New Haven, CT 06520-8290. Walton A. Green ([email protected]) is Research Associate, De-partment of Organismic and Evolutionary Biology, Harvard University; 26Oxford St, Cambridge, MA 02138. Barret Schloerke ([email protected])is an undergraduate student at Iowa State University; Ames, IA 50011-1210.Dianne Cook ([email protected]) is Full Professor of Statistics at Iowa StateUniversity; Ames, IA 50011-1210. Heike Hofmann ([email protected]) isAssociate Professor of Statistics at Iowa State University; Ames, IA 50011-1210. Hadley Wickham ([email protected]) is Assistant Professor of Statisticsat Rice University; P.O. Box 1892, Houston, Texas 77251-1892.

    2

  • Abstract

    This paper develops a generalization of the scatterplot matrix (also calleda pairs plot or a generalized draftsmans display) based on the recognitionthat most data sets include both categorical and quantitative information.Traditional grids of scatterplots often obscure important features of the datawhen one or more variables are categorical but coded as numerical. Instead,we offer a range of flexible options including variations on the mosaic display(where areas are proportional to counts in a contingency table), boxplots,stripplots, and other depictions of joint and conditional distributions of com-binations of categorical and quantitative variables. The use of these featuresmay reveal structure in multivariate data which otherwise might go unnoticedin the process of exploratory data analysis.

    Keywords: graphics, scatterplot matrix, generalized draftsmans display,mosaic plot, grammar of graphics, exploratory data analysis

    3

  • 1 Introduction

    The practice of graphical representation of quantitative information datesto the 18th century work of Playfair (1786). Modern graphical explorationhas grown largely from the original works of Chernoff (1973), Tukey (1977),Chambers, Cleveland, Kleiner and Tukey (1983), Tufte (1983), and Cleve-land (1985). Since then, some major methodological contributions includeCleveland (1993), Becker, Cleveland and Shyu (1996) and Wilkinson (1999).There have been numerous other contributions, many cited in this paper.Most modern methods of graphical display have been implemented in the Rlanguage and environment for statistical computing (R Development CoreTeam 2005).

    This paper contributes to the development of the pairs plot, first appear-ing, to the best of our knowledge, in Hartigan (1975). It is also referred to asthe generalized draftsmans display by Tukey and Tukey (1981) and Cham-bers et al. (1983), and the scatterplot matrix (SPLOM) by Cleveland (1993)and Basford and Tukey (1999). The pairs plot is a grid of scatterplots show-ing the bivariate relationships between all pairs of variables in a multivariatedata set. Although the authors of this paper (and many other academicsand data analysts) regularly use this graphical display, it isnt clear that it iswidely used in practice. Our informal survey of several statistics texts thatinclude multiple regression revealed inconsistent use of pairs plots.

    Most data sets consist of both quantitative and categorical variables.When all variables (or when all variable of interest) are quantitative, thescatterplot matrix is a natural tool for graphical exploration. Friendly (1994)proposed an alternative form of the plot for displaying pairwise relationshipsamong a set of categorical variables based on the mosaic plot (Hartiganand Kleiner 1984). Emerson, Green and Hartigan (2006) presented the firstknown generalized pairs plot, addressing the need for a more flexible displayof a mixture of quantitative and categorical variables. Though the use ofgeneralized contrasts the original usage of Chambers et al. (1983), the authorsagreed that the name seems most appropriate and should be adopted for thispurpose.

    Section 2 presents the basic design of the generalized pairs plot. Wethen discuss two implementations available in R extension packages gpairs(Emerson and Green 2010) and GGally (Schloerke, Cook, Hofmann andWickham 2010) in Sections 3 and 4, respectively. The former approach wasa methodological development for exploratory data analysis, while the lat-

    4

  • ter focuses on these plots as a contribution to the framework of Wilkinsonsgrammar of graphics as implemented by Wickham (2009). Section 5 con-cludes with a discussion. Supplementary materials available online provideextensive additional examples.1

    2 The generalized pairs plot

    The generalized pairs plot should not be confused with the generalized drafts-mans display of Chambers et al. (1983); we regard the latter as a traditionalpairs plot or scatterplot matrix of quantative information. Figure 1 shows anexample of a scatterplot matrix of Fishers iris data (Fisher 1936), originallycollected by Anderson (1935). Here, the species is treated numerically (1 forSetosa, 2 for Versicolor, and 3 for Virginica). This plot could be improved byusing color to identify the species instead of explicitly including the numer-ical representation of species as a quantitative variable. Doing so uncoversstriking clusterings of petal and sepal meaurements by species, an exerciseleft to the reader.

    When a data set (such as the iris data) includes one or more categori-cal variables the traditional display can be less than ideal. Friendly (1994)proposed a grid of mosaic tiles for displaying sets of entirely categorical vari-ables. Our generalization takes this a step further, recognizing the need fordifferent types of panels that together can display a rich set of features ofa diverse collection of continuous and categorical variables. There are threegeneral types of displays. A panel (or tile, or display) containing a graphicor other summary information corresponding to two quantative variables iscalled purely quantative. A panel for two categorical variables is called purelycategorical. The last type corresponds to one categorical and one quantitativevariable, called a mixed display.2

    Scatterplots are naturally used with two quantitative variables, and vari-ous options may provide information on correlation, missing values, or linearor non-linear fits, for example. Mosaic plots (Hartigan and Kleiner 1984)provide a graphical display of counts in a contingency table for two categor-ical variables where areas are proportional to counts. There are several ways

    1We should consider this last part on the nature of any supplementary materials.2Walton isnt sure formal terminology is needed. He proposes quantitative-quantitative,

    quantitative-categorical, and categorical-categorical, perhaps as an alternative to standardphrasing. Worth debating further before we wordsmith extensively.

    5

  • Sepal.Length

    2.0 3.0 4.0

    ll lll

    l

    ll

    ll

    l

    lll

    l ll

    l

    l

    ll

    ll

    lll lll

    ll

    l ll

    lll

    ll

    lll l

    l lll

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    ll ll

    l

    lll

    lll

    l lll

    l l

    llllll

    l

    l

    ll

    lll

    ll

    l

    l lll

    l

    l

    ll

    l

    ll

    l

    l

    l

    ll

    lll

    l l

    ll

    ll

    l

    l

    l

    l

    ll

    l

    l ll

    lll

    lll

    l

    lll

    lll

    l

    llll l l

    l

    lllll

    l

    ll

    lll

    lll

    llll

    l

    ll

    ll

    llll

    llll

    lll

    lll

    ll

    llll

    llll

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    ll ll

    l

    lll

    lllll

    llll

    llll

    l l

    l

    l

    ll

    lll

    ll

    l

    llll

    l

    l

    ll

    l

    ll

    l

    l

    l

    ll

    lll

    ll

    ll

    ll

    l

    l

    l

    l

    ll

    l

    lll

    lll

    ll l

    l

    lll

    lll

    l

    lllllll

    0.5 1.5 2.5

    lllll

    l

    ll

    lll

    lll

    l ll

    l

    l

    ll

    ll

    lll lllll

    lll

    lll

    ll

    llll

    lllll

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    ll ll

    l

    lll

    lll

    lllll l

    llll

    l l

    l

    l

    ll

    lll

    ll

    l

    llll

    l

    l

    ll

    l

    l l

    l

    l

    l

    ll

    lll

    l l

    ll

    ll

    l

    l

    l

    l

    ll

    l

    lll

    l ll

    lll

    l

    lll

    l ll

    l

    l llll l

    l

    4.5

    6.0

    7.5

    llllll

    ll

    lll

    lll

    llll

    l

    llllllllllll

    lll

    lll

    ll

    llll

    lllll

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    llll

    l

    lll

    lllllllll

    llllll

    l

    l

    ll

    lll

    ll

    l

    llll

    l

    l

    ll

    l

    ll

    l

    l

    l

    ll

    lll

    ll

    ll

    ll

    l

    l

    l

    l

    lll

    lll

    lll

    lll

    l

    lll

    lll

    l

    lllllll

    2.0

    3.0

    4.0

    l

    lll

    ll

    l l

    ll

    ll

    ll

    l

    l

    l

    lll

    lllll

    l

    lllll

    l

    l l

    llll

    l

    ll

    l

    lll

    l

    l

    l

    l

    l ll l

    l

    ll

    l

    l

    ll

    l

    l

    l

    llll

    l

    ll

    l

    ll

    l ll

    lll

    lll

    lll

    ll

    l

    l

    ll

    l

    ll

    lll l

    ll

    l

    llll l

    l

    l

    l

    l

    l

    ll

    ll

    ll

    l

    l

    l

    l

    l ll

    l l

    ll

    ll

    l

    l

    lll

    l

    lll lll

    l

    lll

    l

    l

    l

    lSepal.Width

    l

    lll

    ll

    ll

    ll

    ll

    ll

    l

    l

    l

    lll

    lllll

    l

    llllll

    ll

    llll

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l lll

    l

    ll

    l

    l

    ll

    l

    l

    l

    llll

    l

    ll

    l

    ll

    lll

    lll

    llll l

    l

    ll

    l

    l

    ll

    l

    ll

    llll

    ll

    l

    llll l

    l

    l

    l

    l

    l

    ll

    ll

    ll

    l

    l

    l

    l

    l ll

    ll

    ll

    ll

    l

    l

    lll

    l

    lll lll

    l

    lll

    l

    l

    l

    l

    l

    lll

    ll

    ll

    ll

    ll

    ll

    l

    l

    l

    lll

    lllll

    l

    lllll

    l

    ll

    llll

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l lll

    l

    ll

    l

    l

    ll

    l

    l

    l

    llll

    l

    ll

    l

    ll

    llll

    llllll l

    l

    ll

    l

    l

    ll

    l

    ll

    llll

    ll

    l

    lll ll

    l

    l

    l

    l

    l

    ll

    ll

    ll

    l

    l

    l

    l

    lll

    ll

    ll

    ll

    l

    l

    lll

    l

    lll l ll

    l

    l ll

    l

    l

    l

    l

    l

    lll

    ll

    ll

    ll

    ll

    ll

    l

    l

    l

    lll

    lllll

    l

    llllll

    ll

    llll

    l

    ll

    l

    lll

    l

    l

    l

    l

    l lll

    l

    ll

    l

    l

    ll

    l

    l

    l

    lllll

    ll

    l

    llllllllllllll

    ll

    l

    l

    ll

    l

    ll

    llll

    ll

    l

    lllll

    l

    l

    l

    l

    l

    ll

    ll

    ll

    l

    l

    l

    l

    lll

    ll

    lllll

    l

    lll

    l

    llllll

    l

    lll

    l

    l

    l

    l

    llll lll ll l llll l

    lllll ll

    lllllllll ll lll lll llll

    lll ll ll

    lll

    lll l

    l

    ll

    lll

    l

    llll

    ll

    ll

    llll

    lll

    llll

    ll l lllll l

    ll

    lll l

    l

    l

    ll

    llll

    l

    ll l

    lll

    ll ll

    ll

    ll

    l

    l

    ll l

    lll l

    l ll

    ll

    lll

    lllll

    lllll

    ll

    ll ll lllll l llll l

    llllll l

    llll lllll l llll lll lll l

    l ll ll ll

    lll

    lll l

    l

    ll

    lll

    l

    llll

    ll

    ll

    l lll

    l ll

    llll

    ll lll ll

    l ll

    ll lll

    l

    l

    ll

    llll

    l

    ll l

    lll

    l l ll

    ll

    ll

    l

    l

    lll

    l ll ll lll

    ll

    lllllllll

    ll lll

    Petal.Length

    llllllllllllllllll

    lll ll

    lll lllll lllllllllllllllllll

    lll

    lll l

    l

    ll

    lll

    l

    lllll

    ll

    lll

    lll ll

    llll

    lllllll

    l ll

    lllll

    l

    l

    ll

    ll ll

    l

    ll l

    lll

    l lll

    ll

    ll

    l

    l

    lll

    llll

    lll

    ll

    lll

    ll lll

    l llllll

    13

    57

    llllllllllllllllllllllllllllllllllllllllllllllllll

    lllllll

    l

    llllll

    lllllllllllllll

    llll

    lllllllllllllll

    l

    l

    llllll

    l

    lll

    lllllll

    ll

    lll

    l

    lll

    llllllllllllllllllllllll

    0.5

    1.5

    2.5

    llll lll ll l llll l

    lll ll ll

    ll

    lll

    lllll

    l lll lll llll

    lll ll ll

    ll lll

    ll

    lll

    l

    l

    l

    ll ll

    l

    l

    l

    l

    ll

    l llll

    l

    llll

    ll l lllll

    ll

    llll lll

    l

    ll

    l

    l l

    l ll

    l

    llll

    l l

    l

    ll

    l

    ll l

    ll

    llll

    ll l

    l

    ll

    ll

    lll

    ll

    l

    lll

    lll

    l

    ll ll lllll l llll l

    lll llll

    lllllllll

    lllll lll l

    ll l

    lll ll ll

    llllll

    l

    lll

    l

    l

    l

    ll ll

    l

    l

    l

    l

    ll

    lllll

    l

    llll

    l l lll lll

    ll

    ll lllll

    l

    ll

    l

    ll

    l ll

    l

    llll

    l l

    l

    ll

    l

    lll

    ll

    ll ll

    ll ll

    ll

    l l

    lllll

    l

    ll

    l

    l ll

    l

    lllllllllllllll

    lllllll

    llll

    llllllllllllll

    lll

    lllllll

    lllllll

    lll

    l

    l

    l

    ll ll

    l

    l

    l

    l

    ll

    llll

    ll

    llll

    lllllllll

    ll

    llllll

    l

    ll

    l

    l l

    l ll

    l

    llll

    ll

    l

    ll

    l

    ll ll

    llll

    l

    lll

    l

    l l

    ll

    lllll

    l

    ll

    l

    lll

    l

    Petal.Width

    lllllllllllllllllllllllllllllllllllllllllll

    lllllll

    lllllll

    lll

    l

    l

    l

    llll

    l

    l

    l

    l

    llllllll

    llll

    lllllllllllllllll

    l

    lll

    ll

    lll

    l

    llll

    ll

    l

    ll

    l

    lllllllll

    llll

    ll

    ll

    lllll

    l

    lll

    lll

    l

    4.5 6.0 7.5

    llll l ll ll l llll llll ll lll llllllll ll lll lll llll lll ll ll

    ll ll ll ll lll llll lll ll llll llllllll lll l lllll lll lll ll l

    ll lll ll ll lll lll ll lll ll ll l lll l ll llll llll llll lllllll

    ll ll l llll l llll l lll lll lllll lllll l llll lll lll l l ll ll ll

    llll ll ll lll ll ll llll l lll llll lllll ll l lll lll lll l llll l

    ll lllll ll lll ll l ll lll llll lll ll ll llll l lllllll llll l ll

    1 3 5 7

    lllllllllllllllllllllll lllllllllllllllllllllllllll

    llll llll lll ll ll lllll ll lllllllllll lllllllllll lllll l

    ll lll ll llllllllll lll ll ll llll llllll l llll llll lllllll

    lllll llllllllll llllll ll lll lllll llllllllllll lllllll

    llll ll ll lll ll lllll ll ll lllll lllll l llllllll lll lllll l

    ll ll lllll lll ll lll lll llll llll ll ll lll llll l lll l llll ll

    1.0 2.0 3.0

    1.0

    2.0

    3.0

    Species

    Figure 1: A plain old pairs plot of the iris data.

    6

  • these counts can be displayed, however. In some cases, it is helpful to visuallyemphasize the two conditional distributions. In other cases, a display betterreflecting the joint distribution may be preferable. The association betweena categorical and a quantitative variable may be depicted using a box-and-whisker plot (Tukey 1977) or some variation thereof showing the conditionaldistribution.3

    Figure 2 gives a first simple example.4 For pairs of variables leading toscatterplots or boxplots, the information in the upper and lower diagonalsof this particular plot is redundant. However, the mosaic tiles between sexand day show both of the conditional distributions; the tile in row three,column four gives the distribution of day conditional on sex, for example.Histograms and bar charts on the diagonal reflect the marginal distributionsof the variables. Examples in Sections 3 and 4 illustrate a wide range ofrefinements of this basic generalized pairs plot; other variations on the thememay be envisioned and implemented by interested researchers.

    3 Exploratory Data Analysis and gpairs

    Our development of the generalized pairs plot follows in the exploratory dataanalysis (EDA) tradition of John Tukey and John Hartigan. EDA includesa wide range of activities. At the most basic level, what is a data set? In thevast majority of cases, the answer includes a description of the contents ofrows (cases, observations, subjects, . . . ) and columns (variables, char-acteristics, measurements, . . . ) as typically arranged in a spreadsheet. Arethere missing values? Are there both quantitative and categorical variables?Such simple descriptions often reveal important features and surprises thatmay demand attention prior to further analyses.

    A summary such as that shown in Table 1 is a good starting point; thesedata are from the 2010 Environmental Performance Index (Emerson, Esty,Levy, Kim, Mara, de Sherbinin and Srebotnjak 2010). Each of 231 countriesmay be classified as being landlocked (LandLock, having no direct accessto an ocean) and as having a high population density (HighPopDens). In-dices were constructed to reflect overall environmental performance (EPI) as

    3Yes, we could introduce the term fluctuation plot in this paragraph, but that seemsimplementation-specific and is a subset of the more general term mosaic plot.

    4Need some background and a proper citation of the tips data set here I think, perhapswith a comment that the plot has shortcomings, including obscuring the rounding of tips?

    7

  • total_bill

    2 4 6 8 10

    l

    l

    lll

    l

    l

    l

    l l

    l

    l

    lll

    l

    l

    lll

    ll

    l

    l

    ll

    ll

    ll

    l

    ll

    ll

    l

    l ll

    l

    lll

    l

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    lll

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    l l

    l

    l

    l

    l

    ll

    l

    l

    ll

    l

    l

    ll

    l

    ll l

    ll

    l

    l

    ll

    ll l

    l l

    l

    llll

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    l

    lll

    l

    ll

    l

    l l

    l

    l

    l

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l l

    ll

    l

    ll

    l

    lll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l l

    l

    ll

    l

    l

    l

    l

    ll

    l

    ll

    l

    ll

    ll

    ll

    l

    l l

    lll

    lll

    llllll

    Thur Fri Sat Sun

    lllll l

    llll ll

    1020304050

    l

    l

    lll

    l

    l

    l

    l l

    l

    l

    ll

    l

    l

    l

    lll

    ll

    l

    l

    ll

    ll

    ll

    l

    ll

    ll

    l

    l ll

    l

    lll

    l

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    llll

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    l l

    l

    l

    l

    l

    ll

    l

    l

    ll

    l

    l

    ll

    l

    ll ll

    l

    l

    l

    ll

    ll l

    l l

    l

    ll

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    l

    ll

    ll

    ll

    l

    l l

    l

    l

    l

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l l

    ll

    l

    ll

    l

    lll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l l

    l

    ll

    l

    l

    l

    l

    ll

    l

    ll

    l

    ll

    ll

    ll

    l

    l l

    2468

    10

    ll

    l ll

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    ll ll

    ll

    l

    l

    lll

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    lll

    l

    l

    l

    ll

    ll

    l

    l

    ll

    l

    ll

    l

    l

    lll

    l l

    l

    l

    l

    l

    l

    ll ll

    l

    l

    l

    l

    l

    l

    l

    l ll

    l

    l

    l

    ll

    l

    l

    l

    ll

    l

    ll

    l

    l lll

    l

    l

    l

    l

    ll

    l

    ll

    ll

    l

    ll

    l

    l

    lll l

    l

    l

    ll

    l

    l

    l ll

    l

    l

    l

    ll

    l

    l llll

    ll

    l

    ll

    l l

    l

    ll

    l

    lll

    ll

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    lll

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    ll

    l

    ll

    l

    ll

    l

    lll

    l

    l l l

    ll l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    ll l

    l

    ll

    l

    l

    l l l

    l

    l

    lll

    l

    tipl

    lll

    l

    l

    l

    ll

    ll

    ll

    ll

    ll

    lll

    l

    l

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    llll

    ll

    l

    l

    lll

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    ll l

    l

    l

    l

    ll

    l l

    l

    l

    l l

    l

    lll

    l

    l ll

    ll

    l

    l

    l

    l

    l

    llll

    l

    l

    l

    l

    l

    l

    l

    lll

    l

    l

    l

    ll

    l

    l

    l

    ll

    l

    ll

    l

    lll l

    l

    l

    l

    l

    ll

    l

    ll

    ll

    l

    ll

    l

    l

    ll ll

    l

    l

    ll

    l

    l

    lll

    l

    l

    l

    l l

    l

    ll ll ll

    l

    l

    ll

    ll

    l

    ll

    l

    lll

    ll

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    lll

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    ll

    l

    ll

    l

    ll

    l

    lll

    l

    lll

    l ll

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    l

    l

    l

    l

    ll

    l ll

    l

    ll

    l

    l

    lll

    l

    l

    lll

    l

    ll llll

    lll lll

    lll ll l

    lsex

    Mal

    eFe

    ma

    le

    ll

    ll ll l l

    Sun

    Sat

    FriT

    hur

    ll

    ll ll

    l

    lll ll

    ll llll

    l l day

    ll

    ll

    10 20 30 40 50

    l

    l l lll

    l

    ll

    ll

    ll

    ll ll

    lll

    l

    lll

    lll

    ll

    ll l

    l

    l

    ll

    l

    l

    llll

    l

    lll

    ll

    l

    ll

    l

    ll ll

    ll

    l llll

    l

    l ll

    l

    ll

    l l

    l

    ll

    ll

    lll l

    ll

    ll l

    l

    l l

    ll

    ll

    l

    ll

    ll ll

    ll

    l

    ll

    l

    ll

    l

    l

    l

    lll

    ll

    lll l

    l

    ll

    l

    l

    ll

    ll

    ll

    lll ll

    lll

    l l l

    l

    ll

    l

    l

    ll

    l

    ll l

    ll

    l

    ll

    l

    ll

    l

    lll

    ll

    lll ll

    l

    l

    l

    lll

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    ll

    ll l

    lll

    l

    ll lll

    lll

    l

    l

    l lll

    l ll

    l

    l

    l

    ll

    l

    ll

    ll

    l

    ll

    l

    l

    l

    l

    ll

    l

    l

    l

    lll

    l

    l

    l llll

    l

    ll

    ll

    ll

    ll ll

    lll

    l

    lll

    lll

    ll

    ll l

    l

    l

    ll

    l

    l

    llll

    l

    lll

    ll

    l

    lll

    ll ll

    ll

    l llll

    l

    l ll

    l

    ll

    l l

    l

    ll

    ll

    llll

    ll

    ll ll

    l l

    ll

    ll

    l

    ll

    ll ll

    ll

    l

    ll

    l

    ll

    l

    l

    l

    llll

    ll

    ll l

    l

    ll

    l

    l

    ll

    ll

    ll

    lll lllll

    l l l

    l

    ll

    l

    l

    ll

    l

    ll l

    ll

    l

    ll

    l

    ll

    l

    lll

    ll

    lll ll

    l

    l

    l

    lll

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    ll

    ll l

    lll

    l

    ll lll

    lll

    l

    l

    l lll

    l ll

    l

    l

    l

    ll

    l

    ll

    ll

    l

    ll

    l

    l

    l

    l

    lll

    l

    l

    lll

    l

    Female Malell

    l

    l

    l

    l

    l

    l ll

    l

    l

    20 40 60

    20

    40

    60percent

    Figure 2: Most basic example?

    8

  • well as performance on two subcategories, environmental health (ENVHEALTH)and ecosystem vitality (ECOSYSTEM). The range of observations of the sub-categories were coordinated because standardizing the variances led to unde-sirable features of the aggregate index; the range does not span the intervalfrom 0 to 100 because even the highest-performing country, for example, stillhas room for improving many aspects of the underlying performance metrics.Unfortunately, missing data prevented the construction of the ecosystem vi-tality index (and hence the overall environmental performance index) for 68of the countries; there was less missing information for environmental health(49 countries).

    variable.name type missing distinct.values precision min max1 Country character 0 231 NA AFG ZWE2 EPI numeric 68 163 1e-08 32.12 93.483 Landlock pure factor 0 2 NA No Yes4 HighPopDens pure factor 0 2 NA No Yes5 ENVHEALTH numeric 49 173 1e-08 0.06 95.096 ECOSYSTEM numeric 68 163 1e-08 0.06 95.09

    Table 1: A summary of a subset of the 2010 Environmental Performancedata using the whatis() function of R package YaleToolkit.

    EDA almost always proceeds with tables of categorical variables and uni-variate graphical displays such as histograms. Bivariate associates are oftenexplored with scatterplots and side-by-side boxplots, as appropriate, withtwo-way tables and mosaic plots used for pairs of categorical variables. Theboxplot shown in Figure 3 provides a standard graphical exploration of thebivariate association between a categorical variable (landlocked status, inthis case) and a continuous variable (the environmental health index). An-other popular alternative is a pair of stacked histograms, though the choiceof histogram bin widths may reveal or obscure information in the conditionaldistributions.

    Emerson et al. (2006) presented the barcode plot, originally developed byHartigan in the spirit of rugs and stripplots (see Chambers and Hastie (1992),for example) and named because of its similarity to the Universal ProductCode (UCP) on commercial packaging. Figure 4 shows how the barcodeplot can reveal an interesting aspect of the data not evident in the boxplot

    9

  • No

    Yes

    0 20 40 60 80

    Environmental Health

    Figure 3: A boxplot example.

    (Figure 3) and often obscured by histograms. In this case the tall spike inthe bottom-right part of the display leads to the discovery that Germany,Finland, France, Luxembourg, Norway, and New Zealand have identical val-ues of the environmental health index (only Luxembourg is landlocked). Inaddition, several pairs of countries were tied with similarly strong records ofenvironmental health. The barcode plot exhibits a histogram-like appear-ance in the presence of ties in quantitative variables, and discovery of suchties often lead to further data exploration.

    Our initial development of the generalized pairs plot combined scatter-plots, mosaic plots, and the detailed barcode plots with the higher-level sum-mary of traditional boxplots. Figure 5 shows a generalized pairs plot of se-lected variables from the 2010 Environmental Performance Index using thegpairs function of R extension package gpairs (Emerson and Green 2010).This particular plot avoids redundant displays, and highlights the United

    10

  • 0 20 40 60 80

    Environmental Health

    No

    Yes

    Figure 4: A barcode example. Jay and Walton need to consider adding aylab= option which in this case would be Landlocked.

    11

  • States with a larger red point in the scatterplot panels. For pairs of quanti-tative variables, scatterplots appear above the diagonal. Below the diagonal,correlations (with significance at the 5% level marked with an asterix bydefault) and numbers of missing values provide additional information, withthe shading providing visual reinforcement of the associations. For the twocategorical variables (Landlock and HighPopDens), the corresponding mo-saic tiles depict the two conditional distributions; Landlock conditional onHighPopDens is shown in the third row, second column, for example.

    Other options are supported but not shown here. Stripplots may be usedin place of boxplots or barcode plots, for example. Points may be customizedin scatterplot panels using alternative symbols, sizes and colors for the ex-ploration of high-dimensional patterns. A companion function, corrgram, isprovided for convenience (see Friendly (2002) for a nice discussion of theseplots).

    4 An Extension of the Grammar of Graphics

    and ggpairs

    5 Discussion

    References

    Anderson, E. (1935), The irises of the Gaspe Peninsula, Bulletin of theAmerican Iris Society, 59, 25.

    Basford, K. E., and Tukey, J. W. (1999), Graphical analysis of multiresponsedata : illustrated with a plant breeding trial, Boca Raton, Fla.: Chapman& Hall/CRC.

    Becker, R. A., Cleveland, W. S., and Shyu, M. J. (1996), The Visual Designand Control of Trellis Display, Journal of Computational and GraphicalStatistics, 5(2), 123155.

    Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983),Graphical Methods for Data Analysis, Belmont, California: WadsworthInternational Group.

    12

  • EPI

    Yes

    No 0 20 40 60 80

    l

    l

    l

    ll

    ll

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    ll

    l

    lll l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    l

    l

    l

    l

    l

    l l

    l

    ll ll

    l

    l

    l

    l l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    ll

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    ll

    ll

    lll

    l

    ll

    ll

    l l

    405060708090

    l

    l

    l

    ll

    ll

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    ll

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    ll

    l

    l lll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    lll

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    ll

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    ll

    ll

    lll

    l

    ll

    ll

    ll

    Yes

    NoLandlock

    HighPopDens

    Yes

    No

    020406080

    0.77*

    68 missing

    ENVHEALTH

    l

    l

    lll

    l

    l l

    l

    l

    l

    l

    l

    l

    l

    ll l

    l

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    ll

    l

    l

    lll

    l ll

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    l

    ll l

    ll

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    lll

    l

    l

    ll

    l

    l

    ll

    ll

    l

    l

    l

    l

    l

    l

    l

    l l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    l

    l

    l

    l

    l

    l

    ll

    lll

    ll

    ll

    l

    l

    l

    l

    l

    l

    l

    l

    405060708090

    0.32*

    68 missing

    lll

    No

    Yes

    ll

    l

    0.36*

    68 missing

    0 20 40 60 80

    020406080ECOSYSTEM

    Figure 5: Heres a nice gpairs caption.

    13

  • Chambers, J. M., and Hastie, T. J. (1992), Statistical Models in S, PacificGrove, California: Wadsworth & Brooks/Cole Advanced Books & Soft-ware.

    Chernoff, H. (1973), The uses of faces to represent points in k-dimensionalspace graphically, Journal of the American Statistical Association,68(???), 361368.

    Cleveland, W. S. (1985), The Elements of Graphing Data, Monterey, Cali-fornia: Wadsworth Advanced Books and Software.

    Cleveland, W. S. (1993), Visualizing Data, Summit, New Jersey: HobartPress.

    Emerson, J. W., Esty, D. C., Levy, M. A., Kim, C. H., Mara, V., de Sherbinin,A., and Srebotnjak, T. (2010), 2010 Environmental Performance Index,New Haven, Connecticut: Yale Center for Environmental Law and Pol-icy.

    Emerson, J. W., and Green, W. (2010), YaleToolkit: Data exploration toolsfrom Yale University. R package version 3.2.URL: http://CRAN.R-project.org/package=YaleToolkit

    Emerson, J. W., Green, W. A., and Hartigan, J. A. (2006), Barcodes, Gen-eralized Pairs Plots, and Sparkmats,, UseR! Conference presentation,Vienna.

    Fisher, R. A. (1936), The use of multiple measurements in taxonomic prob-lems, Annals of Eugenics, 7.

    Friendly, M. (1994), Mosaic displays for multi-way contingency tables,Journal of the American Statistical Association, 89, 190200.

    Friendly, M. (2002), Corrgrams: Exploratory displays for correlation matri-ces, American Statistician, 56(4), 316324.

    Hartigan, J. A. (1975), Printer graphics for clustering, Journal of StatisticalComputation and Simulation, 4, 187213.

    Hartigan, J., and Kleiner, B. (1984), A mosaic of television ratings, Amer-ican Statistician, 38(???), 3235.

    14

  • Playfair, W. (1786), The commercial and political atlas; representing, bymeans of stained copper-plate charts, the exports, imports, and generaltrade of England, at a single view. To which are added, Charts of therevenue and debts of Ireland, ... by James Corry, London: Debrett,Robinson, and Sewell.

    R Development Core Team (2005), R: A language and environment for statis-tical computing, R Foundation for Statistical Computing, Vienna, Aus-tria. ISBN 3-900051-07-0.URL: http://www.R-project.org

    Schloerke, B., Cook, D., Hofmann, H., and Wickham, H. (2010), GGally:Extension to ggplot2. R package version 0.2.2.1.URL: http://CRAN.R-project.org/package=GGally

    Tufte, E. (1983), The Visual Display of Quantitative Information, Cheshire,Conn.: The Graphics Press.

    Tukey, J. W. (1977), Exploratory Data Analysis, Reading, Mass.: AddisonWesley.

    Tukey, P. A., and Tukey, J. W. (1981), Graphical Display of Data Setsin Three Or More Dimensions, in Interpreting Multivariate Data, ed.V. Barnett, Chichester, United Kingdom: Wiley and Sons, pp. 189275.

    Wickham, H. (2009), ggplot2: elegant graphics for data analysis, New York:Springer.URL: http://had.co.nz/ggplot2/book

    Wilkinson, L. (1999), The Grammar of Graphics, New York: Springer-Verlag.

    15