GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

14
331 Physical Geography, 2007, 28, 4, pp. 331–344. DOI: 10.2747/0272-3646.28.4.331 Copyright © 2007 by Bellwether Publishing, Ltd. All rights reserved. GEOGRAPHIC BOX PLOTS Cort J. Willmott Center for Climatic Research Department of Geography University of Delaware Newark, Delaware 19716 Scott M. Robeson Department of Geography Indiana University Bloomington, Indiana 47405 Kenji Matsuura Center for Climatic Research Department of Geography University of Delaware Newark, Delaware 19716 Abstract: Traditional box plots, described by Tukey in 1977, can be biased representa- tions of geographic or spatial variables, since they do not take into account the areas asso- ciated with the elements of geographic variables. To help solve this problem, we propose and evaluate a spatially (areally) weighted box plot, a “geographic box plot,” that can be used to describe a wide variety of geographic variables. Much of our discussion of box plots also applies to “traditional” and “geographic” histograms and percentile plots. Tradi- tional and geographic box plots of one hypothetical and two real-world geographic vari- ables are constructed and compared to illustrate our points, as well as to illustrate the often marked differences between traditional and geographic box plots of the same geographic variables. The real-world geographic variables considered here are human population density (in the year 2000) within the contiguous 48 United States and five-year-average (January 2001–December 2005) air temperature over the Arctic land surface. Our tradi- tional box plot of population density over represented the small, high-density states, whereas our corresponding geographic box plot correctly depicted the geographic distri- bution of population density. A geographic box plot correctly accounted for the reduction in spherical grid-node areas with latitude over the Arctic land surface, whereas a tradi- tional box plot of the spherical grid-node values did not. Our hope is that geographers and other scientists will make refinements to and find many more applications for our geo- graphic box plots. [Key words: statistical graphics, spatial statistics, visualization, box plots.] INTRODUCTION Each element of a digital (discretized) geographic or spatial variable almost always represents a finite area on Earth’s surface, either directly or indirectly. And, yet, the sizes or relative sizes of these finite areas are very rarely taken into account

Transcript of GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

Page 1: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

GEOGRAPHIC BOX PLOTS

Cort J. WillmottCenter for Climatic Research

Department of GeographyUniversity of Delaware

Newark, Delaware 19716

Scott M. RobesonDepartment of Geography

Indiana UniversityBloomington, Indiana 47405

Kenji MatsuuraCenter for Climatic Research

Department of GeographyUniversity of Delaware

Newark, Delaware 19716

Abstract: Traditional box plots, described by Tukey in 1977, can be biased representa-tions of geographic or spatial variables, since they do not take into account the areas asso-ciated with the elements of geographic variables. To help solve this problem, we propose and evaluate a spatially (areally) weighted box plot, a “geographic box plot,” that can be used to describe a wide variety of geographic variables. Much of our discussion of box plots also applies to “traditional” and “geographic” histograms and percentile plots. Tradi-tional and geographic box plots of one hypothetical and two real-world geographic vari-ables are constructed and compared to illustrate our points, as well as to illustrate the often marked differences between traditional and geographic box plots of the same geographic variables. The real-world geographic variables considered here are human population density (in the year 2000) within the contiguous 48 United States and five-year-average (January 2001–December 2005) air temperature over the Arctic land surface. Our tradi-tional box plot of population density over represented the small, high-density states, whereas our corresponding geographic box plot correctly depicted the geographic distri-bution of population density. A geographic box plot correctly accounted for the reduction in spherical grid-node areas with latitude over the Arctic land surface, whereas a tradi-tional box plot of the spherical grid-node values did not. Our hope is that geographers and other scientists will make refinements to and find many more applications for our geo-graphic box plots. [Key words: statistical graphics, spatial statistics, visualization, box plots.]

INTRODUCTION

Each element of a digital (discretized) geographic or spatial variable almost always represents a finite area on Earth’s surface, either directly or indirectly. And, yet, the sizes or relative sizes of these finite areas are very rarely taken into account

331

Physical Geography, 2007, 28, 4, pp. 331–344. DOI: 10.2747/0272-3646.28.4.331Copyright © 2007 by Bellwether Publishing, Ltd. All rights reserved.

Page 2: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

332 WILLMOTT ET AL.

when constructing or interpreting a distributional graphic—a box plot, a histogram or a set of percentiles—associated with a geographic or spatial variable of interest, even when the subsequent analyses are spatially explicit (cf. Wise et al., 2001; Anselin et al., 2006; Broennimann et al., 2006). One purpose of this paper, there-fore, is to introduce a way in which to take these areas into account when construct-ing a box plot of a digital geographic variable. The implications of including or excluding the influences of these areas in a box plot are considered as well. A box plot that takes into account the finite areas associated with its variable’s elements we call a “geographic box plot,” whereas a box plot that does not we refer to as a “traditional box plot.” Many of our points also apply to the construction of geo-graphic histograms and percentiles.

Our interest in box plots and other visualization tools developed, in part, because most graphs and maps of climate change oversimplify the change. Time-change graphs of global warming, for instance, often only show estimated changes in a single statistic, usually the spatial mean, over time. Considerably less fre-quently, climatologists have attempted to show time changes in the spatial distribu-tion of climate by drawing time series of a changing distributional graphic. The traditional (nonspatial) box plot has been used most commonly, albeit inappropri-ately (e.g., Grotch and MacCracken, 1991). And so, we have begun to explore bet-ter ways to visualize climate change; that is, to show temporal changes in the spatial distribution of climate. Rawlins and Willmott (2003), for example, used simple geo-graphic box plots to show changes in the spatial distribution of annual-average air temperature within the Arctic over a 30-year period, while Robeson (2002, 2004) used time-varying percentiles and cluster analysis to show geographic regions that had similar trends in air-temperature frequency distributions.

It is important to mention that geographers (e.g., Haining, 2003; Anselin et al., 2006) and others (e.g., Cressie, 1993) have made significant improvements in our statistical analyses of geographic or spatial data. Commensurate advances in our distributional graphics (box plots, histograms, percentile graphs, et cetera) of geo-graphic or spatial data, however, do not appear to have been made. Within the widely used GeoDa software, for instance, Anselin et al. (2006) employ traditional (nonspatial) box plots and histograms to help visualize spatial variables, which illustrates this point. The limited attention paid to area-weighting within the distri-butional graphics of geographic variables is somewhat surprising given that area-weighting is a well-known component of spatial statistics (Haining, 2003).

Our discussion begins with a brief description and assessment of both traditional and geographic box plots. Included within this discussion is a synopsis of ways to construct these two types of box plots. A hypothetical geographic variable is intro-duced and used to show some important differences between traditional and geo-graphic box plots of geographic variables. Two disparate real-world examples—human population density within the United States and five-year-average air tem-perature over the Arctic land surface—then are introduced and examined to illus-trate some striking differences that can occur between traditional and geographic box plots of real geographic variables.

Page 3: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

GEOGRAPHIC BOX PLOTS 333

BOX PLOTS

Traditional Box Plots

The box plot was proposed by John Tukey (1977) as a means of summarizing graphically the frequency distribution of one or more variables of interest. Initially, Tukey’s box plot (he called it the “box-and-whisker” plot) was based on five ordinal markers (numbers) derived from a variable of interest. The five numbers were the minimum observation, the lower “hinge” (the lower quartile), the median, the upper hinge (the upper quartile), and the maximum observation. The median and then the two quartiles were selected or estimated according to straightforward counting and interpolation rules (Tukey, 1977). When drawn in portrait mode, the graphical elements of a box plot include the “box” (a rectangle bounded on the bot-tom by the lower quartile or hinge and on the top by the upper quartile or hinge), two vertical “whiskers” (one extending from the lower quartile to the minimum value and the other from the upper quartile to the maximum value), and a horizon-tal line, drawn across the box, to represent the median (Fig. 1). The width of the box may be selected for aesthetic reasons only (as in Fig. 1) or it may be scaled to con-vey useful information, as we discuss below (in the “Indicating Reliability” subsec-tion). Any box plot that includes these five (unweighted) numbers we refer to as a “traditional box plot,” although we also recognize that there have been a number of

Fig. 1. Traditional and geographic (five-number) box plots of a hypothetical geographic variable (Table 1). The traditional box plot considers each observation to be of equal weight, whereas the geo-graphic box plot considers each observation to be weighted proportionally to the area that it represents. Within the rectangle-stack depictions of the sorted traditional and geographic variables, each weight is indicated by the relative-size (area) of the rectangle surrounding the observation. Blackened circles represent the positions of the traditional and geographic means. The width of each hypothetical box plot, along the abscissa, is arbitrary.

Page 4: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

334 WILLMOTT ET AL.

“enhancements” of or variations on the original box plot (Cleveland, 1993; Wilks, 2006). Visual separation among the five primary numbers facilitates rapid interpre-tation of variability, symmetry or asymmetry, and extremes. The simplicity of the box plot also enables rapid appreciation of differences and similarities among sev-eral variables, when each is represented by a box plot and the units of observation are comparable (cf. Cleveland, 1993; Rawlins and Willmott, 2003).

Our approach to constructing a five-number traditional box plot essentially fol-lows Tukey (1977). The first two steps in putting together a traditional box plot of a variable (x) are to count (to obtain n) and sort (here, in ascending order) the ele-ments of x (Table 1), where n is the number of elements or observations within x. Using the set of sorted elements (x'), the next steps are to select the minimum (x'1) and maximum (x'n) observations, and to identify or estimate the median (or 50th percentile). The final steps are to select or estimate the upper and lower quartiles. There is more than one way to evaluate the median and also the quartiles (Cleveland, 1993); however, linear interpolation between adjacent, sorted elements usually is employed. Within the sorted variable (x'), the position (M) of the median (or 50th percentile) can be obtained from M = 0.5(n + 1). The value of the median then can be interpolated according to

(1)

where t = M-int(M) and the function int( ) excises any decimal part of M and retains the integer part. With L = 0.5(n/2 + 1) and U = 0.5(3n/2 + 1), replacing M with L or U in equation 1 solves for the lower ( ) or upper ( ) quartile. Note that 0.5(3n/2 + 1) is a simplification of [(n + 1) – 0.5(n/2 + 1)]. Once these five numbers have been estimated, a “traditional box plot” may be rendered (Fig. 1). While the

Table 1. Hypothetical Data (Arbitrary Units for Both the Observations and Areas) Used to Demonstrate the Need for and Use of Geographic Box Plotsa

i xi ai x'i a'i

1 4 1 1 1 1/10 1/10

2 9 1 3 1 1/10 2/10

3 1 1 4 1 1/10 3/10

4 3 1 6 1 1/10 4/10

5 10 2 7 3 3/10 7/10

6 6 1 9 1 1/10 8/10

7 7 3 10 2 2/10 10/10aThese data are plotted as both traditional and geographic box plots in Figure 1. Reading the Table from left to right, the variables (columns) are: the index or subscript (i) associated with each obser-vation, the observations (xi), the areas associated with the observations (ai), the sorted observations (x'i), and the areas associated with the sorted observations (a'i). Each element of the second-to-last column is the proportion of the total area associated with a sorted observation, whereas each ele-ment of the last column is a cumulative proportion (from 1 to and including i) made from the ele-ments (proportions) in the second-to-last column.

a'i a'kk 1=

n

∑⁄ a'kk 1=

i

∑ a'kk 1=

n

∑⁄

x'ˆ M x'int M( ) x'int M( ) 1+ x'int M( )–( )t+=

x'ˆ L x'ˆ U

Page 5: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

GEOGRAPHIC BOX PLOTS 335

arithmetic mean is not an essential component of a box plot, it too can be and often is indicated within box plots (e.g., Fig. 1).

Outliers can and often do have an undo and deleterious influence on interpreta-tions of variability as well as on the values of many parametric statistics, such as the mean, variance, and regression parameters. It is prudent, therefore, to show important outliers within the context of a box plot, in addition to the two extremes. Tukey (1977), of course, recognized the importance of depicting nontrivial outliers, and he proposed a way to augment his box plot to indicate meaningful outliers. He defined meaningful outliers as values of x that are greater than 1.5 times the interquartile range (r =

) larger than the upper quartile ( ) or smaller than the lower quartile ( ). In the presence of meaningful outliers, Tukey’s box plot can be redrafted such that the vertical lines (the “whiskers”) extend from the lower and upper edges of the box to the smallest and largest observations, respectively, that are not considered to be outliers; that is, instead of to the minimum and maximum values. Outliers of interest then can be plotted individually. Our “real-world” examples (see the “Real-world…” section, below) employ such an extended form of Tukey’s box plot.

Geographic Box Plots

As with traditional box plots, the first two steps in constructing a geographic box plot are to count and sort the elements of x (Table 1). However, since each element of x now has an area associated with it, there is a second n-element variable (a), composed of the corresponding areas, that has to be taken into account. Within a, a1 is the area represented by x1, a2 is the area represented by x2, a3 is the area rep-resented by x3 and so on. Now, when x is sorted to obtain (x'), a corresponding set of the areas (a') has to be created from a, such that each element of a' is in pairwise correspondence with the value of x' that it represents (cf. Table 1). In other words, a'1 is the area represented by x'1, a'2 is the area represented by x'2, a'3 is the area represented by x'3 and so on.

One aspect of constructing a geographic box plot differs considerably from our description of constructing traditional box plots: in the vast majority of geographic instances, no interpolation from the adjacent sorted elements is required. Since the finite areas (ai s) associated with the xi s (e.g., Table 1) completely span the study area, any geographic percentile of interest (e.g., the geographic median) typically falls within a polygon or grid cell; in turn, the percentile should be assigned the value of x which is associated with and represents that polygon or grid cell (Fig. 1). It also should be kept in mind that, within the sorted-areas variable (a'), the adjacent areas are not necessarily contiguous geographically and this makes distance-based interpolation inappropriate. Interpolation is needed only in the rare instance that a percentile falls exactly on a boundary between two or more polygons or grid cells. An adequate way to make such an interpolation may simply be to take the average of the adjacent polygon or grid-cell values.

When an element of a digital variable is georeferenced, it is usually associated with an area on Earth’s surface; in turn, the n-element variable (x) represents n areas as well as spatial variation in x over Earth’s surface. Consider two well-known and often-analyzed geographic variables, human population density (p) and surface air

x'ˆ U x'ˆ L– x'ˆ U x'ˆ L

Page 6: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

336 WILLMOTT ET AL.

temperature (T). The connection between p data and the census polygons (areas) from which they were obtained is both clear and fundamental. Human population density within a particular census enumeration district (pi) represents all locations within the area (ai) bounded by the census polygon. An estimate of T made from a satellite observation or by a climate model is a grid-cell-area average and therefore intrinsically represents an area. Similarly, a spatially interpolated grid-node T value may be an estimated point value but also one that is commonly taken to represent the grid box (area) within which it lies. Our point is that the vast majority of georef-erenced, digital data represent finite areas and yet traditional box plots of these data do not take area into account. Geographic box plots directly consider the associ-ated areas of each geographic or spatial variable.

Within some circumstances, the sizes of the areas associated with the elements of a geographic variable may be unclear. It is often useful in these situations to spatially interpolate the values of the variable to the nodes of a regular grid, primarily because the areas of regular-grid cells are easily found. Climate scientists, for instance, usu-ally interpolate their observed climate variables (e.g., air temperature) from the weather-station locations where they are observed to the nodes of a regular spherical grid prior to any spatial analysis (e.g., Willmott and Matsuura, 1995). This makes the spatial analyses of the variable vastly more straightforward than if the spatial analy-ses are performed on the irregularly spaced weather-station values. We note that, in the special case that the grid is an equal-area grid, traditional and geographic box plots of the gridded variable will be commensurate (White et al., 1992; Carr et al., 1999; Sahr et al., 2003; Robeson, 2004). Most geographic variables are not resolved onto equal-area grids, however, and sometimes for good reason; therefore, it should not be concluded that interpolation onto an equal-area grid prior to rendering a box plot or other distributional graphic is a general solution to our problem.

Obtaining the geographic minimum and maximum values is easy in that they are the same two extremes selected for the traditional box plot; that is, they are x'1 and x'n, respectively. Finding geographic percentiles—such as the geographic median and two quartiles (25th and 75th geographic percentiles)—within the sorted digital variable, however, differs considerably from evaluating their traditional counter-parts. Our geographic median is taken as the value or estimated value of x' that, to the extent possible, separates the 50 percent of the study area with values less than the geographic median from the other 50 percent of the study area that contains greater values. In the same vein, we take the lower geographic quartile as the value of x' that, as nearly as possible, bounds the 25 percent of the study area that con-tains values less than the value of the lower geographic quartile. The upper geo-graphic quartile is obtained analogously but for the 25 percent of the study area that contains values greater than that of the upper geographic quartile. Our procedure for selecting geographic percentiles of interest from x' and a' is illustrated below.

Working with the sorted variable (x'), the initial step in identifying the geographic median is to locate the first i, as i increases, for which:

;

then, the geographic median can be taken as x'i. The two geographic quartiles (25th and 75th geographic percentiles) can be found in an analogous fashion, where the

a'kk 1=

i

∑ a'kk 1=

n

∑⁄ 0.5>

Page 7: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

GEOGRAPHIC BOX PLOTS 337

thresholds are 0.25 and 0.75 rather than 0.5. Once these five numbers have been obtained, a “geographic box plot” may be drawn (Fig. 1).

Indicating Reliability

Within a traditional or geographic box plot, information about the reliability of the box plot can be valuable. A useful and straightforward way to infer reliability—the extent to which the observations are a representative sample—is to make the width of the box proportional to a measure of reliability. It is not uncommon, for example, to encounter a traditional box plot whose width is a function of n(Haining, 2003), where n is assumed to be related to reliability, directly. Reliability information can be especially important when one is comparing a number of box plots, each of which represents a different sample of the same variable. However, when constructing a geographic box plot, using a reliability measure that is a func-tion of n only is usually inappropriate because n is not consistently related to the spatial structure of a sample; that is, to where the sample points are located. As a consequence, we recommend making the width of a geographic box plot propor-tional to an appropriate measure of the spatial fidelity of the sample; for example, to a measure of typical sample-point density within the study area. When the data are in the form of a gridded field, the width of a geographic box plot could be made inversely proportional to the spatial resolution of the data.

“REAL-WORLD” COMPARISONS OF TRADITIONAL ANDGEOGRAPHIC BOX PLOTS

Population Densities within the Contiguous United States

Human population density (p) within a census enumeration district is usually estimated by dividing the district’s population count or estimate by the district’s land-area extent. The density estimate, in turn, has the dimensions of persons per unit area and it represents all of the area within the enumeration district. Within this paper, we evaluate p for each of the 48 contiguous United States (using the 2000 Census Bureau estimate of each state’s population), and use these 48 density esti-mates to help us show why geographic box plots are preferred, over traditional box plots, for representing the variation within a geographic or spatial variable.

When p for each of the lower 48 states is mapped (Fig. 2), two important aspects of the spatial variability in p are apparent. In the first instance, it is clear that the highest values of p occur within the eastern reaches of the country, with the notable exception of California, while relatively low values of p appear principally within the west; once again, with the marked exception of California. It also is clear that the areal extents (sizes) of the states are quite variable, and that many of the highest values of p are found within relatively small states. It follows that both the popula-tion densities and areas of the states should be taken into account when visualizing the distribution of p across the contiguous United States.

A traditional box plot, constructed from the 48 values of p alone (Fig. 3), over represents small, high-density states and, in turn, has a biased distribution of pvalues across the United States. The estimated (with each state’s density equally

Page 8: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

338 WILLMOTT ET AL.

weighted) mean and median, for example, are far too large, about 71 and 34 per-sons per km2. A geographic box plot (Fig. 3), in contrast, depicts considerably lower

Fig. 2. Estimated human population density (persons/km2) within each of the 48 contiguous United States in 2000.

Fig. 3. Traditional and geographic box plots of human population densities (persons/km2) across the 48 contiguous United States in 2000.

Page 9: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

GEOGRAPHIC BOX PLOTS 339

estimates of the mean and median of p, 36.4 and 21.4 respectively, which are cor-roborated by Census Bureau estimates. The interquartile range associated with the traditional box plot also is too large, which in turn alters the definition of an outlier (New York and Delaware are not considered to be outliers in the traditional box plot but they are in the geographic box plot). Our two box plots of the spatial distribu-tion of p confirm that, when geographic observations of interest are associated with areas of differing size, traditional box plots are biased representations and geo-graphic box plots are preferred.

Average Air Temperature across the Arctic Land Surface

Our station-average Arctic air temperatures were derived from five years of monthly air-temperature observations (January 2001 through December 2005) that

Fig. 4. Locations of the 2,060 GHCN (Version 2) and GSOD stations north of 40ºN with a sufficient number of monthly average air-temperature observations (within the period from 2001–2005) from which to compute a five-year mean.

Page 10: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

340 WILLMOTT ET AL.

were drawn from available station records contained within an updated version of the Global Historical Climatology Network (GHCN), version 2 (Peterson and Vose, 1997) and the National Climatic Data Center (NCDC) Global Summary of the Day (GSOD) archive (which is updated daily). Each station selected had no more than 10 percent of the monthly observations missing, and its five-year air-temperature record was averaged temporally to obtain the five-year mean; that is, to obtain our estimate of average air temperature at the station. A map of the 2,060 available sta-tion locations (Fig. 4) illustrates the uneven spatial distribution of station air-temperature records over the Arctic land surface.

Our 2,060 station-average air temperatures are interpolated spatially to a 0.5º × 0.5º spherical grid using an updated version of Willmott and Matsuura’s (1995) digital-elevation-model-assisted (DEM-assisted) interpolator. The interpolator makes combined use of the average air temperatures at the stations, a DEM, an

Fig. 5. DEM-assisted interpolation of five-year-average air temperature (ºC) over the Arctic land surface.

Page 11: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

GEOGRAPHIC BOX PLOTS 341

average surface air-temperature lapse rate (6.0ºC/km), and a modified version of Willmott et al.’s (1985) traditional spherical interpolator. The DEM-assisted, interpo-lated map (Fig. 5) of five-year-average air temperature (T) depicts the large-scale spatial variability in high-latitude average air temperature. It also better represents the smaller-scale variability, especially within mountainous regions, than do tradi-tionally interpolated maps.

Three box plots of the spatial distribution of average T over the Arctic land sur-face are constructed from our station and interpolated spatially (gridded) averages. These box plots are compared to illustrate the effects of using geographic versus tra-ditional box plots to visualize climate data (Fig. 6). Our first box plot is a traditional one, constructed only from our five-year averages at the stations (Fig. 6). It portrays a very warm Arctic land surface (7.4ºC, on average) with limited spatial variability (e.g., the interquartile range is just 5.4ºC). This traditional box plot and its associ-ated statistics are significantly biased, however, because the station network under samples the colder high-latitude and high-elevation environments (Fig. 4). Within weather-station networks, such sampling biases are not uncommon and our Arctic land-surface example illustrates why—when the station values are interpolated onto a regular grid—the gridded values (discussed below) tend to be a better-conditioned sample of the climate field than do the original station values.

Our second box plot (Fig. 6) also is a traditional box plot, but of the gridded five-year averages of T. It depicts a much cooler Arctic land surface (–2.9ºC, on average), with considerably more spatial variability (e.g., the interquartile range is just over 14ºC). These changes are expected, of course, because spatial interpolation to a regular spherical grid tends to reduce the biasing influences of spatially variable

Fig. 6. Traditional box plots of five-year average T at the stations and five-year average T estimated at the grid nodes; geographic box plot of five-year average T estimated at the grid nodes.

Page 12: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

342 WILLMOTT ET AL.

station densities. It is interesting to note that none of the “outliers” identified in the traditional station-network box plot appear as outliers when the station data are gridded. The main problem with this traditional box plot of the gridded T field is that it does not take into account the fact that the size (area) of spherical grid cells decreases with latitude, which induces a cold bias.

Our third box plot (Fig. 6) is a geographic one, which makes the influence of each grid-cell value proportional to the area of the grid cell that it represents. It more correctly depicts a somewhat warmer Arctic land surface (–0.4ºC, on average), with a slightly reduced spatial variability (e.g., the interquartile range is just over 12ºC), because it decreases the relative influence of the smaller, higher-latitude grid cells and increases the relative influence of the larger, lower-latitude grid cells. Within our geographic box plot, only the coldest grid points on the Greenland ice sheet (Fig. 5) are considered to be outliers. Our geographic box plot of the gridded Arctic land-surface T field illustrates why geographic box plots are more appropriate representa-tions of geographic or spatial distributions than are traditional box plots.

CONCLUDING REMARKS

Geographers, climate scientists and other researchers often use “traditional box plots” to graphically represent the distributions of geographic or spatial variables. Traditional box plots, however, sometimes convey biased impressions of spatial dis-tributions, primarily because traditional box plots do not take into account the areas associated with the elements of geographic or spatial variables. Such biases also are common within “traditional” histograms and percentile plots that appear regularly in the geographic and wider scientific literature. Within this article then, we intro-duced and assessed a way in which to construct a spatially weighted box plot, a “geographic box plot,” of virtually any digital geographic variable of interest. It also was suggested that the width of a geographic box plot be made proportional to a measure of reliability, to infer the extent to which the elements of the variable of interest comprise a spatially representative sample. Many of our points about box plots also apply to the construction and interpretation of “geographic” histograms and percentile plots.

Including, within a box plot, the influences of the finite areas associated with a geographic variable’s elements was explored in several ways. Among these were comparisons of geographic and traditional box plots alternately constructed from the same variables. One hypothetical and two real-world comparisons were pre-sented to demonstrate our points. Comparisons of geographic and traditional box plots of human population density within the United States and five-year-average air temperature over the Arctic land surface were made to illustrate the often sub-stantial differences between traditional and geographic box plots of real geographic variables.

Our traditional box plot of population density within the contiguous 48 states overrepresented the small, high-density states and, therefore, substantially overesti-mated the geographic median as well as the mean. Our geographic box plot, on the other hand, accurately portrayed the geographic distribution of population density, as well as its geographic median and mean. Two traditional box plots of the spatial

Page 13: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

GEOGRAPHIC BOX PLOTS 343

distribution of five-year average air temperature over the Arctic land surface were constructed as well. The first one, made from the raw weather-station data, greatly overestimated average air temperature owing to the warm-region sampling bias of the station-observation network; the second one, made from spatially interpolated grid-node values, underestimated average air temperature because the influence of smaller, colder, high-latitude grid cells was tacitly overrepresented. Our third air-temperature box plot, a geographic one, also was made from the interpolated grid-node values; this time, however, the influence of each grid-cell value was made proportional to the area of the grid cell that it represents. Our geographic box plot correctly depicted a slightly warmer Arctic land surface (than did the traditional box plot) with a somewhat reduced spatial variability.

We believe that geographers, climatologists and other scientists will make refine-ments to and find many more applications for our geographic box plots, especially since geographic box plots can be introduced easily into standard GIS and statisti-cal software.

Acknowledgments: Elements of this paper are rooted in concepts previously considered by Willmott and his graduate students. In particular, we are indebted to Mike Rawlins for his efforts to implement an early version of Willmott’s fledgling geographic-box-plot idea and, more recently, to Elsa Nickl and Michelle Johnson for their efforts to adapt the concept to visualizing spatial-resolution errors within gridded precipitation fields. Portions of this work were made possible by NASA Grant NNG06GB54G to the Institute of Global Environment and Society (IGES) and by NSF Grant 0451716, and we are most grateful for this support.

REFERENCES

Anselin, L., Syabri, I., and Kho, Y. (2006) GeoDa: An introduction to spatial data analysis. Geographical Analysis, Vol. 38, 5–22.

Broennimann, O., Thuiller, W., Hughes, G. O., Midgley, G. F., Alkemade, J. R. M., and Guisan, A. (2006) Do geographic distribution, niche property and life form explain plants’ vulnerability to global change? Global Change Biology, Vol. 12, 1079–1093.

Carr, D. B., Olsen, A. R., Pierson, S., and Courbois, J. (1999) Boxplot variations in a spatial context: An Omernik ecoregion and weather Example. Statistical Com-puting and Graphics Newsletter (American Statistical Association, Alexandria, VA), Vol. 9, No. 2, 4–13.

Cleveland, W. S. (1993) Visualizing Data. Summit, NJ: Hobart Press.Cressie, N. A. C. (1993) Statistics for Spatial Data (revised edition). New York, NY:

Wiley.Grotch, S. and MacCracken, M. (1991) The use of general circulation models to

predict regional climate change. Journal of Climate, Vol. 4, 286–303.Haining, R. P. (2003) Spatial Data Analysis: Theory and Practice. Cambridge, UK:

Cambridge University Press.Peterson, T. C. and Vose, R. S. (1997) An overview of the Global Historical Clima-

tology Network temperature database. Bulletin of the American Meteorological Society, Vol. 78, 2837–2849.

Page 14: GEOGRAPHIC BOX PLOTS - Welcome to the University of Delaware

344 WILLMOTT ET AL.

Rawlins, M. A. and Willmott, C. J. (2003) Winter air temperature change over the terrestrial Arctic, 1961–1990. Arctic, Antarctic and Alpine Research, Vol. 35, No. 4, 530–537.

Robeson, S. M. (2002) Increasing growing-season length in Illinois during the 20th century. Climatic Change, Vol. 52, 219–238.

Robeson, S. M. (2004) Trends in time-varying percentiles of daily minimum and maximum temperature over North America. Geophysical Research Letters, Vol. 31, L04203, doi:10.1029/2003GL019019.

Sahr, K., White, D., and Kimerling, A. J. (2003) Geodesic discrete global grid sys-tems. Cartography and Geographic Information Science, Vol. 30, 121–134.

Tukey, J. W. (1977) Exploratory Data Analysis. Reading, MA: Addison-Wesley.White, D., Kimerling, A. J., and Overton, W. S. (1992) Cartographic and geometric

components of a global sampling design for environmental monitoring. Cartog-raphy and Geographic Information Systems, Vol. 19, No. 1, 5–22.

Wilks, D. S. (2006) Statistical Methods in the Atmospheric Sciences (2nd Edition). New York, NY: Academic Press.

Willmott, C. J. and Matsuura, K. (1995) Smart interpolation of annually averaged air temperature in the United States. Journal of Applied Meteorology, Vol. 34, 2577–2586.

Willmott, C. J., Rowe, C. M., and Philpot, W. D. (1985) Small-scale climate maps: A sensitivity analysis of some common assumptions associated with grid-point interpolation and contouring. American Cartographer, Vol. 12, 5–16.

Wise, S., Haining, R., and Ma, J. (2001) Providing spatial statistical data analysis functionality for the GIS user: The SAGE project. International Journal of Geo-graphical Information Science, Vol. 15, No. 3, 239–254.