Post on 04-Feb-2018
Data Visualization and RTheory and ImplementationMay 28, 2013, Koln, Germany
Ryan Womack (rwomack@rutgers.edu)Data Librarian, Rutgers University
IASSIST 2013
This work is licensed under a Creative Commons Attribution
-NonCommercial-ShareAlike 3.0 Unported License.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 1 / 40
Introduction
What this workshop IS:
Focuses on standard techniques of data visualization, theday-to-day power tools for understanding data
Reviews various graphical techniques, from early to recent, fromsimple to advanced
Discusses principles of good data presentation, and show the Rimplementation of many functions
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 2 / 40
Introduction
What this workshop is NOT:
It is not about “infographics”, the beautiful, heavily customizedproducts of expert graphic designers
It is not a complete introduction to R, even though R is used
It is not an introduction to other software beyond R, or the use ofR with scripting languages to produce interactive graphics on theweb
It is not necessarily a balanced survey of all data visualization. Inparticular, it is light on graph networks, clustering, and trees (notmy expertise)
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 3 / 40
Introduction
What you can hope to gain:
Familiarity with the basic principles and history of datavisualization, and recent developments
Exposure to a wide-range of plotting techniques and R packages
A sampling of interactive and big data methods
Understanding of the power and potential of combiningappropriate graphics techniques with data
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 4 / 40
Setup
Workshop materials, including R scripts, supplemental images anddata, are available for download fromhttp://www.rci.rutgers.edu/∼rwomack/IASSIST/Dataviz/
The R script file contains working demonstrations of the conceptsmentioned here.
You will have to install any packages not already on your system.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 5 / 40
Why Data Visualization?
Data visualization can:
provide clear understanding of patterns in data
detect hidden structures in data
condense information
Some examples of good data visualization can be found at:
Information Aesthetics
Chart Porn
Eagereyes
DataVis.ca
Visualizing.org
VizWiz
US Census Data Visualization Gallery
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 6 / 40
Bad Graphs
Pie Charts are known to be problematic
Clutter and other issues can ruin graphics
For more bad ideas, try:
Junk Charts
Ten Worst Graphs
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 7 / 40
R basics
install.packages, read.table, read.csv, library(foreign), library(xlsx)
help via help.start(), ?, library(help=”package”), vignette
Cookbook for R and Quick-R are good jumping off points
Guerilla Guide to R or R cheat sheets or Rutgers R Libguide
search Stack Overflow for R graphics to get specific coding tips ongraphics
R Graph Gallery has many examples of graphs with code, severaladapted for this workshop
Task Views or Crantastic can be use to discover pacakges. Inaddition to help and online documentation, packages are oftendescribed in articles published in places like the Journal ofStatistical Software.
This workshop will allow you to use the R language, but glossesover many details of structure and syntax to focus on the graphicalelements
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 8 / 40
Playfair
Astronomical observations, charts, and maps led in graphicalinnovation prior to 1800.
William Playfair is the pioneer of the line chart, bar chart, timeseries plots, and pie chart.
Playfair, W. (1786). Commercial and Political Atlas: Representing, byCopper-Plate Charts, the Progress of the Commerce, Revenues, Expenditure,and Debts of England, during the Whole of the Eighteenth Century,
Playfair, W. (1801). Statistical Breviary.
Both republished in The Commercial and Political Atlas and StatisticalBreviary, 2005, Cambridge University Press.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 9 / 40
Minard
Charles Joseph Minard was the next influential data graphic creatorafter Playfair.
Minard’s flow map of Napoleon’s Russian campaign is celebratedby Tufte and others as one of the greatest information graphics.
It embodies an ideal of highly compressed informative elements,presented with style
Six variables: size, location in 2 dimensions, the direction of thearmy, temperature, date [and group]
However, this is a one-off design that crosses into Infographics, butit can be reproduced in R and other software.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 10 / 40
Fisher and Tukey
Statisticians such as Ronald Fisher and John Tukey continued toadvance graphical methods for the analysis of data.
Fisher emphasized plotting the data to understand relationships.
Tukey’s Exploratory Data Analysis emphasized the use of graphicsto understand the data during analysis, rather than the finalpresentation to an outside audience.
Tukey created the stem and leaf plot.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 11 / 40
Tufte
Edward R. Tufte’s series of books, beginning with The Visual Displayof Quantitative Information, have become the most widely know workson data visualization.
There is considerable overlap between the various publications
Tufte’s ideal is highly compressed, elegant, and informative data,as expressed in dense printed graphics
Tufte sometimes emphasizes beauty and design to the detriment ofsimplicity and clarity [e.g., train schedules]
“Graphical elegance is often found in simplicity of design andcomplexity of data.”
“Beautiful graphics do not traffic with the trivial.”
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 12 / 40
Tufte’s principles
Tufte has developed and popularized numerous principles andterminology:
Graphics reveal data - show the data without distorting it - “above allelse show the data”
Small multiple - understanding one slice makes understanding otherseasier
Lie factor - effect shown/effect in reality
Graphical Integrity - no lies, let data vary, not design
Data density - maximize data/ink ratio
Sparklines - seems they haven’t caught on
chartjunk - self-explanatory
Powerpoint is responsible for most of the world’s sorrows [TheCognitive Style of Powerpoint]
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 13 / 40
Back to Pies
Why is the pie chart bad?
Low data density
Failure to order numbers along a visual dimension
Perception difficulty in judging area
Stacked bar charts also pose perceptual problems
Or in the terms of Gary Klass:
Data Ambiguity
Data Distortion
Data Distraction
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 14 / 40
Cleveland
William Cleveland’s Elements of Graphing Data and VisualizingData pioneered systematic considerations of data legibility
Cleveland is particularly known for promoting the dot plot as aalternative to bars and pies.
The dot plot provides clarity and easy comparison of data.
Cleveland also pioneered Trellis graphics
Trellis graphics emphasizes comparison of multiple panels of data
The lattice package implements Trellis graphics in R
See Cleveland.pdf for a summary of Cleveland’s recommendations
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 15 / 40
Cleveland - Techniques
TECHNIQUES
logs, % change, residuals
point graph [2d histogram], histogram, percentile graph [and withcomparisons/reference line], box plot [Tukey]
dot charts - best way to attach label to quantity 2-way dot chart{multiway} grouped dot chart
overlap is dealt with by jitter, distinguishable symbols{+sunflowers}, taking log or other transformation
box plots for high multiples
visually distinguish curves and points [this has gotten easy by now]
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 16 / 40
Cleveland - 3 or more variables
THREE OR MORE VARIABLES
Framed-rectangle graphs
scatterplot matrices
interaction/brushing
3d wireframe or stereogram (points)
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 17 / 40
Cleveland - Perception
PERCEPTION
pie v. dot chart
distance and detection
length in a stacked bar
45 degree banking [Tufte also recommends 1.5:1 horizontal tovertical ratio]
strive for clarity
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 18 / 40
lattice
The lattice package implements Trellis graphics in R
lattice excels at comparative plotting
uses similar syntax to base graphics, but with greater sorting andmanipulative power
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 19 / 40
The Grammar of Graphics
The Grammar of Graphics, by Leland Wilkinson, was extremelyinfluential in thinking about graphics
Grammar means ”rules for art and science”
The Grammar of graphics specifies rules both mathematical andaesthetic
Earlier graph producers focused on aesthetics of static content
Dynamic graphics and scientific visualization, by contrast, requiresophisticated designs to enable brushing, drill-down, zooming,linking
The Grammar of Graphics is easily adapted to this approach
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 20 / 40
Grammar of Graphics - pt I
DATA - weighting, reshaping, counting, bootstrapping
VARIABLES - transform, sort, log, ranking, residuals, quantiles
ALGEBRA - nesting or blending data
SCALES - nominal, ordinal, interval, ratio must be specified
STATISTICS - static methods available to all graph types e.g,mean, sd, smoothing
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 21 / 40
Grammar of Graphics - pt II
GEOMETRY - line, area, etc., along with modifiers like jitter anddodge
COORDINATES - refers to the coordinate system of the graph(cartesian, polar, etc.)
AESTHETICS - color, texture, size, position, etc. of the datapoints. Includes using color to classify.
FACETS - subgroups, multiway tables
GUIDES - legends, axes, color scales, keys
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 22 / 40
ggplot2
ggplot2 was developed by Hadley Wickham as an implementationof the Grammar of Graphics
Relatively complete and powerful graphics package
Can do many things, but not 3D
See ggplot2 Help Docs and the ggplot2 book for completedescriptions
Other short introductions to ggplot2 are available, such as thesefrom Sharp Statistics and inundata
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 23 / 40
A Miscellany of Visualizations
The Cleveland dot chart
use to compare labeled quantities, ordered lists
Kernel Density plot
visualize the distribution of data with more precision than ahistogram
Scatterplot Matrix
study relationships between all variable combinations
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 24 / 40
Visualizing Distributions of Data
Box and Whiskers Plot
illustrate quantiles and outliers. There is also a Tufte version.
Stem and Leaf Plot
see precise quantities associated with distribution of data
Violin plot
Blends density information with box and whiskers style (in anartistic manner)
Dot plot
plot distribution point-by-point
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 25 / 40
Categorizing Data
Many techniques are available to automatically identify related data:
A tree illustrates a categorical classification of the data based onits own characteristics
one implementation is the party package
Self-organizing maps are a form of neural network that derivescharacteristics from the data and plots patterns
see kohonen and som packages
Clustering of data can be accomplished by numerous algorithms
hclust and pvclust are some methods described at Quick-R’sCluster Analysis page.
Other graphs and network analysis tools are available [notexplored here]
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 26 / 40
Visualizing Categorical Data
The mosaic plot allows multiple categories to be displayed on thesame graph, but can be complicated to interpret.
The spineplot is a variant of the mosaic plot, plotting proportionsin 2 dimensions.
You can also do a timeline.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 27 / 40
Maps and Glyphs
Maps are obviously an important and widespread way of presentingdata.
We examine a few examples of choropleth maps, in which shadingindicates data levels
Glyphs present iconic representations of data elements as plottedpoints.
A dynamic example is here.
As an R example, consider Chernoff faces and the aplpack
package. Also, Smiley faces.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 28 / 40
3-D
3-D scatterplots
cloud (lattice)
contour plots
to plot standardized levels of data
wireframe plots
to present a 3-D surface representation of data
rgl (a separate package containing several 3d plotting functionsand animation)
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 29 / 40
Interactive DataViz - Principles
Why aren’t all of our graphs interactive?
Brushing is used to select data points and track them throughvarious analyses.
Drilling down, zooming, and subsetting are also interactivetechniques.
Data displays can be linked so that a selection in one panelmodifies the output displayed in another panel.
Interactivity is especially useful for data exploration, studyingmultidimensional relationships.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 30 / 40
Linked Data Panels and Vis packages
In many contexts, visualizing the relationships between data elementsis made easier by viewing related data panels simultaneously.
One example of this occurs in time series data with decompositioninto trend, seasonal, and random components
The tableplot (tabplot package) implements another linked dataview across all variables.
googleVis and other “Vis” packages, e.g. healthvis.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 31 / 40
Interactive Data in Practice
There are many R packages that allow for interactive data work in agraphical user interface, including:
playwith - versatile package that works with any graphicsfunction. Graphics can be explored, edited, and exported.
requires separate installation of GTK+ on your computer [method variesby OS]
rggobi - powerful 3-D tool for brushing, identifying andmanipulating data with a book and online companion site.
requires separate installation of GGobi on your computer [method variesby OS]
rattle - package designed for data mining, includes graphicsoptions alongside other statistical functions
latticist - allows complex linking of plots
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 32 / 40
Big Data
Big data presents special issues for data visualization
While many techniques and graphics are the same, explorationand plotting must be optimized for the size of the data set
Representation of the complexity of the data may require specialtechniques
hexbin
bigvis
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 33 / 40
Airline Data
For this exercise, we will use Airline on-time performance data.
This data contains information on every flight in the United Statesfrom 1987 to 2008, including arrival and departure times, delays,and other attributes.
The link above allows access to the full dataset. These are large.For example, the full 2008 data contains 7,009,728 records and is689 MB.
We will use an extract of selected variables from January 2008only. This subset contains 605,765 records and is 37.5 MB.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 34 / 40
Binning
hexbin resolves overplotting issues in large data sets by showingdensity
There are many other binning methods
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 35 / 40
bigvis
bigvis is a very new package by Hadley Wickham to deal with theissues of Big Data
There is a Preprint and R Meetup presentation by Hadley Wickham
Complete code, including the extracts adapted for this workshop, isavailable at https://github.com/hadley/bigvis-infovis
Target: process 100 million observations in under 5 seconds.
Fundamental principle: No need for more data points than there arepixels on the screen.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 36 / 40
bigvis steps
Condense (bin, condense)
Smooth (smooth, best_h, peel)
Visualize (autoplot plus standard methods)
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 37 / 40
Infographics links
Although not covered here, the following links are a sampling ofinfographics sites for your later enjoyment:
Data Storytelling in Video
Art of Data Visualization - in spite of its title, more on theinfographics side
Parisian Subway Traffic and New York Subway Inequality
Tulp Interactive
Information Aesthetics
Mapping London and London Riots + Twitter
YouTube Trends Map
Global Burden of Disease Visualizations
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 38 / 40
Other considerations
These are not illustrated in the code, but represent future topics forexploration.
confidence intervals
missing data [discuss]
color can be used to indicate certainty
“scagnostics”, borderlining scatterplots (scagnostics packages)
edge blur, crisp vs. fuzzy edges
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 39 / 40
References
There is also an online bibliography of references to accompany thispresentation on my home page.
Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers UniversityData Visualization and R IASSIST 2013 40 / 40