Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization...

57
Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for Data Analysis: Cook and Swayne Padhraic Smyth’s UCI lecture notes R Graphics: Paul Murrell Graphics of Large Datasets: Visualizing a Milion: Unwin, Theus and Hofmann 1

Transcript of Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization...

Page 1: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Exploratory Data Analysis and Data

Visualization

Chapter 2credits:

Interactive and Dyamic Graphics for Data Analysis: Cook and SwaynePadhraic Smyth’s UCI lecture notes

R Graphics: Paul MurrellGraphics of Large Datasets: Visualizing a Milion: Unwin, Theus and Hofmann

1

Page 2: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Outline

• EDA• Visualization

– One variable– Two variables– More than two variables– Other types of data– Dimension reduction

2

Page 3: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

EDA and Visualization

• Exploratory Data Analysis (EDA) and Visualization are important (necessary?) steps in any analysis task.

• get to know your data!– distributions (symmetric, normal, skewed)– data quality problems– outliers– correlations and inter-relationships– subsets of interest– suggest functional relationships

• Sometimes EDA or viz might be the goal!

3

Page 4: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University 4

flowingdata.com 9/9/11flowingdata.com 9/9/11

Page 5: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University 5

NYTimes 7/26/11NYTimes 7/26/11

Page 6: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Exploratory Data Analysis (EDA)

• Goal: get a general sense of the data – means, medians, quantiles, histograms, boxplots• You should always look at every variable - you will

learn something!• data-driven (model-free)• Think interactive and visual

– Humans are the best pattern recognizers– You can use more than 2 dimensions!

• x,y,z, space, color, time….• especially useful in early stages of data mining

– detect outliers (e.g. assess data quality)– test assumptions (e.g. normal distributions or skewed?)– identify useful raw data & transforms (e.g. log(x))

• Bottom line: it is always well worth looking at your data!

6

Page 7: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Summary Statistics• not visual• sample statistics of data X

– mean: = i Xi / n – mode: most common value in X– median: X=sort(X), median = Xn/2 (half below, half

above)– quartiles of sorted X: Q1 value = X0.25n , Q3 value = X0.75 n

• interquartile range: value(Q3) - value(Q1)• range: max(X) - min(X) = Xn - X1

– variance: 2 = i (Xi - )2 / n – skewness: i (Xi - )3 / [ (i (Xi - )2)3/2 ]

• zero if symmetric; right-skewed more common (what kind of data is right skewed?)

– number of distinct values for a variable (see unique() in R)

– Don’t need to report all of thses: Bottom line…do these numbers make sense???

7

Page 8: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Single Variable Visualization• Histogram:

– Shows center, variability, skewness, modality, – outliers, or strange patterns.– Bins matter– Beware of real zeros

8

Page 9: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Issues with Histograms

• For small data sets, histograms can be misleading. – Small changes in the data, bins, or anchor can deceive

• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.

• Histograms effectively only work with 1 variable at a time– But ‘small multiples’ can be effective

9

Page 10: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University 10

But be careful with axes and scales!

But be careful with axes and scales!

Page 11: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Smoothed Histograms - Density Estimates

• Kernel estimates smooth out the contribution of each datapoint over a local neighborhood of that point.

ˆ f (x) = 1nh K(

x − x i

h)

i=1

n

h is the kernel width

• Gaussian kernel is common:2

)(

2

1⎟⎠⎞

⎜⎝⎛ −

−h

ixx

Ce

11

Page 12: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University 12

Bandwidth choice is an art

Usually want to try several

Page 13: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Boxplots

• Shows a lot of information about a variable in one plot– Median– IQR– Outliers– Range– Skewness

• Negatives– Overplotting – Hard to tell

distributional shape– no standard

implementation in software (many options for whiskers, outliers)

13

Page 14: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Time Series

If your data has a temporal component, be sure to exploit it

Data Mining 2011 - Volinsky - Columbia University 14

steady growth trend

New Year bumps

summer peaks

summer bifurcations in air travel (favor early/late)

Page 15: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Spatial Data

• If your data has a geographic component, be sure to exploit it

• Data from cities/states/zip cods – easy to get lat/long

• Can plot as scatterplot

Data Mining 2011 - Volinsky - Columbia University 15

Page 16: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Spatio-temporal data

• spatio-temporal data– http://projects.flowingdata.com/walmart/

(Nathan Yau)

– But, fancy tools not needed! Just do successive scatterplots to (almost) the same effect

16

Page 17: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Spatial data: choropleth Maps

• Maps using color shadings to represent numerical values are called chloropleth maps• http://elections.nytimes.com/2008/results/president/map.html

Data Mining 2011 - Volinsky - Columbia University 17

Page 18: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Two Continuous Variables

• For two numeric variables, the scatterplot is the obvious choice

interesting?

interesting?

18

Page 19: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

interesting?

interesting?

2D Scatterplots

• standard tool to display relation between 2 variables– e.g. y-axis = response,

x-axis = suspected indicator

• useful to answer:– x,y related?

• linear• quadratic• other

– variance(y) depend on x?

– outliers present?

Data Mining 2011 - Volinsky - Columbia University 19

Page 20: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Scatter Plot: No apparent relationship

20

Page 21: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Scatter Plot: Linear relationship

21

Page 22: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Scatter Plot: Quadratic relationship

22

Page 23: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Scatter plot: Homoscedastic

Why is this important in classical statistical modelling?

23

Page 24: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Scatter plot: Heteroscedastic

variation in Y differs depending on the value of Xe.g., Y = annual tax paid, X = income

24

Page 25: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Two variables - continuous

• Scatterplots – But can be bad with lots of data

25

Page 26: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

• What to do for large data sets– Contour plots

Two variables - continuous

26

Page 27: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Transparent plottingAlpha-blending:• plot( rnorm(1000), rnorm(1000), col="#0000ff22",

pch=16,cex=3)

27

Page 28: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Alpha blending

courtesy Simon Urbanekcourtesy Simon Urbanek

28

Page 29: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Jittering

• Jittering points helps too• plot(age, TimesPregnant)• plot(jitter(age),jitter(TimesPregnant)

29

Page 30: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Displaying Two Variables

• If one variable is categorical, use small multiples

• Many software packages have this implemented as ‘lattice’ or ‘trellis’ packages

30

library(‘lattice’)histogram(~DiastolicBP | TimesPregnant==0)library(‘lattice’)histogram(~DiastolicBP | TimesPregnant==0)

Page 31: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Two Variables - one categorical

• Side by side boxplots are very effective in showing differences in a quantitative variable across factor levels– tips data

• do men or women tip better

– orchard sprays• measuring potency of various orchard sprays in repelling

honeybees

31

Page 32: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Barcharts and Spineplots

stacked barcharts can be used to compare continuous values across two or more categorical ones.

spineplots show proportions well, but can be hard to interpret 32

orange=M blue=Forange=M blue=F

Page 33: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

More than two variables

Pairwise scatterplots

Can be somewhat ineffective for categorical data

33

Page 34: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University 34

Page 35: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Multivariate: More than two variables

• Get creative!• Conditioning on variables

– trellis or lattice plots– Cleveland models on human perception,

all based on conditioning– Infinite possibilities

• Earthquake data:– locations of 1000 seismic events of MB > 4.0.

The events occurred in a cube near Fiji since 1964

– Data collected on the severity of the earthquake

35

Page 36: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University 36

Page 37: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University 37

Page 38: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University 38

How many dimensions are represented here?

How many dimensions are represented here?

Andrew Gelman blog 7/15/2009Andrew Gelman blog 7/15/2009

Page 39: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

39

Multivariate Vis: Parallel Coordinates

Petal, a non-reproductive part of the flower

Sepal, a non-reproductive part of the flower

The famous iris data!

Data Mining 2011 - Volinsky - Columbia University

Page 40: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

40

Parallel Coordinates

Sepal Length

5.1

sepal length

sepal width

petal length

petal width

5.1 3.5 1.4 0.2

Data Mining 2011 - Volinsky - Columbia University

Page 41: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

41

Parallel Coordinates: 2 D

Sepal Length

5.1

Sepal Width

3.5

sepal length

sepal width

petal length

petal width

5.1 3.5 1.4 0.2

Data Mining 2011 - Volinsky - Columbia University

Page 42: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

42

Parallel Coordinates: 4 D

Sepal Length

5.1

Sepal Width

Petal length

Petal Width

3.5

sepal length

sepal width

petal length

petal width

5.1 3.5 1.4 0.2

1.4 0.2

Data Mining 2011 - Volinsky - Columbia University

Page 43: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

43

5.1

3.5

1.40.2

Parallel Visualization of Iris data

Data Mining 2011 - Volinsky - Columbia University

Page 44: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Multivariate: Parallel coordinates

Data Mining 2011 - Volinsky - Columbia University

Courtesy Unwin, Theus, Hofmann

Alpha blending can be effective

44

Page 45: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Parallel coordinates

• Useful in an interactive setting

Data Mining 2011 - Volinsky - Columbia University 45

Page 46: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Starplots

46

Page 47: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Using Icons to Encode Information, e.g., Star Plots

• Each star represents a single observation. Star plots are used to examine the relative values for a single data point

• The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables.

• Useful for small data sets with up to 10 or so variables

• Limitations?– Small data sets, small dimensions– Ordering of variables may affect

perception

1 Price 2 Mileage (MPG) 3 1978 Repair Record (1 = Worst, 5 =

Best) 4 1977 Repair Record (1 = Worst, 5 =

Best)

5 Headroom 6 Rear Seat Room 7 Trunk Space 8 Weight

9 Length

47

Page 48: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Chernoff’s Faces

• described by ten facial characteristic parameters: head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening

• Much derided in statistical circles

48

Page 49: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Chernoff faces

49

Page 50: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Mosaic Plots• generalization of spine plots for many categorical variables• sensitive to the order which they are applied

•Titanic Data:

50

Page 51: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Mosaic plots

Data Mining 2011 - Volinsky - Columbia University

Can be effective, but can get out of hand:

51

Page 52: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Networks and Graphs

• Visualizing networks is helpful, even if is not obvious that a network exists

52

Page 53: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Network Visualization

• Graphviz (open source software) is a nice layout tool for big and small graphs

Data Mining 2011 - Volinsky - Columbia University 53

Page 54: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

What’s missing?

• pie charts– very popular– good for showing simple relations of proportions– Human perception not good at comparing arcs– barplots, histograms usually better (but less pretty)

• 3D– nice to be able to show three dimensions– hard to do well– often done poorly– 3d best shown through “spinning” in 2D

• uses various types of projecting into 2D• http://www.stat.tamu.edu/~west/bradley/

54

Page 55: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University 55

Worst graphic in the world?

Page 56: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Data Mining 2011 - Volinsky - Columbia University

Dimension Reduction

• One way to visualize high dimensional data is to reduce it to 2 or 3 dimensions

– Variable selection• e.g. stepwise

– Principle Components• find linear projection onto p-space with maximal

variance

– Multi-dimensional scaling• takes a matrix of (dis)similarities and embeds the

points in p-dimensional space to retain those similaritiesMore on this in next Topic

56

Page 57: Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Visualization done right

• Hans Rosling @ TED

• http://www.youtube.com/watch?v=jbkSRLYSojo

Data Mining 2011 - Volinsky - Columbia University 57