Exploratory Data Analysis and Data Visualization
Credits: Chris Volinsky - Columbia University
Outline
• EDA
• Visualization
  – One variable
  – Two variables
  – More than two variables
  – Other types of data
  – Dimension reduction
EDA and Visualization
• Exploratory Data Analysis (EDA) and Visualization are very important steps in any analysis task.
• Get to know your data!
  – distributions (symmetric, normal, skewed)
  – data quality problems
  – outliers
  – correlations and inter-relationships
  – subsets of interest
  – suggest functional relationships
• Sometimes EDA or viz might be the goal!
Data Visualization – cake bakery
Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
  – means, medians, quantiles, histograms, boxplots
• You should always look at every variable - you will learn something!
• Data-driven (model-free)
• Think interactive and visual
  – Humans are the best pattern recognizers
  – You can use more than 2 dimensions!
    • x, y, z, space, color, time…
• Especially useful in early stages of data mining
  – detect outliers (e.g. assess data quality)
  – test assumptions (e.g. normal distributions or skewed?)
  – identify useful raw data & transforms (e.g. log(x))
• Bottom line: it is always well worth looking at your data!
Summary Statistics
• not visual
• sample statistics of data X
  – mean: $\mu = \sum_i X_i / n$
  – mode: most common value in X
  – median: X = sort(X), median = $X_{n/2}$ (half below, half above)
  – quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
    • interquartile range: value(Q3) - value(Q1)
    • range: max(X) - min(X) = $X_n - X_1$
  – variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
  – skewness: $\sum_i (X_i - \mu)^3 / \left[ \left( \sum_i (X_i - \mu)^2 \right)^{3/2} \right]$
    • zero if symmetric; right-skewed more common (what kind of data is right skewed?)
  – number of distinct values for a variable (see unique() in R)
  – Don't need to report all of these. Bottom line: do these numbers make sense?
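As a rough illustration of the statistics listed above, the following R sketch computes each of them on a simulated right-skewed sample. The data and variable names are made up for the example. Note that R's built-in var() divides by n - 1, so the divisor-n variance and the skewness form from the slide are computed by hand.

```r
# Toy right-skewed sample (simulated; not from the slides)
set.seed(42)
x <- rexp(500, rate = 1)
n <- length(x)

mu   <- sum(x) / n                                 # mean
med  <- median(x)                                  # middle of the sorted values
q    <- quantile(x, c(0.25, 0.75))                 # Q1 and Q3
iqr  <- unname(q[2] - q[1])                        # interquartile range
rng  <- max(x) - min(x)                            # range
sig2 <- sum((x - mu)^2) / n                        # variance with divisor n, as on the slide
skew <- sum((x - mu)^3) / sum((x - mu)^2)^(3 / 2)  # skewness in the slide's form
n_distinct <- length(unique(x))                    # number of distinct values

print(c(mean = mu, median = med, IQR = iqr, range = rng,
        variance = sig2, skewness = skew))
```

For a right-skewed sample like this one, the mean should sit above the median and the skewness should be positive, which is a quick sanity check on the numbers.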
Single Variable Visualization
• Histogram:
  – Shows center, variability, skewness, modality, outliers, or strange patterns.
  – Bin width and position matter
  – Beware of real zeros
Issues with Histograms
• For small data sets, histograms can be misleading.
  – Small changes in the data, bins, or anchor can deceive
• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
• Histograms effectively only work with 1 variable at a time
  – But ‘small multiples’ can be effective
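To see how much bin width and anchor can matter on a small data set, here is a minimal R sketch. The sample is simulated and the bin choices are arbitrary assumptions picked for illustration; the same 30 values are drawn three times with different breaks.

```r
set.seed(7)
x <- round(rnorm(30, mean = 50, sd = 10))   # small simulated sample

lo <- 10 * floor(min(x) / 10)               # round the data range out to multiples of 10
hi <- 10 * ceiling(max(x) / 10)

op <- par(mfrow = c(1, 3))
h1 <- hist(x, breaks = seq(lo, hi, by = 10),         main = "width 10")
h2 <- hist(x, breaks = seq(lo - 5, hi + 5, by = 10), main = "width 10, anchor shifted by 5")
h3 <- hist(x, breaks = seq(lo, hi, by = 5),          main = "width 5")
par(op)
```

All three panels contain exactly the same observations, so any difference in apparent shape comes from the binning alone.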
But be careful with axes and scales!
Smoothed Histograms - Density Estimates
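• Kernel estimates smooth out the contribution of each datapoint over a local neighborhood of that point.
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
h is the kernel width
• Gaussian kernel is common:
$$K\left(\frac{x - x_i}{h}\right) = C e^{-\frac{1}{2}\left(\frac{x - x_i}{h}\right)^2}$$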
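The kernel estimate can be written out directly in a few lines of R. This is only a sketch on simulated data: the helper name kde and the bandwidth 0.4 are assumptions for the example, and the Gaussian kernel is taken to be the standard normal density (so C = 1/sqrt(2*pi)). R's built-in density() is overlaid for comparison.

```r
set.seed(1)
x <- rnorm(200)                      # simulated sample (not from the slides)

# f_hat(x0) = 1/(n h) * sum_i K((x0 - x_i) / h), with K the standard normal density
kde <- function(x0, data, h) {
  sapply(x0, function(p) sum(dnorm((p - data) / h)) / (length(data) * h))
}

grid  <- seq(-5, 5, length.out = 501)
f_hat <- kde(grid, x, h = 0.4)

plot(grid, f_hat, type = "l", xlab = "x", ylab = "estimated density")
rug(x)                               # tick marks at the raw data points
lines(density(x, bw = 0.4), lty = 2) # built-in estimator for comparison
```

The hand-written curve and the dashed density() curve should be nearly indistinguishable, since density() with a Gaussian kernel treats bw as the kernel's standard deviation.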
Bandwidth choice is an art
Usually want to try several
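One way to try several bandwidths is simply to overlay the estimates. The sketch below uses a simulated bimodal sample and three bandwidth values that are assumptions chosen to look too small, moderate, and too large for this particular data.

```r
set.seed(3)
x <- c(rnorm(150, -2, 0.6), rnorm(150, 2, 0.6))  # simulated bimodal sample
bws <- c(0.1, 0.5, 2)                            # assumed: too small, moderate, too large

fits <- lapply(bws, function(h) density(x, bw = h))
ymax <- max(sapply(fits, function(f) max(f$y)))

plot(fits[[2]], main = "Same data, three bandwidths", ylim = c(0, ymax))
lines(fits[[1]], lty = 2)
lines(fits[[3]], lty = 3)
legend("topright", legend = paste("bw =", bws), lty = c(2, 1, 3))
```

The smallest bandwidth tends to produce a spiky curve that chases individual points, while the largest tends to blur the two groups together.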
Boxplots
• Shows a lot of information about a variable in one plot
  – Median
  – IQR
  – Outliers
  – Range
  – Skewness
• Negatives
  – Overplotting
  – Hard to tell distributional shape
  – No standard implementation in software (many options for whiskers, outliers)
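The point about whisker and outlier conventions can be seen with R's boxplot.stats(), whose coef argument sets how far the whiskers may extend in units of the IQR. The data below are simulated with two planted extreme values, and coef = 3 is just one hypothetical alternative to R's default of 1.5.

```r
set.seed(11)
x <- c(rnorm(100), 4.5, 6)           # simulated data with two planted extreme values

s15 <- boxplot.stats(x, coef = 1.5)  # R's default whisker rule
s30 <- boxplot.stats(x, coef = 3)    # a wider, hypothetical alternative

s15$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s15$out     # points flagged beyond the 1.5 * IQR whiskers
s30$out     # points flagged with the wider whiskers

boxplot(x, range = 1.5, horizontal = TRUE)  # 'range' plays the role of coef here
```

Which points count as outliers depends on this convention, so it is worth stating the rule used when sharing a boxplot.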
Time Series
If your data has a temporal component, be sure to exploit it
steady growth trend
New Year bumps
summer peaks
summer bifurcations in air travel (favor early/late)
Time-Series Example 3
Scotland experiment: "milk in kid diet → better health"?
20,000 kids: 5k raw milk, 5k pasteurized milk, 10k control (no supplement)
Plot: mean weight vs. mean age for the 10k control group
Would expect a smooth weight growth plot.
Visually reveals an unexpected pattern (steps), not apparent from the raw data table.
Possible explanations:
• Grow less early in the year than later?
• No steps in height plots; so why would height grow uniformly but weight in spurts?
• Kids weighed in clothes: summer garb lighter than winter?
Spatial Data
• If your data has a geographic component, be sure to exploit it
• Data from cities/states/zip codes – easy to get lat/long
• Can plot as scatterplot
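A minimal sketch of plotting geographic data as a scatterplot, using the state.center, state.x77, and state.abb objects that ship with R's datasets package. The choice of these data and the sizing of points by population are assumptions for the example, not something taken from the slides.

```r
# state.center, state.x77 and state.abb ship with R's datasets package
pop <- state.x77[, "Population"]           # state population estimates

plot(state.center$x, state.center$y,
     cex = sqrt(pop / max(pop)) * 4,       # symbol area roughly proportional to population
     pch = 16, col = "#0000ff55",
     xlab = "longitude", ylab = "latitude",
     main = "US state centers sized by population")
text(state.center$x, state.center$y, state.abb, cex = 0.6)
```

Even without a map outline, the rough shape of the country emerges from the lat/long scatter alone.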
Spatial data: choropleth Maps
• Maps using color shadings to represent numerical values are called choropleth maps
• http://elections.nytimes.com/2008/results/president/map.html
Two Continuous Variables
• For two numeric variables, the scatterplot is the obvious choice
interesting?
interesting?
2D Scatterplots
• standard tool to display relation between 2 variables
  – e.g. y-axis = response, x-axis = suspected indicator
• useful to answer:
  – x, y related?
    • linear
    • quadratic
    • other
  – variance(y) depend on x?
  – outliers present?
Scatter Plot: No apparent relationship
Scatter Plot: Linear relationship
Scatter Plot: Quadratic relationship
Scatter plot: Homoscedastic
Why is this important in classical statistical modelling?
Scatter plot: Heteroscedastic
variation in Y differs depending on the value of X
e.g., Y = annual tax paid, X = income
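The contrast between constant and non-constant spread is easy to simulate. In this sketch the linear trend, noise levels, and the cutoff at x = 5.5 are all arbitrary assumptions; the two panels share the same x values and trend and differ only in how the noise behaves.

```r
set.seed(5)
x <- runif(300, 1, 10)
y_homo   <- 2 * x + rnorm(300, sd = 2)        # constant noise level
y_hetero <- 2 * x + rnorm(300, sd = 0.5 * x)  # noise grows with x

op <- par(mfrow = c(1, 2))
plot(x, y_homo,   main = "Homoscedastic")
plot(x, y_hetero, main = "Heteroscedastic")
par(op)

# Crude numeric check: residual spread for small x (FALSE) vs large x (TRUE)
spread <- function(y) tapply(y - 2 * x, x > 5.5, sd)
spread(y_homo)
spread(y_hetero)
```

In the heteroscedastic panel the cloud fans out to the right, and the residual standard deviation for large x comes out clearly bigger than for small x.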
Two variables - continuous
• Scatterplots – But can be bad with lots of data
Two variables - continuous
• What to do for large data sets
  – Contour plots
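One possible way to draw a contour plot for a large scatter is to estimate a 2-D density first. The sketch below assumes the MASS package (bundled with standard R distributions) and its kde2d() function; the data are simulated and the 50 x 50 grid is an arbitrary choice.

```r
library(MASS)   # provides kde2d(); MASS ships with standard R distributions

set.seed(9)
x <- rnorm(20000)
y <- 0.6 * x + rnorm(20000, sd = 0.8)   # simulated correlated pair

dens <- kde2d(x, y, n = 50)             # 2-D kernel density estimate on a 50 x 50 grid

op <- par(mfrow = c(1, 2))
plot(x, y, pch = ".", main = "Raw scatterplot")
contour(dens, main = "Density contours", xlab = "x", ylab = "y")
par(op)
```

With 20,000 points the raw scatter is mostly a solid blob, while the contours still show where the data concentrate.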
Transparent plotting
Alpha-blending:
• plot(rnorm(1000), rnorm(1000), col="#0000ff22", pch=16, cex=3)
Jittering
• Jittering points helps too
• plot(age, TimesPregnant)
• plot(jitter(age), jitter(TimesPregnant))
Displaying Two Variables
• If one variable is categorical, use small multiples
• Many software packages have this implemented as ‘lattice’ or ‘trellis’ packages
library('lattice')
histogram(~DiastolicBP | TimesPregnant==0)
Two Variables - one categorical
• Side by side boxplots are very effective in showing differences in a quantitative variable across factor levels
  – tips data
    • do men or women tip better?
  – orchard sprays
    • measuring potency of various orchard sprays in repelling honeybees
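A side-by-side boxplot of the orchard sprays example might look like the following. It assumes the OrchardSprays data frame from R's datasets package corresponds to the data on the slide; the log scale on the y-axis is a choice made here because the response is strongly right-skewed.

```r
# OrchardSprays ships with R: 'decrease' is the response, 'treatment' has 8 levels (A-H)
boxplot(decrease ~ treatment, data = OrchardSprays,
        log = "y",                    # assumed choice: the response is right-skewed
        xlab = "treatment", ylab = "decrease (log scale)")

# Numeric companion to the plot: per-treatment medians
med <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, median)
sort(med)
```

Reading the boxes left to right gives a quick comparison of both the typical response and its spread for each treatment.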
Barcharts and Spineplots
stacked barcharts can be used to compare continuous values across two or more categorical ones.
spineplots show proportions well, but can be hard to interpret
orange=M blue=F
More than two variables
Pairwise scatterplots
Can be somewhat ineffective for categorical data
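As a small illustration of pairwise scatterplots, the sketch below uses R's built-in iris data and the base pairs() function. Coloring the points by species is one assumed way to bring a categorical variable into the display; the colors are arbitrary.

```r
# iris ships with R; color points by species so the categorical variable is visible too
pairs(iris[, 1:4],
      col = c("red", "darkgreen", "blue")[as.integer(iris$Species)],
      pch = 16, cex = 0.6,
      main = "Pairwise scatterplots of the iris measurements")

# Numeric companion: pairwise correlations of the same four columns
cors <- round(cor(iris[, 1:4]), 2)
cors
```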
Multivariate: More than two variables
• Get creative!
• Conditioning on variables
  – trellis or lattice plots
  – Cleveland models on human perception, all based on conditioning
  – Infinite possibilities
• Earthquake data:
  – locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964
  – Data collected on the severity of the earthquake
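Conditioning can be sketched with the lattice package and R's built-in quakes data, which appears to match the earthquake description above (1000 events near Fiji). Splitting depth into four non-overlapping bands is an assumption made for this example.

```r
library(lattice)   # recommended package bundled with standard R installs

# quakes: 1000 events near Fiji with columns lat, long, depth, mag, stations
depth_band <- equal.count(quakes$depth, number = 4, overlap = 0)  # four depth bands (a shingle)

p <- xyplot(lat ~ long | depth_band, data = quakes,
            pch = ".", cex = 2,
            xlab = "longitude", ylab = "latitude",
            main = "Fiji earthquakes, conditioned on depth")
print(p)   # lattice plots are drawn when printed
```

Each panel shows the same map region for one depth band, so changes in the spatial pattern with depth can be compared panel by panel.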
How many dimensions are represented here?
Andrew Gelman blog 7/15/2009
Multivariate Vis: Parallel Coordinates
Petal, a non-reproductive part of the flower
Sepal, a non-reproductive part of the flower
The famous iris data!
Parallel Coordinates
Example observation (first row of the iris data):
sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
[Figure: a single axis, Sepal Length, with the value 5.1 marked]
Parallel Coordinates: 2 D
[Figure: two parallel axes, Sepal Length (5.1) and Sepal Width (3.5), for the observation 5.1, 3.5, 1.4, 0.2]
Parallel Coordinates: 4 D
[Figure: four parallel axes, Sepal Length (5.1), Sepal Width (3.5), Petal Length (1.4), Petal Width (0.2), for the observation 5.1, 3.5, 1.4, 0.2]
Parallel Visualization of Iris data
[Figure: parallel coordinates plot of the iris data; labeled values 5.1, 3.5, 1.4, 0.2]
Multivariate: Parallel coordinates
Courtesy Unwin, Theus, Hofmann
Alpha blending can be effective
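A parallel coordinates plot with alpha blending can be sketched in R as follows. This assumes the parcoord() function from the MASS package and the built-in iris data; the species colors and the transparency level (alpha.f = 0.4) are arbitrary choices.

```r
library(MASS)   # provides parcoord(); MASS ships with standard R distributions

# One semi-transparent color per species (alpha.f = 0.4 is an arbitrary choice)
base_cols <- c(setosa = "red", versicolor = "darkgreen", virginica = "blue")
line_cols <- adjustcolor(base_cols[as.character(iris$Species)], alpha.f = 0.4)

parcoord(iris[, 1:4], col = line_cols)   # one line per flower, one axis per variable
title(main = "Iris data in parallel coordinates")
```

With transparency, regions where many lines overlap appear darker, which makes dense bundles easier to see than with opaque lines.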
Parallel coordinates
• Useful in an interactive setting
Networks and Graphs
• Visualizing networks is helpful, even if it is not obvious that a network exists
Network Visualization
• Graphviz (open source software) is a nice layout tool for big and small graphs
What’s missing?
• pie charts
  – very popular
  – good for showing simple relations of proportions
  – Human perception not good at comparing arcs
  – barplots, histograms usually better (but less pretty)
• 3D
  – nice to be able to show three dimensions
  – hard to do well
  – often done poorly
  – 3D best shown through “spinning” in 2D
    • uses various types of projecting into 2D
    • http://www.stat.tamu.edu/~west/bradley/
Worst graphic in the world?
Dimension Reduction
• One way to visualize high dimensional data is to reduce it to 2 or 3 dimensions
  – Variable selection
    • e.g. stepwise
  – Principal Components
    • find linear projection onto p-space with maximal variance
  – Multi-dimensional scaling
    • takes a matrix of (dis)similarities and embeds the points in p-dimensional space to retain those similarities
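Two of the approaches above can be tried side by side in base R. The sketch below assumes the built-in iris measurements as the high-dimensional data, uses prcomp() for principal components and cmdscale() for classical multi-dimensional scaling on Euclidean distances, and reduces to 2 dimensions in both cases.

```r
X <- as.matrix(iris[, 1:4])

pca <- prcomp(X, center = TRUE, scale. = FALSE)  # principal components
pc_scores <- pca$x[, 1:2]                        # projection onto the top 2 components

mds <- cmdscale(dist(X), k = 2)                  # classical MDS on Euclidean distances

op <- par(mfrow = c(1, 2))
plot(pc_scores, col = as.integer(iris$Species), main = "PCA (2 components)")
plot(mds, col = as.integer(iris$Species), main = "Classical MDS (k = 2)",
     xlab = "coordinate 1", ylab = "coordinate 2")
par(op)

summary(pca)$importance["Proportion of Variance", 1:2]
```

With Euclidean distances, classical MDS recovers the same configuration as the principal component scores up to a sign flip of each axis, so the two panels should look like mirror images at most. Other dissimilarities would generally give different pictures.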
More on this in next Topic
Visualization done right
• Hans Rosling @ TED
• http://www.youtube.com/watch?v=jbkSRLYSojo