Introduction to Big Data
Chapter 10 (Week 6) - Exploratory Data Analysis (Visualization)
DCCS208(02) Korea University 2019 Fall
Asst. Prof. Minseok
Contents
1. Exploratory Data Analysis
   Summary Statistic
   Visualization
2. Single Variable Visualization
   Diverse plots
3. Visualization for Two Variables
   Diverse plots
4. Visualization for More than Two Variables
   Diverse plots
01 Exploratory Data Analysis - Summary statistic & Visualization
copyrightⓒ 2018 All rights reserved by Korea University 4
EDA and Visualization - Exploratory Data Analysis
Exploratory Data Analysis (EDA) and Visualization are very important steps in any analysis task.
Get to know your data!
• distributions (symmetric, normal, skewed)
• data quality problems
• outliers
• correlations and inter-relationships
• subsets of interest
• suggest functional relationships
Sometimes EDA or visualization could be the goal!
Exploratory Data Analysis - Definition of EDA
• Goal: get a general sense of the data
  • means, medians, quantiles, histograms, boxplots
  • You should always look at every variable - you will learn something!
• Think interactive and visual
  • Humans are the best pattern recognizers
  • You can use more than 2 dimensions: x, y, z, space, color, time, ...
• Especially useful in early stages of data mining
  • Detect outliers (e.g. assess data quality)
  • Test assumptions (e.g. normal distributions or skewed?)
  • Identify useful raw data & transforms (e.g. log(x))
Bottom line: it is always well worth looking at your data!
Exploratory Data Analysis - Summary Statistic
Summary statistic is not visualization
Sample statistics of data X
• Mean: $\bar{x} = \sum_i X_i / n$
• Mode: most common value in X
• Median: X = sort(X), median = $X_{n/2}$ (half below, half above)
• Quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
• Interquartile range: value(Q3) - value(Q1)
• Range: max(X) - min(X) = $X_n - X_1$
• Variance: $\sigma^2 = \sum_i (X_i - \bar{x})^2 / n$
• Skewness: $\sum_i (X_i - \bar{x})^3 / \left[ \left( \sum_i (X_i - \bar{x})^2 \right)^{3/2} \right]$
  • Zero if symmetric; right-skewed more common
• Number of distinct values for a variable
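As a rough illustration, the statistics above can be computed with the Python standard library. This is a minimal sketch, not a reference implementation: the function name `summarize` and the sample data are hypothetical, the quartiles use the "inclusive" convention of `statistics.quantiles` (other conventions give slightly different values), and the skewness is the usual moment-based version, which equals the expression above multiplied by sqrt(n).

```python
import statistics


def summarize(values):
    """Return the summary statistics from the slide for a list of numbers."""
    xs = sorted(values)
    n = len(xs)
    mean = sum(xs) / n
    # Quartiles with the "inclusive" convention (an assumption; conventions vary).
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    ss2 = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    ss3 = sum((x - mean) ** 3 for x in xs)  # sum of cubed deviations
    variance = ss2 / n                      # population variance (divide by n)
    # Moment-based skewness; defined as 0.0 here when all values are equal.
    skewness = (ss3 / n) / (ss2 / n) ** 1.5 if ss2 > 0 else 0.0
    return {
        "mean": mean,
        "mode": statistics.mode(xs),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": xs[-1] - xs[0],
        "variance": variance,
        "skewness": skewness,
        "distinct": len(set(xs)),
    }


# Hypothetical right-skewed sample: one large value pulls the mean above the median.
sample = [1, 2, 2, 3, 12]
stats = summarize(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this sample the mean exceeds the median and the skewness is positive, which is the typical signature of a right-skewed variable.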
Exploratory Data Analysis - Information Visualization
Information visualization: concerned with data that does not have a well-defined representation in 2D or 3D space (i.e., “abstract data”).
Visualization: converting raw data to a form that is viewable and understandable to humans.
Visual Encoding Variables - Important components in visualization
Position, Length, Area, Volume, Value, Texture, Color, Shape, Transparency, Blur / Focus, ...
Information in Hue and Value - Important components in visualization
Value is perceived as ordered
• Encode ordinal variables (O)
• Encode continuous variables (Q)
Hue is normally perceived as unordered
• Encode nominal variables (N) using color
Bertin’s Levels of Organization - Important components in visualization
Effectiveness Ranking - Important components in visualization
By using these key elements of visualization, you can accelerate the transformation of information into knowledge.
02 Single Variable Visualization - Single!!!
Histogram - Single Variable Visualization
Shows center, variability, skewness, modality, outliers, or strange patterns.
Bin width and position matter
Beware of real zeros
Pictures of Data: Continuous Variables - How to make a Histogram
Consider the following data collected from the 1995 Statistical Abstracts of the United States
• For each of the 50 U.S. states, the proportion of individuals over 65 years of age has been recorded
Let’s find the Max and Min values
Break the data range into mutually exclusive, equally sized “bins”. This example uses bins 1% wide.
Let’s count the number of observations in each bin
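The steps above (find the min and max, cut the range into equal-width bins, count the observations per bin) can be sketched as follows. The state-level values are not reproduced in this transcript, so the numbers below are hypothetical percentages, and `bin_counts` is an illustrative helper name rather than anything from the slides.

```python
import math


def bin_counts(values, width):
    """Count observations in equal-width, half-open bins [lo, lo + width)."""
    lo, hi = min(values), max(values)
    # Anchor the first bin at a multiple of the bin width at or below the minimum.
    start = math.floor(lo / width) * width
    n_bins = int((hi - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    # Pair each bin's lower edge with its count.
    return [(start + i * width, c) for i, c in enumerate(counts)]


# Hypothetical "percent of residents over 65" values (not the real 1995 data).
percent_over_65 = [4.6, 9.9, 10.1, 11.4, 11.8, 12.5, 12.7, 13.3, 13.9, 18.6]
table = bin_counts(percent_over_65, width=1.0)
for lower_edge, count in table:
    # A text histogram: one '#' per observation in the bin.
    print(f"{lower_edge:5.1f}-{lower_edge + 1.0:5.1f} | {'#' * count}")
```

Printing one mark per observation already gives a sideways histogram, which makes isolated low and high values easy to spot.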
Pictures of Data: Continuous Variables - Drawing the histogram based on this information
Pictures of Data: Histograms - Another example
Suppose we have blood pressure data on a sample of 113 men
Sample mean ($\bar{x}$): 123.6 mmHg
Sample median (Med): 123.0 mmHg
Sample sd (s): 12.9 mmHg
Pictures of Data: Continuous Variables - Drawing the histogram based on this information
Pictures of Data: Continuous Variables - Different bin?
Importance of Intervals - Bin size
How many intervals (bins) should you have in a histogram?
• There is no perfect answer to this
• Depends on sample size n
• Rough rule of thumb: # Intervals ≈ $\sqrt{n}$
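Assuming the rule of thumb is the common square-root rule (number of intervals close to the square root of the sample size), a minimal sketch looks like this. The function name `rule_of_thumb_bins` is hypothetical, and rounding to the nearest integer is one reasonable choice among several.

```python
import math


def rule_of_thumb_bins(n):
    """Suggest a number of histogram intervals as roughly sqrt(n)."""
    # Always return at least one interval, even for tiny samples.
    return max(1, round(math.sqrt(n)))


# Sample sizes used in the examples: 50 states and 113 men.
for n in (50, 113):
    print(n, rule_of_thumb_bins(n))
```

The result is only a starting point; it is worth redrawing the histogram with a few nearby bin counts to see whether the apparent shape is stable.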
Issues with Histogram - Single Variable Visualization
For small data sets, histograms can be misleading.
• Small changes in the data, bins, or anchor can deceive
For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
Histograms effectively only work with 1 variable at a time.
Boxplots - Single Variable Visualization
Shows a lot of information about a variable in one plot
• Median
• IQR
• Outliers
• Range
• Skewness
Limitations
• Overplotting
• It is hard to tell distributional shape
• No standard implementation in software (many options for whiskers, outliers)
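To make the "many options for whiskers, outliers" point concrete, here is a minimal sketch of the numbers behind one boxplot. It assumes the widely used 1.5 x IQR fence convention and the "inclusive" quartile method of `statistics.quantiles`; neither choice comes from the slides, and the length-of-stay values are hypothetical.

```python
import statistics


def boxplot_stats(values, whisker=1.5):
    """Compute the quantities a boxplot draws, using 1.5 x IQR fences by default."""
    xs = sorted(values)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers extend to the most extreme observations still inside the fences.
    inside = [x for x in xs if low_fence <= x <= high_fence]
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical hospital length-of-stay values in days.
length_of_stay = [1, 2, 2, 3, 3, 4, 5, 6, 30]
box = boxplot_stats(length_of_stay)
print(box)
```

Changing `whisker` or the quartile method changes which points are flagged, which is why two tools can draw different boxplots from the same data.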
[Figure: boxplot annotated with the sample median, the 25th and 75th percentiles, and the largest and smallest observations]
Example) Hospital length of stay data - Boxplot
[Figure: boxplot of hospital length of stay, with large outliers marked]
Text cloud - Single Categorical Variable Visualization
Sequence Logo - Single Categorical Variable Visualization
Network plot between words - Single Categorical Variable Visualization
03 Visualization for Two Variables - Two variables
Scatterplots - For two continuous variables
Standard tool to display the relation between two continuous variables.
Useful to answer:
• Are X and Y related to each other?
  • Linear
  • Quadratic
  • Other
• Does the variance of Y depend on X?
• Are outliers present?
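One reason to plot rather than rely on a single number: a correlation coefficient only measures linear association. The sketch below computes Pearson's r by hand for a hypothetical linear and a hypothetical quadratic relationship; the function name `pearson_r` and the data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: the same X values with a linear and a quadratic response.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]
y_quadratic = [v * v for v in x]

r_linear = pearson_r(x, y_linear)
r_quadratic = pearson_r(x, y_quadratic)
print("linear:", r_linear)
print("quadratic:", r_quadratic)
```

The quadratic case has a perfectly deterministic relationship, yet its correlation over this symmetric range is zero; a scatterplot would reveal the curve immediately.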
Is there a relationship between X and Y variables?
Variation in Y differs depending on the value of X.
Limitation
It is very difficult to represent a lot of data at once.
Contour plots - For two continuous variables (Large scale data)
Contour plots are great for representing relationships between two continuous variables.
A contour plot doesn’t give you the exact location of each value, but you can see the relationship and density between two variables.
Two techniques in visualization - For large scale data
Transparent plotting
Jittering
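Jittering means adding a small random offset to each point so that observations sharing the same value no longer sit exactly on top of each other. Below is a minimal sketch assuming uniform noise; the function name `jitter`, the noise amount, and the ratings data are hypothetical. Transparent plotting is usually a single opacity (alpha) setting in the plotting library and is not shown here.

```python
import random


def jitter(values, amount, seed=0):
    """Return a copy of values with uniform noise in [-amount, +amount] added."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical ratings on a 1-5 scale: many points share identical coordinates.
ratings = [1, 1, 1, 2, 2, 3, 3, 3, 3, 5]
jittered = jitter(ratings, amount=0.2)
for original, moved in zip(ratings, jittered):
    print(original, round(moved, 3))
```

The noise amount should stay well below the spacing between distinct values, otherwise neighboring groups start to blur together.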
Histogram with different color - One continuous and one categorical variable
If one variable is categorical, we can use small multiples.
‘Color’ and ‘Shape’ can visually represent different categories!
Side-by-side boxplot - One continuous and one categorical variable
Boxplots can likewise represent each level of a single categorical variable as an independent box.
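Behind a side-by-side boxplot is a simple grouping step: split the continuous values by category, then summarize each group separately. The sketch below shows that step with hypothetical data and a hypothetical helper name, `summarize_by_category`; a real boxplot would also draw quartiles and whiskers for each group.

```python
import statistics
from collections import defaultdict


def summarize_by_category(pairs):
    """Group (category, value) pairs and summarize each group's values."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    # One summary per category, in sorted category order.
    return {
        category: {
            "n": len(values),
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
        for category, values in sorted(groups.items())
    }


# Hypothetical observations: (group label, measured value).
observations = [("A", 3.0), ("B", 7.5), ("A", 4.0), ("B", 6.5), ("A", 5.0), ("C", 1.0)]
by_group = summarize_by_category(observations)
for category, summary in by_group.items():
    print(category, summary)
```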
Barcharts and Spineplots - One continuous and one categorical variable
Stacked barcharts can be used to compare continuous values across two or more categorical ones.
Pie charts - One continuous and one categorical variable
A very popular way to visualize data.
This is good for showing simple relations of proportions.
Barplots and histograms are usually better (but less pretty).
04 Visualization - More than two variables
Pairwise scatterplots - More than two variables
Pairwise scatterplots can represent multiple variables at once.
However, there can be difficulties in expressing categorical variables.
Multivariate visualization - More than two variables
Creative thinking will be required to visualize multiple variables at the same time.
• Conditioning on variables
• Trellis or lattice plots
• Different colors and shapes
• Infinite possibilities
Simple question - More than two variables
How many dimensions are represented here?
Parallel Coordinate plot - More than two variables
Networks plot - More than two variables
Visualizing networks is helpful, even if it is not obvious that a network exists.
Heatmap - More than two variables
Heatmaps are one of the most widely used visualization methods.
Interactivity - An important visualization approach these days
As the World Wide Web becomes more common, the importance of interaction is growing in the field of visualization.
Demo
Multi-dimensional plot - More than two variables
One variable represents one dimension.
To represent three variables at once, three-dimensional space is required.
A multi-dimensional plot looks fancy.
But there is a practical issue due to the ‘viewpoint’.
We end up visualizing in 2D spaces like paper or a web page.
https://plot.ly/r/3d-scatter-plots/
Dimension reduction - Visualization
What if you need to visualize more than 1000 variables?
One possible way to visualize high-dimensional data is to reduce it to 2 or 3 dimensions.
Variable selection (Feature selection)
Principal Components Analysis (PCA)
Multi-dimensional scaling
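As a preview of the principal components idea, the sketch below finds only the first principal component of a small data set using power iteration on the covariance matrix, in pure Python. This is an illustrative assumption about how one might compute it, not the method from the lecture; the function name `first_principal_component` and the data are hypothetical, and a real analysis would normally use a numerical library and keep 2 or 3 components.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix (d x d), dividing by n.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    # Start from a unit vector; assumes it is not orthogonal to the top component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance left in this direction
            break
        v = [x / norm for x in w]
    # Project each centered observation onto the component: one number per row.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical 3-variable data in which every variable is a multiple of the first.
data = [(x, 2 * x, -x) for x in range(5)]
direction, scores = first_principal_component(data)
print("direction:", [round(c, 4) for c in direction])
print("scores:", [round(s, 4) for s in scores])
```

Here three variables collapse to a single score per observation without losing information, because the hypothetical data lie exactly on one line; real data keep only the directions with the most variance.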
This is the next topic!
End of Slide