Introduction to Big Data - Harvard University… · Introduction to Big Data Chapter 10 (Week 6)...

Introduction to Big Data Chapter 10 (Week 6) Exploratory Data Analysis (Visualization) DCCS208(02) Korea University 2019 Fall Asst. Prof. Minseok Seo [email protected]

Introduction to Big Data

Chapter 10 (Week 6): Exploratory Data Analysis (Visualization)

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seo [email protected]

Contents

1. Exploratory Data Analysis - Summary Statistic & Visualization
2. Single Variable Visualization - Diverse plots
3. Visualization for Two Variables - Diverse plots
4. Visualization for More than Two Variables - Diverse plots

01 Exploratory Data Analysis - Summary statistic & Visualization

Copyright ⓒ 2018 All rights reserved by Korea University

EDA and Visualization - Exploratory Data Analysis

Exploratory Data Analysis (EDA) and Visualization are very important steps in any analysis task.

Get to know your data!
• Distributions (symmetric, normal, skewed)
• Data quality problems
• Outliers
• Correlations and inter-relationships
• Subsets of interest
• Suggest functional relationships

Sometimes EDA or visualization could be the goal!

Exploratory Data Analysis - Definition of EDA

Goal: Get a general sense of the data
• Means, medians, quantiles, histograms, boxplots
• You should always look at every variable; you will learn something!

Think interactive and visual
• Humans are the best pattern recognizers
• You can use more than 2 dimensions: x, y, z, space, color, time, ...

Especially useful in early stages of data mining
• Detect outliers (e.g. assess data quality)
• Test assumptions (e.g. normal distributions or skewed?)
• Identify useful raw data & transforms (e.g. log(x))

Bottom line: it is always well worth looking at your data!

Exploratory Data Analysis - Summary Statistic

Summary statistics are not visualization.

Sample statistics of data X:
• Mean: $\bar{x} = \frac{1}{n} \sum_i X_i$
• Mode: most common value in X
• Median: X = sort(X), median $= X_{n/2}$ (half below, half above)
• Quartiles of sorted X: Q1 value $= X_{0.25n}$, Q3 value $= X_{0.75n}$
• Interquartile range: value(Q3) - value(Q1)
• Range: $\max(X) - \min(X) = X_n - X_1$ (for sorted X)
• Variance: $\sigma^2 = \frac{1}{n} \sum_i (X_i - \bar{x})^2$
• Skewness: $\frac{1}{n} \sum_i (X_i - \bar{x})^3 / \sigma^3$, which is zero if symmetric; right-skewed is more common
• Number of distinct values for a variable
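The sample statistics on this slide can be computed directly. The following is a minimal Python sketch and is not part of the original slides: the function name `summary_statistics` and the example data are hypothetical. It assumes the simple index rule for quartiles (statistical software usually interpolates and may give slightly different values) and computes skewness as the third central moment divided by the variance to the power 3/2.

```python
import math
from collections import Counter


def summary_statistics(values):
    """Compute the sample statistics listed on the slide for a list of numbers."""
    x = sorted(values)
    n = len(x)
    mean = sum(x) / n

    def order_stat(p):
        # Simple index rule: the value at position p*n of the sorted data
        # (1-indexed, rounded up). Software often interpolates instead.
        return x[max(0, math.ceil(p * n) - 1)]

    # Median: middle value, or the average of the two middle values if n is even.
    mid = n // 2
    median = x[mid] if n % 2 == 1 else (x[mid - 1] + x[mid]) / 2

    q1, q3 = order_stat(0.25), order_stat(0.75)
    # Central moments, dividing by n.
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0

    return {
        "mean": mean,
        "mode": Counter(x).most_common(1)[0][0],
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": x[-1] - x[0],
        "variance": m2,
        "skewness": skewness,
        "distinct": len(set(x)),
    }


# Hypothetical example data (not from the slides).
stats = summary_statistics([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in stats.items():
    print(f"{name:>9}: {value}")
```

A positive skewness, as in this example, indicates a longer right tail; a symmetric sample gives zero.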

Exploratory Data Analysis - Information Visualization

Information visualization: concerned with data that does not have a well-defined representation in 2D or 3D space (i.e., “abstract data”).

Visualization: converting raw data to a form that is viewable and understandable to humans.

Visual Encoding Variables - Important components in visualization

Position, Length, Area, Volume, Value, Texture, Color, Shape, Transparency, Blur / Focus, ...

Information in Hue and Value - Important components in visualization

Value is perceived as ordered
• Encode ordinal variables (O)
• Encode continuous variables (Q)

Hue is normally perceived as unordered
• Encode nominal variables (N) using color

Bertin’s Levels of Organization - Important components in visualization

Effectiveness Ranking - Important components in visualization

By using these key elements of visualization, you can accelerate the transformation of information into knowledge.

02 Single Variable Visualization - Single!!!

Histogram - Single Variable Visualization

Shows center, variability, skewness, modality, outliers, or strange patterns.

Bin width and position matter

Beware of real zeros

Pictures of Data: Continuous Variables - How to make a Histogram

Consider the following data collected from the 1995 Statistical Abstracts of the United States

• For each of the 50 United States, the proportion of individuals over 65 years of age has been recorded

Let’s find out Max and Min values

Break the data range into mutually exclusive, equally sized “bins”. This example uses bins 1% wide.

Let’s count the number of observations in each bin
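The three steps described here (find the minimum and maximum, split the range into equal-width bins, count the observations per bin) can be sketched in Python. This is an illustration, not the original slide material: the function name `histogram_counts` is hypothetical, and the percentages are made-up placeholder values, not the figures from the Statistical Abstracts.

```python
import math


def histogram_counts(values, bin_width):
    """Count observations in equal-width, mutually exclusive bins.

    Each bin is the half-open interval [start, start + bin_width).
    The first bin starts at the minimum rounded down to a multiple of bin_width.
    """
    first = math.floor(min(values) / bin_width)   # index of the lowest bin
    last = math.floor(max(values) / bin_width)    # index of the highest bin
    counts = [0] * (last - first + 1)
    for v in values:
        counts[math.floor(v / bin_width) - first] += 1
    return first * bin_width, counts


# Made-up percentages for illustration only (NOT the real state data).
pct_over_65 = [10.2, 11.5, 11.9, 12.1, 12.4, 12.8, 13.0, 13.3, 13.7, 14.1, 15.6]

start, counts = histogram_counts(pct_over_65, bin_width=1.0)
for i, c in enumerate(counts):
    lo = start + i * 1.0
    # One '#' per observation gives a rough text histogram.
    print(f"[{lo:4.1f}, {lo + 1.0:4.1f})  {'#' * c}")
```

Changing `bin_width` or the anchor of the first bin changes the counts, which is why bin width and position matter.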

Pictures of Data: Continuous Variables - Drawing the histogram based on this information

Pictures of Data: Histograms - Another example

Suppose we have blood pressure data on a sample of 113 men

Sample mean ($\bar{x}$): 123.6 mmHg

Sample Median (Med): 123.0 mmHg

Sample sd (s): 12.9 mmHg

Pictures of Data: Continuous Variables - Drawing the histogram based on this information

Pictures of Data: Continuous Variables - Different bin?

Importance of Intervals - Bin size

How many intervals (bins) should you have in a histogram?

• There is no perfect answer to this

• Depends on sample size n

• Rough rule of thumb: # Intervals ≈ $\sqrt{n}$
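As a worked instance of the rule of thumb, reading it as "number of intervals ≈ square root of the sample size n", a short Python sketch follows. The function name `suggested_bin_count` is hypothetical, and rounding to the nearest whole number is an assumption; the rule itself is only a rough guide.

```python
import math


def suggested_bin_count(n):
    """Rough rule of thumb: number of intervals is about sqrt(n)."""
    return max(1, round(math.sqrt(n)))


# Sample sizes mentioned in the slides: 50 states and 113 men.
for n in (50, 113):
    print(f"n = {n:3d} -> about {suggested_bin_count(n)} intervals")
```

For the 113 blood pressure measurements this suggests about 11 intervals, and about 7 for the 50 states.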

Issues with Histogram - Single Variable Visualization

For small data sets, histograms can be misleading: small changes in the data, bins, or anchor can deceive.

For large data sets, histograms can be quite effective at illustrating general properties of the distribution.

Histograms effectively only work with 1 variable at a time.

Boxplots - Single Variable Visualization

Shows a lot of information about a variable in one plot
• Median
• IQR
• Outliers
• Range
• Skewness

Limitations
• Overplotting
• It is hard to tell distributional shape
• No standard implementation in software (many options for whiskers, outliers)
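The quantities a boxplot draws can be computed as in the Python sketch below. Because there is no standard implementation, the choices here are assumptions rather than the method used for the slides: quartiles come from `statistics.quantiles` with the "inclusive" method, and whiskers follow the common 1.5 × IQR convention. The function name `boxplot_stats` and the data are hypothetical.

```python
import statistics


def boxplot_stats(values, whisker=1.5):
    """Return the numbers needed to draw one boxplot."""
    x = sorted(values)
    # Quartile definitions differ between tools; "inclusive" is one option.
    q1, median, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in x if low_fence <= v <= high_fence]
    outliers = [v for v in x if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],     # smallest non-outlying observation
        "whisker_high": inside[-1],   # largest non-outlying observation
        "outliers": outliers,
    }


# Made-up lengths of stay in days, for illustration only.
stays = [1, 2, 2, 3, 3, 4, 5, 6, 30]
print(boxplot_stats(stays))
```

Changing `whisker` or the quantile method changes which points are flagged, which is one reason boxplots from different tools can disagree.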

[Figure: boxplot annotated with the sample median, the 75th and 25th percentiles, and the largest and smallest observations]

Example: Hospital length of stay data - Boxplot

[Figure: boxplot of the hospital length of stay data, with the large outliers labeled]

Text cloud - Single Categorical Variable Visualization

Sequence Logo - Single Categorical Variable Visualization

Network plot between words - Single Categorical Variable Visualization

03 Visualization for Two Variables - Two variables

Scatterplots - For two continuous variables

Standard tool to display the relation between two continuous variables.

Useful to answer:
• Are X and Y related to each other? (Linear, quadratic, other)
• Does the variance of Y depend on X?
• Are outliers present?
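A numeric companion to these questions is the Pearson correlation coefficient, which measures only linear association. The Python sketch below is an added illustration, not part of the slides, and the function name `pearson_r` and the data are hypothetical. It shows why the scatterplot is still needed: a perfect quadratic relation can have a correlation of zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the centered values.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]     # exact linear relation
quadratic = [x ** 2 for x in xs]     # exact quadratic relation

print("linear   :", pearson_r(xs, linear))
print("quadratic:", pearson_r(xs, quadratic))
```

The quadratic case is strongly related yet uncorrelated, which a scatterplot reveals immediately and a single number hides.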

Is there a relationship between X and Y variables?

Variation in Y differs depending on the value of X.

Limitation

It is very difficult to represent a lot of data at once.

Contour plots - For two continuous variables (Large scale data)

Contour plots are great for representing relationships between two continuous variables.

It doesn’t give you the exact location of each value, but you can see the relationship and density between two variables.
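One simple way to obtain the density that such a plot summarizes is to count points per grid cell; contour lines are then drawn at chosen count levels. The Python sketch below assumes this plain grid count (plotting tools typically use smoothed density estimates instead), and the function name `grid_density` and the points are hypothetical.

```python
import math
from collections import Counter


def grid_density(points, cell_size):
    """Count how many (x, y) points fall into each square grid cell."""
    counts = Counter()
    for x, y in points:
        # Cell index = coordinate divided by the cell size, rounded down.
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts


# Made-up points for illustration only.
points = [(0.1, 0.2), (0.4, 0.9), (1.2, 0.3), (1.7, 0.8), (1.5, 1.5)]
density = grid_density(points, cell_size=1.0)
for cell, count in sorted(density.items()):
    print(cell, count)
```

The individual locations are lost, but the per-cell counts remain meaningful however many points are plotted.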

Two techniques in visualization - For large scale data

Transparent plotting

Jittering
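Jittering adds a small random offset to each value so that identical observations no longer sit exactly on top of each other. The Python sketch below is an added illustration; the function name `jitter`, the uniform noise, and the data are assumptions. Transparent plotting is a setting of the plotting library rather than a data transformation (for example, the `alpha` argument in matplotlib), so it is not shown in code.

```python
import random


def jitter(values, amount, seed=0):
    """Return a copy of values with uniform noise in [-amount, amount] added."""
    rng = random.Random(seed)   # fixed seed so the result is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Made-up discrete ratings: many points share exactly the same value.
ratings = [1, 1, 1, 2, 2, 3]
print(jitter(ratings, amount=0.2))
```

The noise amount should stay small relative to the spacing of the values, so that the jittered points still read as belonging to their original value.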

Histogram with different color - One continuous and one categorical variable

If one variable is categorical, we can use small multiples.

‘Color’ and ‘Shape’ can visually represent different categories!

Side-by-side boxplot - One continuous and one categorical variable

A boxplot can likewise show each level of a single categorical variable as an independent box.
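Behind a side-by-side boxplot is a simple grouping step: the continuous values are split by category, and each group is then summarized on its own. A minimal Python sketch follows; the function names, the category labels, and the values are all hypothetical.

```python
import statistics
from collections import defaultdict


def group_by_category(records):
    """Split (category, value) pairs into one list of values per category."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return dict(groups)


def medians_by_category(records):
    """Median of the continuous variable within each category."""
    return {c: statistics.median(v) for c, v in group_by_category(records).items()}


# Made-up records for illustration only.
records = [("A", 1), ("A", 3), ("A", 5), ("B", 10), ("B", 20)]
print(group_by_category(records))
print(medians_by_category(records))
```

Each list of values would become one box, drawn next to the others on a shared axis so the groups can be compared.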

Barcharts and Spineplots - One continuous and one categorical variable

Stacked barcharts can be used to compare continuous values across two or more categorical ones.

Pie charts - One continuous and one categorical variable

A very popular way to visualize data.

This is good for showing simple relations of proportions.

Barplots and histograms are usually better (but less pretty).

04 Visualization - More than two variables

Pairwise scatterplots - More than two variables

Pairwise scatterplots can represent multiple variables at once.

However, there can be difficulties in expressing categorical variables.
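A pairwise scatterplot display needs one panel per pair of variables, so k variables give k(k-1)/2 distinct pairs. The Python sketch below enumerates them; the function name `scatterplot_pairs` and the variable names are hypothetical.

```python
from itertools import combinations


def scatterplot_pairs(variables):
    """List every unordered pair of variables, one scatterplot panel each."""
    return list(combinations(variables, 2))


# Hypothetical variable names for illustration only.
variables = ["height", "weight", "age", "income"]
pairs = scatterplot_pairs(variables)
print(len(pairs), "panels")
for x_name, y_name in pairs:
    print(f"{y_name} vs {x_name}")
```

The number of panels grows quadratically with the number of variables, so this display becomes hard to read beyond a modest number of variables.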

Multivariate visualization - More than two variables

Creative thinking will be required to visualize multiple variables at the same time.

Conditioning on variables
• Trellis or lattice plots
• Different colors and shapes
• Infinite possibilities

Simple question - More than two variables

How many dimensions are represented here?

Parallel Coordinate plot - More than two variables

Networks plot - More than two variables

Visualizing networks is helpful, even if it is not obvious that a network exists.

Heatmap - More than two variables

Heatmaps are one of the most widely used visualization methods.

Interactivity - Important visualization approaches these days

As the World Wide Web becomes more common, the importance of interaction is growing in the field of visualization.

Demo

Multi-dimensional plot - More than two variables

One variable represents one dimension.

To represent three variables at once, three-dimensional space is required.

A multi-dimensional plot looks fancy.

But there is a practical issue due to the ‘Viewpoint’.

We end up visualizing in 2D spaces such as paper or a web page.

https://plot.ly/r/3d-scatter-plots/

Dimension reduction - Visualization

What if you need to visualize more than 1000 variables?

One possible way to visualize high-dimensional data is to reduce it to 2 or 3 dimensions.

Variable selection (Feature selection)

Principal Components Analysis (PCA)

Multi-dimensional scaling

This is the next topic!

End of Slide