Post on 15-Jun-2020
IAT 355 Visual Analytics
Data and Statistical Models Lyn Bartram
Exploring data
Example: US Census
• People: number of people in group
• Year: 1850 – 2000 (every decade)
• Age: 0 – 90+
• Sex (Gender): Male, Female
• Marital status: Single, Married, Divorced, …
• 2348 data points
Data and Image Models | IAT 4355 2 Slide adapted from Jeff Heer 2014
Census data: What type (N, O, Q)?
Example: US Census
• People: Q-Ratio
• Year: Q-Interval
• Age: Q-Ratio
• Sex (Gender): N
• Marital status: N
Slide adapted from Jeff Heer 2014
Census data: What type (N, O, Q)?
Example: US Census
• People: Count; Measure (dependent variable)
• Year: Dimension
• Age: ??
• Sex (Gender): Dimension
• Marital status: Dimension
Slide adapted from Jeff Heer 2014
Roll-up and Drill-Down
Want to examine marital status in each decade?
• Roll-up the data along the desired dimension
Slide adapted from Jeff Heer 2014
Roll-up and Drill-Down
Need more detailed information?
• Drill-down into additional dimensions
Slide adapted from Jeff Heer 2014
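Roll-up and drill-down can be sketched as the same aggregation run over fewer or more dimensions. The records below are hypothetical (invented counts, not real census figures), and the names `records`, `FIELDS` and `aggregate` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical records: (year, age_group, sex, marital_status, people).
# The counts are invented for illustration, not real census figures.
records = [
    (1850, "0-9", "Male", "Single", 120),
    (1850, "30-39", "Female", "Married", 80),
    (1850, "30-39", "Male", "Married", 75),
    (1860, "0-9", "Female", "Single", 140),
    (1860, "30-39", "Female", "Married", 95),
    (1860, "30-39", "Male", "Divorced", 5),
]
FIELDS = ("year", "age_group", "sex", "marital_status")


def aggregate(rows, dimensions):
    """Sum the people measure over every combination of the given dimensions."""
    idx = [FIELDS.index(d) for d in dimensions]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # the last field is the measure (people)
    return dict(totals)


# Roll-up: marital status in each decade, summing over age and sex.
rolled_up = aggregate(records, ["year", "marital_status"])

# Drill-down: add sex as a further dimension for more detail.
drilled_down = aggregate(records, ["year", "marital_status", "sex"])
```

Both views sum to the same grand total; only the level of detail changes.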
Distribution is important for understanding data
• Visualization helps us see relations, or trends in them, as visual patterns
• A lot of what we visualize is descriptive statistics
• Example: mean income vs median income
• Need to ensure that the aggregate units of visualization are legitimate
• Rule: check your core units/variables. If they are descriptive, look at the distribution
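A small sketch of the mean-versus-median point, assuming a hypothetical list of incomes in which one very high earner skews the data:

```python
from statistics import mean, median

# Hypothetical annual incomes; one very high earner skews the data.
incomes = [28_000, 31_000, 35_000, 38_000, 42_000, 45_000, 400_000]

mean_income = mean(incomes)      # pulled upwards by the single large value
median_income = median(incomes)  # the middle value, unaffected by the extreme
```

Here the mean is more than twice the median, so a chart of "average income" would describe almost nobody in the group. Looking at the distribution first shows which summary is appropriate.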
Example: job losses in US over time
Visualizing distribution
• We can’t really tell much about this data set
• Even Min and Max are hard to see
[Figure: chart of Data Values (axis 0 to 120) against X-axis labels 1 to 20]
We can get a better idea of this data by looking at its distribution.
Data distribution
• Measures of dispersion characterise how spread out the distribution is, i.e., how variable the data are.
• Commonly used measures of dispersion include:
1. Range
2. Variance & Standard deviation
3. Coefficient of Variation (or relative standard deviation)
4. Inter-quartile range
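The four measures listed above can be computed with Python's standard library. This is a minimal sketch on a hypothetical sample; it assumes population formulas (`pvariance`, `pstdev`) and the "inclusive" quartile convention, both of which are choices the slides do not specify.

```python
from statistics import mean, pstdev, pvariance, quantiles


def dispersion_summary(values):
    """Return the four dispersion measures for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    sd = pstdev(values)
    return {
        "range": max(values) - min(values),
        "variance": pvariance(values),
        "sd": sd,
        "cv": sd / mean(values),  # coefficient of variation (relative SD)
        "iqr": q3 - q1,
    }


# Hypothetical sample, chosen so the arithmetic is easy to check by hand.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = dispersion_summary(sample)
```

For this sample the range is 7, the variance 4, the standard deviation 2, the coefficient of variation 0.4 and the inter-quartile range 1.5.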
January 15, 2014 Data Mining: Concepts and Techniques
Distribution and symmetry
[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]
• Median, mean and mode of symmetric, positively and negatively skewed data
Adapted from Han, Kamber and Pei 2013
Properties of Normal Distribution Curve
• The normal (distribution) curve
• From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
• From μ–2σ to μ+2σ: contains about 95% of the measurements
• From μ–3σ to μ+3σ: contains about 99.7% of the measurements
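These percentages follow from the normal distribution itself: the fraction within k standard deviations of the mean is erf(k/√2). A short sketch (the function name `fraction_within` is an illustrative assumption):

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.1%}")
```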
Normal and Skewed Distributions
• When data are skewed, the mean and SD can be misleading
• Skewness: sk = 3(mean − median)/SD. If |sk| > 1, the distribution is non-symmetrical
• Negatively skewed: Mean < Median, sk is negative
• Positively skewed: Mean > Median, sk is positive
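The skewness formula above can be sketched directly. The data sets are hypothetical, and using the population standard deviation (`pstdev`) is an assumption, since the slide does not say which SD is meant.

```python
from statistics import mean, median, pstdev


def pearson_skewness(values):
    """sk = 3 * (mean - median) / SD, as given on the slide."""
    return 3 * (mean(values) - median(values)) / pstdev(values)


# Hypothetical data sets.
right_tail = [1, 2, 2, 3, 3, 4, 15]    # long tail to the right
left_tail = [-v for v in right_tail]   # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

sk_right = pearson_skewness(right_tail)  # positive: mean > median
sk_left = pearson_skewness(left_tail)    # negative: mean < median
sk_sym = pearson_skewness(symmetric)     # zero: mean == median
```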
[Figure: two example distribution plots]
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
• Measure of how spread out the numbers are
Adapted from Han, Kamber and Pei 2013
Measures of variance
• Variance: one measure of dispersion of a data set, the average squared deviation from the mean. The larger the variance, the greater the standard deviation.
• Standard deviation: the square root of the variance, roughly the typical deviation from the mean of a data set. It indicates overall how spread out the data values are.
• Variance and SD are critical in analysing your data distribution and determining how “meaningful” the chosen average is
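For reference, the standard definitions behind these two measures can be written as follows, using s for a sample and σ for a population. The symbols n, N, x̄ and μ (sample size, population size, sample mean, population mean) are the usual conventions rather than notation taken from the slides.

```latex
\begin{align}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, & s &= \sqrt{s^2} \\
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, & \sigma &= \sqrt{\sigma^2}
\end{align}
```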
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary
• Histogram: x-axis shows values, y-axis represents frequencies
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
• Quantile plot: each value xi is paired with fi indicating that approximately 100 fi % of data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Adapted from Han, Kamber and Pei 2013
Inter-quartile range
• The Median divides a distribution into two halves.
• The first and third quartiles (denoted Q1 and Q3) are defined as follows:
• 25% of the data lie below Q1 (and 75% lie above Q1)
• 25% of the data lie above Q3 (and 75% lie below Q3)
• The inter-quartile range (IQR) is the difference between the first and third quartiles, i.e. IQR = Q3- Q1
Outliers
• An outlier is a datum which does not appear to belong with the other data
• Outliers can arise because of a measurement or recording error or because of equipment failure during an experiment, etc.
• An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate presence of an illness in the patient.
Box-plots
• A box-plot is a visual description of the distribution based on: Minimum, Q1, Median, Q3, Maximum
• Useful for comparing large sets of data
Example 1: Box-plot
Outlier Boxplot
• Re-define the upper and lower limits of the boxplot (the whisker lines) as: Lower limit = Q1 − 1.5×IQR, and Upper limit = Q3 + 1.5×IQR
• Note that the whisker lines may not go as far as these limits: each ends at the most extreme data point that lies inside its limit
• If a data point is < lower limit or > upper limit, the data point is considered to be an outlier.
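The outlier rule above can be sketched in a few lines. The scores are hypothetical, and the "inclusive" quartile convention is an assumption; other conventions give slightly different quartiles and therefore slightly different limits.

```python
from statistics import quantiles


def outlier_limits(values):
    """Return (lower limit, upper limit) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Return the values that fall outside the two limits."""
    lower, upper = outlier_limits(values)
    return [v for v in values if v < lower or v > upper]


# Hypothetical scores with one suspiciously large value.
scores = [52, 55, 58, 60, 61, 63, 65, 67, 120]
limits = outlier_limits(scores)
outliers = find_outliers(scores)
```

With these scores Q1 = 58 and Q3 = 65, so the limits are 47.5 and 75.5 and only the value 120 is flagged. The upper whisker would end at 67, the largest value inside the limit.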
Visualization of Data Dispersion: 3-D Boxplots
Histogram
• Most common form: split data range into equal-sized bins and count the number of points from the data set that fall into each bin.
• Vertical axis: Frequency (i.e., counts for each bin)
• Horizontal axis: Response variable
• The histogram graphically shows the following:
1. center (i.e., the location) of the data;
2. spread (i.e., the scale) of the data;
3. skewness of the data;
4. presence of outliers; and
5. presence of multiple modes in the data.
Q → Q’ → N
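The binning step described above can be sketched without any plotting library. The function name `histogram_counts` and the data are illustrative assumptions; the top edge of the range is assigned to the last bin, which is one common convention.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width bins on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value exactly on the top edge belongs to the last bin.
            i = min(int((v - low) / width), bins - 1)
            counts[i] += 1
    return counts


# Hypothetical data, binned into 5 bins of width 2 over the range 0 to 10.
data = [1, 2, 2, 3, 5, 7, 8, 8, 9, 10]
counts = histogram_counts(data, low=0, high=10, bins=5)
```

Drawing one column per entry of `counts` gives the histogram: the quantitative variable has been turned into a small set of ordered bins.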
Histograms Often Tell More than Boxplots
• The two histograms shown on the left may have the same boxplot representation
• The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions
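A small sketch of this effect with two hypothetical data sets: one made of two tight clusters, one spread evenly. The "inclusive" quartile convention is an assumption; the data were chosen so that the quartiles fall exactly on data points.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


# Hypothetical data: two tight clusters versus an even spread.
clustered = [0, 2, 2, 2, 2, 2, 5, 8, 8, 8, 8, 8, 10]
even = [0, 1, 1.5, 2, 3, 4, 5, 6, 7, 8, 8.5, 9, 10]

same_boxplot = five_number_summary(clustered) == five_number_summary(even)
```

Both sets share the summary (0, 2, 5, 8, 10), so their boxplots are identical, yet a histogram of `clustered` shows two peaks while `even` is roughly flat.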
Plotting the distribution
• Determine a frequency table (bins)
• A histogram is a column chart of the frequencies

Category Labels   Frequency
0-50              3
51-60             2
61-70             6
71-80             5
81-90             3
>90               1

[Figure: histogram of Frequency (axis 0 to 7) for the Scores bins 0-50, 51-60, 61-70, 71-80, 81-90, >90]
Issues with Histograms
• For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.
• Interactive bin-width example (online applet): http://www.stat.sc.edu/~west/javahtml/Histogram.html
• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
• Histograms effectively only work with 1 variable at a time
• Difficult to extend to 2 dimensions, not possible for >2
• So histograms tell us nothing about the relationships among variables
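The sensitivity to bucket boundaries on small data sets can be illustrated with a sketch. The scores are hypothetical and `bin_counts` is an illustrative helper; the bin width stays at 10 and only the starting edge moves.

```python
def bin_counts(values, start, width, bins):
    """Count values in `bins` consecutive bins of equal width beginning at `start`."""
    counts = [0] * bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < bins:
            counts[i] += 1
    return counts


# Hypothetical small data set.
scores = [18, 19, 19, 20, 21, 21, 22, 38, 39, 40, 41, 42]

aligned = bin_counts(scores, start=10, width=10, bins=4)  # edges 10, 20, 30, 40, 50
shifted = bin_counts(scores, start=15, width=10, bins=3)  # edges 15, 25, 35, 45
```

With edges at multiples of 10 the counts are [3, 4, 2, 3], which looks roughly flat. Shifting the edges by 5 gives [7, 0, 5], which shows two separate groups with an empty bin between them.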
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Positively and Negatively Correlated Data
• The left half fragment is positively correlated
• The right half is negatively correlated
Uncorrelated Data
Slide adapted from David Lippman's
Correlation
A correlation exists between two variables when one of them is related to the other in some way.
A scatterplot is a graph in which the paired (x,y) sample data are plotted.
The linear correlation coefficient r measures the strength of the linear relationship.
• Also called the Pearson correlation coefficient.
• Ranges from -1 to 1.
• r = 1 represents a perfect positive correlation.
• r = 0 represents no linear correlation.
• r = -1 represents a perfect negative correlation.
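A minimal sketch of the Pearson coefficient computed from its standard definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations). The function name and the data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient of paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])     # points on a rising line
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])     # points on a falling line
r_v = pearson_r([1, 2, 3, 4, 5], [2, 1, 0, 1, 2])  # V shape: related, but not linearly
```

The V-shaped data give r = 0 even though y clearly depends on x, which is why r = 0 should be read as "no linear correlation" rather than "no relationship".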
Correlation
• Assesses the linear relationship between two variables
• Example: height and weight
• Strength of the association is described by a correlation coefficient, r
• r = 0 – .2: low, probably meaningless
• r = .2 – .4: low, possible importance
• r = .4 – .6: moderate correlation
• r = .6 – .8: high correlation
• r = .8 – 1: very high correlation
• Can be positive or negative
• Pearson’s, Spearman correlation coefficient
• Tells nothing about causation
Slide adapted from David Lippman's
[Figure: six scatterplots: perfect positive correlation (r = 1); strong positive correlation (r = 0.99); positive correlation (r = 0.80); strong negative correlation (r = -0.98); no correlation (r = 0.16); non-linear relationship]
Slide adapted from David Lippman's
Meanings
r² represents the proportion of the variation in y that is explained by the linear relationship between x and y.
Example: Using the heights and weights for a group of people, you find the correlation coefficient to be r = 0.796, so r² = 0.634.
So we conclude that about 63.4% of the variation in people’s weights can be explained by the linear relationship between height and weight. This suggests that 36.6% of the variation in weights cannot be explained by height.
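The arithmetic in this example can be checked directly; only the value r = 0.796 comes from the slide, and the variable names are illustrative.

```python
r = 0.796  # correlation coefficient from the height/weight example

r_squared = r ** 2
explained = round(r_squared, 3)        # share of variation explained by height
unexplained = round(1 - r_squared, 3)  # share left unexplained
```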
r² in Tableau
Slide adapted from David Lippman's
Example: Relationship between Tree Circumference and Height
[Figure: scatterplot of Height (ft), axis 0 to 100, against Circumference (ft), axis 0 to 15]
Slide adapted from David Lippman's
Relationship between Tree Circumference and Height
[Figure: scatterplot of Height (ft), axis 0 to 100, against Circumference (ft), axis 0 to 15]
Outliers can strongly influence the graph of the regression line and inflate the correlation coefficient. In the above example, removing the outlier drops the correlation coefficient from r = 0.828 to r = 0.678.
Correlation
Source: Altman. Practical Statistics for Medical Research
[Figure: four scatterplots with correlation coefficients 0, .3, -.5 and .7]
Summary
• Statistical models serve to inspect and categorise the nature of trends and relations between variables and factors (effects)
• Distribution is a critical element in deciding which statistical measures to use; it should be the lens through which you determine the appropriate metric
• “Eyeballing” your distribution is a first step in forming your next queries
Four sets of data with the same correlation of 0.816
[Figure: Tableau scatterplot, Sheet 7. Average of user_score vs. average of actual_score. Color shows details about action_name: Dry Clothes, Dry Your Hair, Fridge Setting, Heat Milk, Lighting, Make Fried Eggs, Make Toast, Play Games, Room Temperature Control, Turn Off Unused Appliances, Unplug Unused Appliances, Wash Clothes, Wash Dishes, Wash Yourself, Watch Movies]
[Figure: Tableau chart, Sheet 8. Average of actual_score for each display (chart, treewithberry, treewithoutber..). Color shows details about action_name: Dry Clothes, Dry Your Hair, Fridge Setting, Heat Milk, Lighting, Make Fried Eggs, Make Toast, Play Games, Room Temperature Control, Turn Off Unused Appliances, Unplug Unused Appliances, Wash Clothes, Wash Dishes, Wash Yourself, Watch Movies]