IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram ....

45
IAT 355 Visual Analytics Data and Statistical Models Lyn Bartram

Transcript of IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram ....

Page 1: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

IAT 355 Visual Analytics

Data and Statistical Models Lyn Bartram

Page 2: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Exploring data

Example: US Census •  People # of people in group •  Year # 1850 – 2000 (every decade) •  Age # 0 – 90+ •  Sex (Gender) # Male, female •  Marital status # Single, Married, Divorced, …

•  2348 data points

Data and Image Models | IAT 4355 2 Slide adapted from Jeff Heer 2014

Page 3: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Census data: What type ( N, O, Q)?

Example: US Census •  People Q- Ratio •  Year Q- interval •  Age Q - Ratio •  Sex (Gender) N •  Marital status N

Data and Image Models | IAT 4355 3 Slide adapted from Jeff Heer 2014

Page 4: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Census data: What type ( N, O, Q)?

Example: US Census •  People Count Measure (dependent variable) •  Year Dimension •  Age ?? •  Sex (Gender) Dimension •  Marital status Dimension

Data and Image Models | IAT 4355 4 Slide adapted from Jeff Heer 2014

Page 5: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Roll-up and Drill-Down

Want to examine marital status in each decade •  Roll-up the data along the desired dimension

Data and Image Models | IAT 4355 5 Slide adapted from Jeff Heer 2014

Page 6: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Roll-up and Drill-Down

Need more detailed information? •  Drill-down into additional dimensions

Data and Image Models | IAT 4355 6 Slide adapted from Jeff Heer 2014

Page 7: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Data and Image Models | IAT 4355 7 Slide adapted from Jeff Heer 2014

Page 8: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Distribution is important for understanding data •  Visualization helps us see relations – or the trends of them - as

visual patterns

•  a lot of what we visualize are the descriptive statistics •  Example: mean income vs median income •  Need to ensure that the aggregate units of visualization are legit

•  Rule: check your core units /variables. If hey are descriptive, look at the distribution

Data and Image Models | IAT 4355 8

Page 9: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Example: job losses in US over time

Data and Image Models | IAT 4355 9

Page 10: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Example: job losses in US over time

Data and Image Models | IAT 4355 10

Page 11: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Data and Image Models | IAT 4355 11

Page 12: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Visualizing distribution

•  We can’t really tell much about this data set

•  Even Min and Max are hard to see

Data and Image Models | IAT 4355

0

20

40

60

80

100

120

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

X-axis labels

Data

Val

ues

We can get a better idea of this data by looking at its distribution.

12

Page 13: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Data distribution

•  Measures of dispersion characterise how spread out the distribution is, i.e., how variable the data are.

•  Commonly used measures of dispersion include: 1.  Range 2.  Variance & Standard deviation 3.  Coefficient of Variation (or relative standard deviation) 4.  Inter-quartile range

Data and Image Models | IAT 4355 13

Page 14: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

January 15, 2014 Data Mining: Concepts and Techniques

14

Distribution and symmetry

positively skewed negatively skewed

symmetric

Adapted from Han, Kamber and Pei 2013

•  Median,  mean  and  mode  of  symmetric,  posi3vely  and  nega3vely  skewed  data  

Page 15: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

15

Properties of Normal Distribution Curve

•  The  normal  (distribu3on)  curve  •  From  μ–σ  to  μ+σ:  contains  about  68%  of  the  measurements    (μ:  mean,  σ:  standard  devia3on)  

•   From  μ–2σ  to  μ+2σ:  contains  about  95%  of  it  •  From  μ–3σ  to  μ+3σ:  contains  about  99.7%  of  it  

 

Page 16: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Normal and Skewed Distributions

•  When data are skewed, the mean and SD can be misleading

•  Skewness sk= 3(mean-median)/SD If sk>|1| then distribution is non-symetrical

•  Negatively skewed •  Mean<Median •  Sk is negative

•  Positively Skewed •  Mean>Median •  Sk is positive

Data and Image Models | IAT 4355

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0 20 40 60 80 100 120 140 160

0

0.02

0.04

0.06

0.08

0.1

0.12

25 45 65 85 105 125 145 165 185 205 225

16

Page 17: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Measuring the Dispersion of Data

•  Quar3les,  outliers  and  boxplots  

•  Quar%les:  Q1  (25th  percen3le),  Q3  (75th  percen3le)  

•  Inter-­‐quar%le  range:  IQR  =  Q3  –  Q1    

•  Five  number  summary:  min,  Q1,  median,  Q3,  max  

•  Boxplot:  ends  of  the  box  are  the  quar3les;  median  is  marked;  add  whiskers,  and  plot  outliers  individually  

•  Outlier:  usually,  a  value  higher/lower  than  1.5  x  IQR  

•  Variance  and  standard  devia%on  (sample:  s,  popula,on:  σ)  

•  Measure  of  how  spread  out  the  numbers  are  

17

Adapted from Han, Kamber and Pei 2013

Page 18: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Measures of variance

•  Variance •  One measure of dispersion (deviation from the mean) of a data

set. The larger the variance, the greater is the standard deviation.

•  Standard Deviation •  the average deviation from the mean of a data set. •  Determines overall how spread out the data values are

•  Variance and SD are critical in analysing your data distribution and determining how “meaningful” is the chosen average

Data and Image Models | IAT 4355 18

Page 19: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Graphic Displays of Basic Statistical Descriptions

•  Boxplot:  graphic  display  of  five-­‐number  summary  

•  Histogram:  x-­‐axis  are  values,  y-­‐axis  repres.  frequencies    

•  Sca?er  plot:  each  pair  of  values  is  a  pair  of  coordinates  and  ploZed  as  

points  in  the  plane  

•  Quan%le  plot:    each  value  xi    is  paired  with  fi    indica3ng  that  

approximately  100  fi  %  of  data    are  ≤  xi    

•  Quan%le-­‐quan%le  (q-­‐q)  plot:  graphs  the  quan3les  of  one  univariant  

distribu3on  against  the  corresponding  quan3les  of  another  

19 Adapted from Han, Kamber and Pei 2013

Page 20: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

20

Inter-quartile range

•  The Median divides a distribution into two halves.

•  The first and third quartiles (denoted Q1 and Q3) are defined as follows: •  25% of the data lie below Q1 (and 75% is above Q1),

•  25% of the data lie above Q3 (and 75% is below Q3)

•  The inter-quartile range (IQR) is the difference between the first and third quartiles, i.e. IQR = Q3- Q1

Data and Image Models | IAT 4355

Page 21: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

21

Outliers

•  An outlier is an datum which does not appear to belong with the other data

•  Outliers can arise because of a measurement or recording error or because of equipment failure during an experiment, etc.

•  An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate presence of an illness in the patient.

Data and Image Models | IAT 4355

Page 22: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

22

Box-plots

•  A box-plot is a visual description of the distribution based on •  Minimum •  Q1 •  Median •  Q3 •  Maximum

•  Useful for comparing large sets of data

Data and Image Models | IAT 4355

Page 23: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

23

Example 1: Box-plot

Data and Image Models | IAT 4355

Page 24: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

24

Outlier Boxplot

•  Re-define the upper and lower limits of the boxplots (the whisker lines) as: Lower limit = Q1-1.5×IQR, and Upper limit = Q3+1.5×IQR

•  Note that the lines may not go as far as these limits

•  If a data point is < lower limit or >

upper limit, the data point is considered to be an outlier.

Data and Image Models | IAT 4355

Page 25: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

January 15, 2014 Data Mining: Concepts and Techniques

25

Visualization of Data Dispersion: 3-D Boxplots

Page 26: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Data and Image Models | IAT 4355

Histogram

•  Most common form: split data range into equal-sized bins and count the number of points from the data set that fall into the bin. •  Vertical axis: Frequency (i.e., counts for each bin) •  Horizontal axis: Response variable

•  The histogram graphically shows the following: 1.  center (i.e., the location) of the data; 2.  spread (i.e., the scale) of the data; 3.  skewness of the data; 4.  presence of outliers; and 5.  presence of multiple modes in the data.

26

Q à Q’ à N

Page 27: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Histograms Often Tell More than Boxplots

27

n  The  two  histograms  shown  in  the  le\  may  have  the  same  boxplot  representa3on  

n  The  same  values  for:  min,  Q1,  median,  Q3,  max  

n  But  they  have  rather  different  data  distribu3ons  

Page 28: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Plotting the distribution

•  Determine a frequency table (bins) •  A histogram is a column chart of the frequencies

Data and Image Models | IAT 4355

Category Labels Frequency 0-50 3

51-60 2 61-70 6 71-80 5 81-90 3 >90 1 0

1

2

3

4

5

6

7

0-50 51-60 61-70 71-80 81-90 >90

Scores

Frequency

28

Page 29: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Data and Image Models | IAT 4355

Issues with Histograms

•  For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.

•  Interactive bin-width example (online applet) •  http://www.stat.sc.edu/~west/javahtml/Histogram.html

•  For large data sets, histograms can be quite effective at illustrating general

properties of the distribution.

•  Histograms effectively only work with 1 variable at a time •  Difficult to extend to 2 dimensions, not possible for >2 •  So histograms tell us nothing about the relationships among variables

29

Page 30: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

30

Scatter plot

•  Provides  a  first  look  at  bivariate  data  to  see  clusters  of  points,  outliers,  etc  

•  Each  pair  of  values  is  treated  as  a  pair  of  coordinates  and  ploZed  as  points  in  the  plane  

Page 31: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Positively and Negatively Correlated Data

•  The  le\  half  fragment  is  posi3vely  correlated  

•  The  right  half  is  nega3vely    correlated  

31

Page 32: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

32

Uncorrelated Data

Page 33: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Slide adapted from David Lippman's

Correlation

A correlation exists between two variables when one of them is related to the other in some way. A scatterplot is a graph in which the paired (x,y) sample data are plotted on a graph. The linear correlation coefficient r measures the strength of the linear relationship.

•  Also called the Pearson correlation coefficient. •  Ranges from -1 to 1.

r = 1 represents a perfect positive correlation. r = 0 represents no correlation r = -1 represents a perfect negative correlation

Page 34: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Correlation

•  Assesses the linear relationship between two variables •  Example: height and weight

•  Strength of the association is described by a correlation coefficient- r

•  r = 0 - .2 low, probably meaningless •  r = .2 - .4 low, possible importance •  r = .4 - .6 moderate correlation •  r = .6 - .8 high correlation •  r = .8 - 1 very high correlation

•  Can be positive or negative •  Pearson’s, Spearman correlation coefficient •  Tells nothing about causation

Page 35: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Slide adapted from David Lippman's

Perfect positive Strong positive Positive correlation r = 1 correlation r = 0.99 correlation r = 0.80

Strong negative No Correlation Non-linear correlation r = -0.98 r = 0.16 relationship

Page 36: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Slide adapted from David Lippman's

Meanings

r2 represents the proportion of the variation in y that is explained by the linear relationship between x and y.

Example: Using the heights and weights for a group of people, you find the correlation coefficient to be:

r = 0.796, so r2 = 0.634.

So we conclude that about 63.4% of the peoples’ weight can be

explained by the relationship between height and weight. This suggests that 36.6% of the variation in weights cannot be explained by height.

Page 37: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

r2 in Tableau

Page 38: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Slide adapted from David Lippman's

Example: Relationship between Tree Circumference and Height

0

10

20

30

40

50

60

70

80

90

100

0 5 10 15

Circumference (ft)

Heig

ht (f

t)

Page 39: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Slide adapted from David Lippman's

Relationship between Tree Circumference and Height

0

10

20

30

40

50

60

70

80

90

100

0 5 10 15

Circumference (ft)

Heig

ht (f

t)

Outliers can strongly influence the graph of the regression line and inflate the correlation coefficient. In the above example, removing the outlier drops the correlation coefficient from r = 0.828 to r = 0.678.

Page 40: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Correlation

Source: Altman. Practical Statistics for Medical Research

Correlation Coefficient 0 Correlation Coefficient .3

Page 41: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Correlation

Source: Altman. Practical Statistics for Medical Research

Correlation Coefficient -.5 Correlation Coefficient .7

Page 42: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Summary

•  Statistical models serve to inspect and categorise the nature of trends and relations between variables and fators (effects)

•  Distribution is a critical element in deciding what statistical measures to use, should be the lens by which you determine the appropriate metric

•  “eyeballing” your distribution is a first step in forming your next queries

42

Page 43: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Four sets of data with the same correlation of 0.816

Page 44: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9Avg.  user_score

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

Avg.  actual_score

Sheet 7action_name

Dry  Clothes

Dry  Your  Hair

Fridge  Setting

Heat  Milk

Lighting

Make  Fried  Eggs

Make  Toast

Play  Games

Room  Temperature  Control

Turn  Off  Unused  Appliances

Unplug  Unused  Appliances

Wash  Clothes

Wash  Dishes

Wash  Yourself

Watch  Movies

Average  of  user_score  vs.  average  of  actual_score.    Color  shows  details  about  action_name.

44

Page 45: IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram . Exploring data Example: US Census • People # of people in group • Year # 1850

Data and Image Models | IAT 4355 45

display

chart treewithberry treewithoutber..

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

Avg.  actual_score

Sheet 8action_name

Dry  Clothes

Dry  Your  Hair

Fridge  Setting

Heat  Milk

Lighting

Make  Fried  Eggs

Make  Toast

Play  Games

Room  Temperature  Control

Turn  Off  Unused  Appliances

Unplug  Unused  Appliances

Wash  Clothes

Wash  Dishes

Wash  Yourself

Watch  Movies

Average  of  actual_score  for  each  display.    Colorshows  details  about  action_name.