IAT 355 Visual Analytics Data and Statistical Models · Data and Statistical Models Lyn Bartram ....

IAT 355 Visual Analytics

Data and Statistical Models Lyn Bartram

Exploring data

Example: US Census
•  People: # of people in group
•  Year: 1850 – 2000 (every decade)
•  Age: 0 – 90+
•  Sex (Gender): Male, Female
•  Marital status: Single, Married, Divorced, …

•  2348 data points

Data and Image Models | IAT 4355 | Slide adapted from Jeff Heer 2014

Census data: What type (N, O, Q)?

Example: US Census
•  People: Q - Ratio
•  Year: Q - Interval
•  Age: Q - Ratio
•  Sex (Gender): N
•  Marital status: N

Census data: What type (N, O, Q)?

Example: US Census
•  People (count): Measure (dependent variable)
•  Year: Dimension
•  Age: ??
•  Sex (Gender): Dimension
•  Marital status: Dimension

Roll-up and Drill-Down

Want to examine marital status in each decade?
•  Roll up the data along the desired dimension

Roll-up and Drill-Down

Need more detailed information?
•  Drill down into additional dimensions
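Roll-up and drill-down can be sketched as grouping the same rows by fewer or more dimensions. The example below is a minimal illustration, not the census data itself: the rows, the `FIELDS` tuple and the `aggregate` helper are all hypothetical, and the counts are made up.

```python
from collections import defaultdict

# Hypothetical rows: (year, age_group, sex, marital_status, people).
# The counts are made up for illustration, not real census figures.
rows = [
    (1900, "20-29", "Male", "Single", 120),
    (1900, "20-29", "Female", "Married", 150),
    (1900, "30-39", "Male", "Married", 200),
    (2000, "20-29", "Male", "Single", 300),
    (2000, "20-29", "Female", "Single", 280),
    (2000, "30-39", "Female", "Married", 260),
]
FIELDS = ("year", "age_group", "sex", "marital_status")


def aggregate(rows, dims):
    """Sum the people measure over every combination of the given dimensions."""
    idx = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


# Roll-up: marital status in each decade (age and sex are summed away).
rollup = aggregate(rows, ["year", "marital_status"])

# Drill-down: add sex as a further dimension to get more detail.
drilldown = aggregate(rows, ["year", "marital_status", "sex"])

print(rollup)
print(drilldown)
```

Both views sum to the same grand total; only the level of detail changes.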

Distribution is important for understanding data
•  Visualization helps us see relations, or trends in them, as visual patterns
•  A lot of what we visualize are descriptive statistics
•  Example: mean income vs median income
•  Need to ensure that the aggregate units of visualization are legitimate
•  Rule: check your core units/variables. If they are descriptive statistics, look at the distribution


Example: job losses in US over time

Visualizing distribution

•  We can’t really tell much about this data set

•  Even Min and Max are hard to see


We can get a better idea of this data by looking at its distribution.

Data distribution

•  Measures of dispersion characterise how spread out the distribution is, i.e., how variable the data are.

•  Commonly used measures of dispersion include:
1.  Range
2.  Variance & Standard deviation
3.  Coefficient of Variation (or relative standard deviation)
4.  Inter-quartile range
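The four measures listed above can be computed with Python's standard `statistics` module. This is a sketch on made-up values; quartile conventions differ between tools, so the IQR here (the module's default "exclusive" method) may not match what another package reports.

```python
import statistics

# Made-up sample values, for illustration only.
data = [12, 15, 11, 18, 14, 13, 40, 16, 15, 14]

value_range = max(data) - min(data)
variance = statistics.variance(data)          # sample variance (n - 1 denominator)
std_dev = statistics.stdev(data)              # sample standard deviation
coeff_var = std_dev / statistics.mean(data)   # relative standard deviation
q1, median, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
iqr = q3 - q1

print(f"range={value_range}, variance={variance:.2f}, sd={std_dev:.2f}")
print(f"cv={coeff_var:.2%}, IQR={iqr:.2f}")
```

The single large value (40) stretches the range and the standard deviation far more than it stretches the IQR.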

January 15, 2014 Data Mining: Concepts and Techniques

Distribution and symmetry

Figure panels: symmetric, positively skewed, negatively skewed

Adapted from Han, Kamber and Pei 2013

•  Median, mean and mode of symmetric, positively and negatively skewed data

Properties of Normal Distribution Curve

•  The normal (distribution) curve
•  From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
•  From μ–2σ to μ+2σ: contains about 95% of them
•  From μ–3σ to μ+3σ: contains about 99.7% of them
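The 68 / 95 / 99.7 figures can be checked from the normal distribution itself: the probability of falling within k standard deviations of the mean is erf(k / √2). The helper name `coverage_within` is chosen here for illustration.

```python
import math


def coverage_within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage_within(k):.4f}")
```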

Normal and Skewed Distributions

•  When data are skewed, the mean and SD can be misleading

•  Skewness: sk = 3 × (mean − median) / SD. If |sk| > 1, the distribution is non-symmetrical

•  Negatively skewed
•  Mean < Median
•  sk is negative

•  Positively skewed
•  Mean > Median
•  sk is positive
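The skewness formula on this slide is easy to try out. The sketch below uses the sample standard deviation for SD (an assumption, since the slide does not say which) and made-up data with a long tail on one side.

```python
import statistics


def pearson_skew(values):
    """sk = 3 * (mean - median) / SD, using the sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


# Made-up data: a long right tail pulls the mean above the median.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
# Mirror image: a long left tail pulls the mean below the median.
left_tailed = [-v for v in right_tailed]
symmetric = [1, 2, 3, 4, 5]

for name, values in (("right", right_tailed), ("left", left_tailed), ("symmetric", symmetric)):
    print(f"{name}: sk = {pearson_skew(values):.2f}")
```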


Measuring the Dispersion of Data

•  Quartiles, outliers and boxplots

•  Quartiles: Q1 (25th percentile), Q3 (75th percentile)

•  Inter-quartile range: IQR = Q3 – Q1

•  Five number summary: min, Q1, median, Q3, max

•  Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually

•  Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

•  Variance and standard deviation (sample: s, population: σ)

•  Measure of how spread out the numbers are
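The five number summary behind a boxplot can be sketched in a few lines. The scores are made up, and the quartiles use the `statistics` module's "inclusive" method; other tools may interpolate slightly differently.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up scores, for illustration only.
scores = [45, 52, 58, 61, 63, 66, 68, 70, 72, 75, 77, 80, 84, 88, 95]
low, q1, median, q3, high = five_number_summary(scores)
iqr = q3 - q1

print(f"min={low}, Q1={q1}, median={median}, Q3={q3}, max={high}, IQR={iqr}")
```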

Adapted from Han, Kamber and Pei 2013

Measures of variance

•  Variance
•  One measure of dispersion (deviation from the mean) of a data set. The larger the variance, the greater the standard deviation.

•  Standard Deviation
•  The square root of the variance; roughly, the typical deviation of a value from the mean of the data set.
•  Determines overall how spread out the data values are

•  Variance and SD are critical in analysing your data distribution and determining how “meaningful” the chosen average is
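To see why the spread matters when judging an average, the sketch below computes the sample variance from its definition for two made-up lists that share the same mean. The `sample_variance` helper and both lists are illustrative assumptions.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    squared_devs = [(v - mean) ** 2 for v in values]
    return sum(squared_devs) / (len(values) - 1)


# Two made-up income lists (in thousands) with the same mean of 50.
tight = [48, 49, 50, 51, 52]
spread = [10, 20, 50, 80, 90]

for name, values in (("tight", tight), ("spread", spread)):
    sd = math.sqrt(sample_variance(values))
    print(f"{name}: mean={statistics.mean(values)}, sd={sd:.2f}")
```

Both lists have a mean of 50, but only in the first does that mean describe a typical value well.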

Graphic Displays of Basic Statistical Descriptions

•  Boxplot: graphic display of the five-number summary

•  Histogram: x-axis shows values, y-axis represents frequencies

•  Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

•  Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi

•  Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another

Adapted from Han, Kamber and Pei 2013
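The pairing used in a quantile plot can be sketched as follows. The slide does not give a formula for fi; the code assumes the common convention fi = (i − 0.5) / n for the i-th smallest value, and the prices are made up.

```python
def quantile_pairs(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Made-up unit prices, for illustration only.
prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]
pairs = quantile_pairs(prices)

for f, x in pairs:
    print(f"f={f:.3f}  x={x}")
```

Plotting x against f gives the quantile plot; plotting the quantiles of two such lists against each other gives a q-q plot.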

Inter-quartile range

•  The Median divides a distribution into two halves.

•  The first and third quartiles (denoted Q1 and Q3) are defined as follows:
•  25% of the data lie below Q1 (and 75% lie above Q1)
•  25% of the data lie above Q3 (and 75% lie below Q3)

•  The inter-quartile range (IQR) is the difference between the third and first quartiles, i.e. IQR = Q3 - Q1

Outliers

•  An outlier is a datum which does not appear to belong with the other data

•  Outliers can arise because of a measurement or recording error or because of equipment failure during an experiment, etc.

•  An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate presence of an illness in the patient.

Box-plots

•  A box-plot is a visual description of the distribution based on
•  Minimum
•  Q1
•  Median
•  Q3
•  Maximum

•  Useful for comparing large sets of data

Example 1: Box-plot

Outlier Boxplot

•  Re-define the upper and lower limits of the boxplot (the whisker lines) as:
Lower limit = Q1 - 1.5 × IQR, and
Upper limit = Q3 + 1.5 × IQR

•  Note that the lines may not go as far as these limits

•  If a data point is < lower limit or > upper limit, the data point is considered to be an outlier.
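The limits above translate directly into code. This is a sketch on made-up measurements; `boxplot_stats` is a hypothetical helper, and the quartiles use the `statistics` module's "inclusive" method, which is one of several conventions.

```python
import statistics


def boxplot_stats(values):
    """Return limits, whisker ends, and outliers using the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    # Whiskers stop at the most extreme data points still inside the limits.
    return {
        "limits": (lower_limit, upper_limit),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


# Made-up measurements with one suspicious value.
data = [21, 23, 24, 25, 25, 26, 27, 28, 30, 58]
stats = boxplot_stats(data)
print(stats)
```

The whiskers end at actual data points, which is why the drawn lines may fall short of the computed limits.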


Visualization of Data Dispersion: 3-D Boxplots

Histogram

•  Most common form: split data range into equal-sized bins and count the number of points from the data set that fall into each bin.
•  Vertical axis: Frequency (i.e., counts for each bin)
•  Horizontal axis: Response variable

•  The histogram graphically shows the following:
1.  center (i.e., the location) of the data;
2.  spread (i.e., the scale) of the data;
3.  skewness of the data;
4.  presence of outliers; and
5.  presence of multiple modes in the data.
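The equal-sized binning described above can be sketched without any plotting library. The `histogram` helper and the data are illustrative assumptions; the code also assumes the values are not all identical, since the bin width would then be zero.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width bins spanning [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


# Made-up response values, for illustration only.
data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 7, 8, 12]
edges, counts = histogram(data, bins=5)

for left, count in zip(edges, counts):
    print(f"{left:5.1f}+ | {'#' * count}")
```

Even in text form the output shows the center, the spread, and a lone value far to the right.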

Q → Q’ → N

Histograms Often Tell More than Boxplots

•  The two histograms shown on the left may have the same boxplot representation
•  The same values for: min, Q1, median, Q3, max

•  But they have rather different data distributions

Plotting the distribution

•  Determine a frequency table (bins)
•  A histogram is a column chart of the frequencies

Category Labels    Frequency
0-50               3
51-60              2
61-70              6
71-80              5
81-90              3
>90                1

Figure: column chart of the frequency table; x-axis: Scores, y-axis: Frequency
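The frequency table from this slide can be turned into a rough text column chart. The bin labels and counts are taken from the table; `render_bars` is a hypothetical helper.

```python
# Frequency table from the slide: score bins and their counts.
frequency_table = [
    ("0-50", 3),
    ("51-60", 2),
    ("61-70", 6),
    ("71-80", 5),
    ("81-90", 3),
    (">90", 1),
]


def render_bars(table):
    """Return one text bar per bin; bar length equals the frequency."""
    return [f"{label:>6} | {'#' * count} ({count})" for label, count in table]


for line in render_bars(frequency_table):
    print(line)

total = sum(count for _, count in frequency_table)
modal_bin = max(frequency_table, key=lambda row: row[1])[0]
print(f"total={total}, modal bin={modal_bin}")
```

Note that the first bin (0-50) is much wider than the others, so its bar is not directly comparable with the ten-point bins.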

Issues with Histograms

•  For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.

•  Interactive bin-width example (online applet) •  http://www.stat.sc.edu/~west/javahtml/Histogram.html

•  For large data sets, histograms can be quite effective at illustrating general properties of the distribution.

•  Histograms effectively only work with 1 variable at a time
•  Difficult to extend to 2 dimensions, not possible for >2
•  So histograms tell us nothing about the relationships among variables
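The sensitivity of small-sample histograms to bucket boundaries can be shown by binning the same values twice with the bins shifted. The data and the `bin_counts` helper are made up for this sketch.

```python
def bin_counts(values, start, width, bins):
    """Count values in `bins` bins of equal width beginning at `start`."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:
            counts[index] += 1
    return counts


# Made-up small data set, for illustration only.
data = [9, 10, 11, 19, 20, 21, 29, 30, 31]

aligned = bin_counts(data, start=0, width=10, bins=4)   # bins start at 0, 10, 20, 30
shifted = bin_counts(data, start=5, width=10, bins=4)   # bins start at 5, 15, 25, 35

print("aligned:", aligned)
print("shifted:", shifted)
```

With the shifted bins the data look like three even clusters; with the aligned bins the same nine points look lopsided.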

Scatter plot

•  Provides a first look at bivariate data to see clusters of points, outliers, etc.

•  Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data

•  The left half fragment is positively correlated

•  The right half is negatively correlated

Uncorrelated Data

Slide adapted from David Lippman's

Correlation

A correlation exists between two variables when one of them is related to the other in some way. A scatterplot is a graph in which the paired (x, y) sample data are plotted as points. The linear correlation coefficient r measures the strength of the linear relationship.

•  Also called the Pearson correlation coefficient.
•  Ranges from -1 to 1.

r = 1 represents a perfect positive correlation.
r = 0 represents no linear correlation.
r = -1 represents a perfect negative correlation.
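A minimal sketch of the Pearson coefficient, computed from its definition (covariance divided by the product of the spreads). The three y-lists are made up to hit the landmark values of r.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient of paired samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2 * x + 1 for x in xs]))    # points on a rising line
print(pearson_r(xs, [10 - 3 * x for x in xs]))   # points on a falling line
print(pearson_r(xs, [4, 1, 0, 1, 4]))            # symmetric curve, no linear trend
```

In the last case y depends entirely on x, yet r is zero, because r only measures the linear part of a relationship.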

Correlation

•  Assesses the linear relationship between two variables
•  Example: height and weight

•  Strength of the association is described by a correlation coefficient, r

•  r = 0 - .2: low, probably meaningless
•  r = .2 - .4: low, possible importance
•  r = .4 - .6: moderate correlation
•  r = .6 - .8: high correlation
•  r = .8 - 1: very high correlation

•  Can be positive or negative
•  Pearson’s and Spearman correlation coefficients
•  Tells nothing about causation
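The verbal bands above can be written as a small lookup. The slide's ranges share their endpoints (for example, .2 appears in two bands), so this sketch assumes each band includes its lower bound; `describe_strength` is a hypothetical helper.

```python
def describe_strength(r):
    """Map a correlation coefficient to the slide's verbal strength bands."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)   # the sign gives direction, not strength
    if size < 0.2:
        label = "low, probably meaningless"
    elif size < 0.4:
        label = "low, possible importance"
    elif size < 0.6:
        label = "moderate"
    elif size < 0.8:
        label = "high"
    else:
        label = "very high"
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return label, direction


for r in (0.16, 0.796, -0.98):
    print(r, describe_strength(r))
```

These labels are rules of thumb; what counts as a meaningful correlation depends on the domain and the sample size.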

Figure panels: perfect positive correlation (r = 1); strong positive correlation (r = 0.99); positive correlation (r = 0.80); strong negative correlation (r = -0.98); no correlation (r = 0.16); non-linear relationship

Meanings

r² represents the proportion of the variation in y that is explained by the linear relationship between x and y.

Example: Using the heights and weights for a group of people, you find the correlation coefficient to be:

r = 0.796, so r² = 0.634.

So we conclude that about 63.4% of the variation in people's weights can be explained by the relationship between height and weight. This suggests that 36.6% of the variation in weights cannot be explained by height.
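The "explained variation" reading can be checked numerically: fit a least-squares line, then compare the leftover (residual) variation with the total variation in y. The heights and weights below are made up and are not the data behind r = 0.796; the helper names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def explained_fraction(xs, ys):
    """1 - (unexplained variation / total variation) for the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)
    residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1 - residual / total


# Made-up heights (cm) and weights (kg), for illustration only.
heights = [150, 160, 165, 170, 175, 180, 185]
weights = [52, 58, 63, 61, 72, 70, 82]
r_squared = explained_fraction(heights, weights)
print(f"explained fraction of weight variation: {r_squared:.3f}")

# The slide's arithmetic: squaring r = 0.796 gives roughly 0.634.
slide_r_squared = 0.796 ** 2
print(f"0.796 squared: {slide_r_squared:.3f}")
```

For a simple least-squares line this explained fraction equals the square of the Pearson coefficient.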

r² in Tableau

Example: Relationship between Tree Circumference and Height

Figure: two scatterplots titled "Relationship between Tree Circumference and Height"; x-axis: Circumference (ft)

Outliers can strongly influence the graph of the regression line and inflate the correlation coefficient. In the above example, removing the outlier drops the correlation coefficient from r = 0.828 to r = 0.678.
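The effect of a single outlier on r can be reproduced on a toy data set. The values below are made up (they are not the tree measurements behind the figure), with one extreme point that happens to lie along the trend.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient of paired samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up circumferences (ft) and heights (ft); the last pair is the outlier.
circumference = [2, 3, 3, 4, 5, 5, 6, 15]
height = [30, 42, 28, 45, 33, 50, 40, 95]

r_with = pearson_r(circumference, height)
r_without = pearson_r(circumference[:-1], height[:-1])

print(f"with outlier:    r = {r_with:.3f}")
print(f"without outlier: r = {r_without:.3f}")
```

One far-away point dominates both the covariance and the variances, so it can make a loose cloud look strongly correlated.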

Correlation

Source: Altman. Practical Statistics for Medical Research

Figure panels: Correlation Coefficient 0; Correlation Coefficient .3

Correlation

Source: Altman. Practical Statistics for Medical Research

Figure panels: Correlation Coefficient -.5; Correlation Coefficient .7

Summary

•  Statistical models serve to inspect and categorise the nature of trends and relations between variables and factors (effects)

•  Distribution is a critical element in deciding what statistical measures to use; it should be the lens through which you determine the appropriate metric

•  “Eyeballing” your distribution is a first step in forming your next queries

Four sets of data with the same correlation of 0.816
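A correlation of 0.816 shared by four data sets matches the well-known Anscombe's quartet. Assuming that is what the slide shows (the transcript does not name it), the sketch below computes r for each of the four published sets.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient of paired samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Anscombe's quartet (assumed to be the data on the slide).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in correlations.items():
    print(f"set {name}: r = {r:.3f}")
```

The coefficients agree to three decimals even though the scatterplots look nothing alike, which is the case for plotting the data before trusting a summary statistic.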

Figure (Sheet 7): Average of user_score vs. average of actual_score. Color shows details about action_name. Axes: Avg. user_score, Avg. actual_score. Legend (action_name): Dry Clothes, Dry Your Hair, Fridge Setting, Heat Milk, Lighting, Make Fried Eggs, Make Toast, Play Games, Room Temperature Control, Turn Off Unused Appliances, Unplug Unused Appliances, Wash Clothes, Wash Dishes, Wash Yourself, Watch Movies

Figure (Sheet 8): Average of actual_score for each display. Color shows details about action_name. Axes: display (chart, treewithberry, treewithoutber..), Avg. actual_score. Legend (action_name): Dry Clothes, Dry Your Hair, Fridge Setting, Heat Milk, Lighting, Make Fried Eggs, Make Toast, Play Games, Room Temperature Control, Turn Off Unused Appliances, Unplug Unused Appliances, Wash Clothes, Wash Dishes, Wash Yourself, Watch Movies
