Exploratory Data Analysis. Height and Weight 1.Data checking, identifying problems and...


Page 1: Exploratory Data Analysis. Height and Weight 1.Data checking, identifying problems and characteristics Data exploration and Statistical analysis.

Exploratory Data Analysis

Height and Weight

[Figure: scatterplots of Weight (40 to 100) against Height (140 to 200), with points marked Male and Female]


1. Data checking, identifying problems and characteristics

Data exploration and Statistical analysis

Data exploration, categorical / numerical outcomes

Analysing a set of data

• Look at the data (initial checks on the data)
  • Downloading data, formatting, data collection, discrepant data, missing data
• Visualise the data (exploratory data analysis)
  • Descriptive statistics, informative tables, well-constructed figures
• Analyse the data (definitive analysis)
  • Formal statistical analysis
  • Quantify any interesting results
• Report the findings
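As a rough illustration of the first step (initial checks on the data), the sketch below counts missing entries per column. The records, the column names and the use of None to mark a missing entry are all assumptions made for this example; they are not taken from the course data.

```python
# Hypothetical records; None marks a missing entry (an assumption of this sketch).
records = [
    {"height": 172.0, "weight": 65.0, "gender": "M"},
    {"height": None,  "weight": 58.5, "gender": "F"},
    {"height": 181.5, "weight": None, "gender": "M"},
    {"height": 160.0, "weight": 52.0, "gender": None},
    {"height": None,  "weight": 70.0, "gender": "F"},
]

def count_missing(rows):
    """Return a dict mapping each column name to its number of missing entries."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is None:
                counts[column] += 1
    return counts

missing = count_missing(records)
for column, n in missing.items():
    print(f"{column}: {n} missing out of {len(records)}")
```

A count like this only locates the gaps; whether to drop, recollect or keep the affected records is a separate decision.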

Types of Variables

• Often, the test to use depends on the type of variable at hand
• Two main classes of variables:
  • Categorical
  • Numerical
• Categorical variables are further divided into two sub-classes:
  • Nominal categorical (example: gender, ethnic groups)
  • Ordinal categorical (example: size of a car, quality of teaching)

Numerical variables

• Distinguish between discrete and continuous numerical variables

Discrete

• Integer values (number of male subjects, number of episodes of flu outbreaks)

Continuous

• Takes a whole range of values (height, weight)
• Some continuous variables are treated as discrete (age)


Exploratory Data Analysis

EDA

• Tabular EDA
  • Univariate tables, cross-tabulation of categorical variables
• Numerical EDA
  • Location, spread, skewness, covariance and correlation
• Graphical EDA
  • Frequency plots, histograms, boxplots, scatterplots

The precise form of EDA depends on the data at hand.


Tabular EDA

• Useful for summarising categorical data.

For example, the following table shows the classification of 2,555 students from three schools in a study on the GCSE O-level results in Mathematics:

                  Dunman High   HCI   RI / RGS   Total
No. of students   6             408   1496       1910

Small counts are problematic in categorical data analysis
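A minimal sketch of how such a one-way table can be turned into percentages, using the counts shown in the table. The cut-off of 10 for flagging a small count is an arbitrary choice made for this example, not a rule from the slides.

```python
# Counts taken from the table of students per school.
counts = {"Dunman High": 6, "HCI": 408, "RI / RGS": 1496}

total = sum(counts.values())  # agrees with the Total column
percentages = {school: 100 * n / total for school, n in counts.items()}

for school, n in counts.items():
    print(f"{school:12s} {n:5d} {percentages[school]:6.1f}%")

# Flag categories with small counts; the cut-off of 10 is an assumption.
small = [school for school, n in counts.items() if n < 10]
print("Small counts:", small)
```

Reporting percentages next to the raw counts keeps a category such as Dunman High visible as both a tiny share and a tiny absolute number.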

Tabular EDA

• For two categorical variables, e.g. the distribution of A, B and other grades between two schools

[Table: School (Dunman High, HCI) by grade (A, B, Others)]

Question: It appears that Dunman High has proportionally more students scoring A/B grades than HCI. Does this mean anything?
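To make "proportionally more" concrete, the sketch below converts a cross-tabulation into row proportions. The grade counts are hypothetical (the original table values are not available here); only the row totals of 6 and 408 echo the earlier one-way table.

```python
# Hypothetical grade counts (illustrative only; not the study's data).
table = {
    "Dunman High": {"A": 3, "B": 2, "Others": 1},
    "HCI": {"A": 150, "B": 120, "Others": 138},
}

def row_proportions(crosstab):
    """Convert each row of counts into proportions that sum to 1."""
    result = {}
    for row_name, row in crosstab.items():
        row_total = sum(row.values())
        result[row_name] = {col: n / row_total for col, n in row.items()}
    return result

props = row_proportions(table)
for school, row in props.items():
    share_ab = row["A"] + row["B"]
    print(f"{school}: proportion scoring A or B = {share_ab:.2f}")
```

With made-up numbers like these, a higher proportion in the smaller school rests on very few students, which is why small counts call for caution before reading anything into the difference.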

Numerical EDA

• Calculating informative numbers which summarise the dataset
• Which numbers are useful for describing the age of 1,059 individuals with diabetes?

[Figure: histogram of AGE (20 to 80), with the mean age (54.6 years) marked]

• Location parameters (mean, median, mode)
• Spread (range, standard deviation, interquartile range)
• Skewness
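The summaries listed above can be computed in a few lines. This is a sketch on hypothetical ages (not the diabetes dataset), and the skewness function uses one common moment-based definition; statistical packages may apply a slightly different small-sample correction.

```python
import statistics

# Hypothetical ages in years (illustrative only).
ages = [34, 41, 47, 52, 55, 58, 60, 63, 67, 78]

mean = statistics.mean(ages)
median = statistics.median(ages)
sd = statistics.stdev(ages)            # sample standard deviation
data_range = max(ages) - min(ages)

def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed
    population standard deviation. Positive means a longer right tail."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

skew = sample_skewness(ages)
print(f"mean={mean:.1f} median={median} sd={sd:.1f} range={data_range} skew={skew:.2f}")
```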

Numerical EDA

Skewness

[Figure: "Right Skew" frequency plot (x-axis: Observation, 0 to 15; y-axis: Frequency), with the mean and median marked]

Normal distribution

[Figure: normal curve for exam marks in a Mathematics exam, axis from 40 to 80]

About 68% of the probability lies within 1 standard deviation (SD) of the mean

About 95% of the probability lies within 2 SDs of the mean
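The 68% and 95% figures can be checked numerically. The sketch below uses Python's statistics.NormalDist; the exam-mark mean of 60 and SD of 10 are assumed values for illustration, not parameters given in the slides.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

within_1_sd = z.cdf(1) - z.cdf(-1)   # about 0.683
within_2_sd = z.cdf(2) - z.cdf(-2)   # about 0.954

print(f"within 1 SD: {within_1_sd:.3f}")
print(f"within 2 SDs: {within_2_sd:.3f}")

# The same proportions hold for any mean and SD, e.g. hypothetical exam
# marks with mean 60 and SD 10 (values assumed for illustration).
marks = NormalDist(mu=60, sigma=10)
within_1_sd_marks = marks.cdf(70) - marks.cdf(50)
print(f"marks between 50 and 70: {within_1_sd_marks:.3f}")
```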

Numerical EDA

• Sample quartiles
  Q1: 25th percentile (the value below which 25% of the ranked data lie)
  Q2: 50th percentile (also known as the median of the data)
  Q3: 75th percentile (the value below which 75% of the ranked data lie)

Consider the heights of 1000 people, and rank these heights from shortest to tallest.

[Figure: ranked heights with Q1, Q2 and Q3 marked]
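A small sketch of the ranking idea: sort the values, then read off the three cut points. The heights are hypothetical, and the "inclusive" interpolation method is just one of several quartile conventions; other software may report slightly different values.

```python
import statistics

# Hypothetical heights in cm, deliberately unsorted (illustrative only).
heights = [171, 158, 183, 165, 176, 169, 190, 162, 174, 180, 167]

ranked = sorted(heights)

# statistics.quantiles with n=4 returns the three cut points Q1, Q2, Q3.
# method="inclusive" treats the data as the whole population of interest;
# other conventions give slightly different quartiles.
q1, q2, q3 = statistics.quantiles(ranked, n=4, method="inclusive")

print("ranked:", ranked)
print(f"Q1={q1} Q2={q2} Q3={q3}")
```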

Location and spread

• When the mean is used as the location parameter, the standard deviation is the appropriate measure of spread
• When the median is used as the location parameter, the corresponding measure of spread is the interquartile range
• Interquartile range (IQR): IQR = Q3 – Q1
• Minimum, Maximum of data (seldom used to quantify spread, but more for data QC)

Numerical EDA

• Numbers can be informative for identifying potential problems with the data

Example: Suppose the heights of 1,496 individuals randomly sampled from the population produce the following summary

IQR = Q3 – Q1 = 188 – 172 = 16

Range = Max – Min = 201 – 0 = 201

A minimum height of 0 is not plausible, so the range flags a potential problem with the data.
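A check of this kind is easy to automate. In the sketch below the heights are hypothetical, and the plausibility limits of 50 and 250 are assumptions chosen for illustration rather than any standard.

```python
# Hypothetical heights in cm; the 0 mimics a data-entry problem (illustrative only).
heights = [172, 188, 0, 165, 201, 179, 183]

# Plausibility limits are an assumption of this sketch, not a standard.
LOWER, UPPER = 50, 250

data_range = max(heights) - min(heights)
suspect = [(i, h) for i, h in enumerate(heights) if not LOWER <= h <= UPPER]

print("range:", data_range)
print("suspect (index, value):", suspect)
```

Keeping the index alongside each flagged value makes it possible to trace the record back to its source before deciding what to do with it.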

Correlation

• Two numerical variables: height and weight

Questions

• Is there any relationship between these variables?
• If there is, how do we quantify this relationship?
• Covariance and correlation measure the degree of association between two numerical variables.

Covariance and Correlation

• Covariance is scale-dependent, and correlation is unit-free.
• More intuitive to interpret correlation than covariance.

Example: Covariance for height and weight is 2.4 when assessed using metres and kilograms, but 240,000 when assessed using centimetres and grams. Correlation stays at 0.83 in both scenarios.

• Correlation is unit-free, and always bounded between -1 and 1 inclusive.
• Useful for investigating relationships between variables (e.g. weight and height)
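The unit dependence can be demonstrated directly. The sketch below uses hypothetical measurements (so the numbers differ from the 2.4 and 0.83 quoted above) and hand-written helper functions; converting metres to centimetres and kilograms to grams multiplies the covariance by 100 × 1000 while leaving the correlation unchanged.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical measurements (illustrative only).
height_m = [1.55, 1.62, 1.70, 1.76, 1.84]
weight_kg = [50.0, 61.0, 63.0, 75.0, 80.0]

# Same people, different units.
height_cm = [h * 100 for h in height_m]
weight_g = [w * 1000 for w in weight_kg]

cov_m_kg = covariance(height_m, weight_kg)
cov_cm_g = covariance(height_cm, weight_g)
r_m_kg = correlation(height_m, weight_kg)
r_cm_g = correlation(height_cm, weight_g)

print(f"covariance: {cov_m_kg:.3f} (m, kg) vs {cov_cm_g:.1f} (cm, g)")
print(f"correlation: {r_m_kg:.3f} (m, kg) vs {r_cm_g:.3f} (cm, g)")
```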


Example

Graphical EDA

• Visual summaries of the data
• Flagging outliers, spotting obvious relationships, checking the distribution

Boxplots

• Univariate boxplot: for 1 numerical variable

Ends of box: Q1 and Q3

Length of box: IQR

White line: Sample median

Whiskers: extend up to 1.5 times IQR beyond the ends of the box

Lines outside whiskers: Outliers

Circles: Extreme outliers
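The numbers behind a boxplot can be sketched as follows, under one common convention: whiskers stop at the most extreme data points within 1.5 times IQR of the box, and anything beyond is plotted individually. The data are hypothetical, and plotting software may use a different quartile method or a further threshold for extreme outliers.

```python
import statistics

# Hypothetical measurements (illustrative only).
values = [12, 14, 15, 15, 16, 17, 18, 19, 21, 35]

q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; points outside are drawn individually.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1={q1} Q3={q3} IQR={iqr}")
print(f"whiskers reach {whisker_low} and {whisker_high}; outliers: {outliers}")
```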


Boxplots

• Multivariate boxplots: for 1 numerical variable across different levels of a categorical variable

• Graphical comparison

Scatterplots

• Graphical representation for 2 numerical variables

[Figure: scatterplot of Weight (40 to 100) against Height (140 to 200), with points marked Male and Female]

Scatterplots

[Figure: five scatterplots of y against x, titled "Perfect Positive Correlation", "Perfect Negative Correlation", "Correlation = 0.8", "Correlation = -0.3" and "Correlation = 0.0"]


Exploratory Data Analysis in RExcel and SPSS

Comparing height of children

• Height data for 30 children, from 3 groups
• Interest is in comparing the height of children between groups
• Useful (and not useful!) data exploration


Coding numerical variables as factors

Retain the numbers as categories, or define new names for the categories

Note the deliberate mistake here! Always know your variables well!


Stratified analysis by group

Click on this to define the variable that contains the grouping information for stratification


Boxplots

Choose this to produce separate boxplots for the three groups (stratified analysis)

[Figure: annotated boxplot labelling the minimum, 1st quartile, median, 3rd quartile and maximum, the interquartile range, and two segments each marked 25%]

An excellent way to observe graphical/preliminary evidence of any differences between the groups!

No conclusion can be drawn if the boxes overlap. Only when two (or more) boxes do not overlap can we say there is graphical evidence of a difference between the groups.
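The overlap rule can be written down as a simple interval check on the boxes (Q1 to Q3). The sketch below uses hypothetical heights for three groups of 10 children and the "inclusive" quartile method; both are assumptions for illustration, and the result is graphical evidence only, not a formal test.

```python
import statistics

# Hypothetical heights in cm for three groups of 10 children (illustrative only).
groups = {
    "group 1": [110, 112, 113, 115, 116, 117, 118, 120, 121, 123],
    "group 2": [114, 116, 117, 118, 119, 121, 122, 123, 125, 127],
    "group 3": [128, 130, 131, 132, 134, 135, 136, 138, 139, 141],
}

def box(values):
    """Return (Q1, Q3), the ends of the box in a boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

def boxes_overlap(a, b):
    """True if the (Q1, Q3) intervals of the two samples share any point."""
    a_q1, a_q3 = box(a)
    b_q1, b_q3 = box(b)
    return a_q1 <= b_q3 and b_q1 <= a_q3

names = list(groups)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        overlap = boxes_overlap(groups[first], groups[second])
        print(f"{first} vs {second}: boxes overlap = {overlap}")
```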


What about SPSS?


Never choose this when plotting a histogram to get a gauge of the distribution of the dataset


To perform a stratified analysis, place the grouping variable under Factor List


Default is Stem-and-leaf, remember to change it to Histogram

Check this to perform a quantitative test for normality


Numerical summaries

Tests of Normality

                  Kolmogorov-Smirnov(a)         Shapiro-Wilk
        group     Statistic   df   Sig.         Statistic   df   Sig.
height  1.00      .182        10   .200*        .899        10   .211
        2.00      .176        10   .200*        .948        10   .648
        3.00      .134        10   .200*        .955        10   .730

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Statistical test for departure from normality

Statistical evidence, known as significance levels or P-values

For the time being:
- If values are > 0.05, interpret as normality assumption is valid;
- If values are < 0.05, interpret as normality assumption is not valid, and the variable does not follow a normal distribution.
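Applied to the Shapiro-Wilk significance values reported in the Tests of Normality table, the rule of thumb looks like this. The function name and group labels are made up for this sketch; note that a value above 0.05 indicates no evidence against normality rather than proof of it.

```python
# Shapiro-Wilk significance values (Sig.) as reported in the Tests of Normality table.
shapiro_p = {"group 1": 0.211, "group 2": 0.648, "group 3": 0.730}

ALPHA = 0.05  # threshold used in the rule of thumb

def normality_plausible(p_value, alpha=ALPHA):
    """Return True when the P-value gives no evidence against normality."""
    return p_value > alpha

decisions = {group: normality_plausible(p) for group, p in shapiro_p.items()}
for group, ok in decisions.items():
    verdict = "normality assumption retained" if ok else "evidence of non-normality"
    print(f"{group}: p = {shapiro_p[group]:.3f} -> {verdict}")
```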


There really isn’t much difference between RExcel and SPSS


Data checking

Omega 3 consumption and mathematical abilities

• Students from 3 schools participated in a study to assess the effects of omega 3 on mathematical abilities
• For each student, there is information on:
  • school
  • gender
  • marks before
  • marks after
  • daily omega 3 consumption (mg)


Zeroes are important to take note of, but how do we decide whether they are plausible values or problematic values?


For flagging outliers in boxplots

Now we need to exclude these datapoints

Students should be able to

• realise that data exploration prior to formal statistical analysis is important
• know what to look out for in data checking of categorical variables
• know what to look out for in data checking of numerical variables
• understand the use of frequencies (percentages) for categorical data summary
• understand which location and variability metrics to use for numerical data
• understand the use and interpretation of histograms
• interpret boxplots, for variable summary and for graphical comparisons
• know the usage and interpretation of scatterplots
• perform data entry in RExcel and SPSS
• perform exploratory data analysis in RExcel and SPSS
• identify and remove problematic data in RExcel and SPSS
• generate useful summary tables and figures for a dataset in investigating research hypotheses
• interpret generated summary tables and figures of a dataset for investigating research hypotheses