Exploratory Data Analysis. Height and Weight 1.Data checking, identifying problems and...
deborah-welch -
Exploratory Data Analysis
[Figure: scatterplot of Weight against Height, points marked by gender (Male, Female); Height axis 140 to 200, Weight axis 40 to 100]
Height and Weight
[Figure: scatterplot of Weight against Height, points marked by gender (Male, Female); Height axis 140 to 200, Weight axis 40 to 100]
1. Data checking, identifying problems and characteristics
Data exploration and Statistical analysis
Data: data exploration, categorical / numerical outcomes
Analyzing a set of data
• Look at the data (initial checks on the data)
• Downloading data, formatting, data collection, discrepant data, missing data
• Visualize the data (exploratory data analysis)
• Descriptive statistics, informative tables, well-constructed figures
• Analyse the data (definitive analysis)
• Formal statistical analysis
• Quantify any interesting results
• Report the findings
Types of Variables
• Often, the test to use depends on the type of variable at hand
• Two main classes of variables:
• Categorical
• Numerical
• Categorical variables are further divided into two sub-classes:
• Nominal categorical (example: gender, ethnic groups)
• Ordinal categorical (example: size of a car, quality of teaching)
Numerical variables
• Distinguish between discrete and continuous numerical variables
Discrete
• Integer values (number of male subjects, number of episodes of flu outbreaks)
Continuous
• Can take any value within a range (height, weight)
• Continuous variables are sometimes treated as discrete (age)
Exploratory Data Analysis
EDA
• Tabular EDA
• Univariate tables, cross-tabulation of categorical variables
• Numerical EDA
• Location, spread, skewness, covariance and correlation
• Graphical EDA
• Frequency plots, histograms, boxplots, scatterplots
The precise form of EDA depends on the data at hand.
Tabular EDA
• Useful for summarising categorical data.
For example, the following table shows the classification of 2,555 students from three schools in a study on the GCSE O-level results in Mathematics:
                   Dunman High    HCI    RI / RGS    Total
No. of students              6    408        1496     1910
Small counts are problematic in categorical data analysis
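As a minimal sketch of a univariate frequency table, the Python snippet below turns the three school counts shown in the table (which sum to 1,910) into percentages. The function name `frequency_table` is an illustrative choice, not something from the slides. The very small share for Dunman High is the kind of small count the slide warns about.

```python
# Counts copied from the table above.
counts = {"Dunman High": 6, "HCI": 408, "RI / RGS": 1496}


def frequency_table(counts):
    """Return (category, count, percentage) rows, followed by a total row."""
    total = sum(counts.values())
    rows = [(name, n, 100.0 * n / total) for name, n in counts.items()]
    rows.append(("Total", total, 100.0))
    return rows


table = frequency_table(counts)
for name, n, pct in table:
    # Left-align the label, right-align the count and percentage.
    print(f"{name:<12} {n:>5} {pct:6.1f}%")
```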
Tabular EDA
• For two categorical variables: e.g. the distribution of the A, B and Others grades between two schools
Question: It appears that Dunman High has proportionally more students scoring A/B grades than HCI. Does this mean anything?
School         A    B    Others
Dunman High
HCI
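Comparing schools "proportionally" means comparing row proportions rather than raw counts. The sketch below shows that calculation in Python. The school names and counts are hypothetical placeholders chosen for illustration only; they are not the values from the slide.

```python
# Hypothetical cross-tabulation (school x grade); NOT the slide's data.
grades = {
    "School X": {"A": 30, "B": 50, "Others": 20},
    "School Y": {"A": 20, "B": 40, "Others": 40},
}


def row_proportions(table):
    """Convert each row of counts into proportions that sum to 1."""
    result = {}
    for row_name, cells in table.items():
        row_total = sum(cells.values())
        result[row_name] = {k: v / row_total for k, v in cells.items()}
    return result


props = row_proportions(grades)
for school, cells in props.items():
    ab_share = cells["A"] + cells["B"]
    print(f"{school}: A/B share = {ab_share:.0%}")
```

Whether a difference in these proportions "means anything" is a question for the formal analysis step, not for the table alone.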
Numerical EDA
• Calculating informative numbers which summarise the dataset
• Which numbers are useful for describing the ages of 1,059 individuals with diabetes?
[Figure: histogram of AGE, axis from 20 to 80]
• Location parameters (mean, median, mode)
Mean age (54.6 years)
• Spread (range, standard deviation, interquartile range)
• Skewness
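The three kinds of summary listed above can be computed with Python's standard library. This is a sketch under stated assumptions: the `ages` list is hypothetical (it is not the diabetes dataset), and skewness is computed with the simple moment-based formula m3 / m2^1.5, which is one of several definitions in use.

```python
import statistics

# Hypothetical ages in years; NOT the 1,059-person dataset from the slide.
ages = [23, 35, 41, 47, 52, 55, 58, 61, 64, 70, 78]


def sample_skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


summary = {
    "mean": statistics.mean(ages),        # location
    "median": statistics.median(ages),    # location
    "sd": statistics.stdev(ages),         # spread (sample standard deviation)
    "range": max(ages) - min(ages),       # spread
    "skewness": sample_skewness(ages),    # > 0 right skew, < 0 left skew
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```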
Skewness
[Figure: histogram of a right-skewed distribution (Right Skew), Frequency against Observation (0 to 15), with the mean and median marked]
Numerical EDA
Normal distribution
[Figure: normal curve of exam marks for a Mathematics exam, axis from 40 to 80]
About 68% of the probability lies within 1 standard deviation of the mean
About 95% of the probability lies within 2 standard deviations of the mean
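The 68% and 95% figures can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the mean of 60 and standard deviation of 10 used in the last line are assumed values for illustration, not taken from the slide.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable lies within k SDs of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


within_1sd = prob_within(1)
within_2sd = prob_within(2)
print(f"within 1 SD: {within_1sd:.4f}")
print(f"within 2 SD: {within_2sd:.4f}")

# The result does not depend on the mean or SD chosen (assumed: 60 and 10).
print(f"exam marks, 50 to 70: {prob_within(1, mu=60, sigma=10):.4f}")
```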
Numerical EDA
• Sample quartiles
Q1: 25th percentile (the value one quarter of the way through the ranked data)
Q2: 50th percentile (also known as the median of the data)
Q3: 75th percentile (the value three quarters of the way through the ranked data)
Example: consider the heights of 1000 people, ranked from shortest to tallest; Q1, Q2 and Q3 split the ranked heights into four equal-sized groups.
Location and spread
• When the mean is used as the location parameter, the standard deviation is the appropriate measure of spread
• When the median is used as the location parameter, the corresponding measure of spread is the interquartile range
• Interquartile range (IQR): IQR = Q3 – Q1
• Minimum and maximum of the data (seldom used to quantify spread, more for data QC)
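A short sketch of the quartile and IQR calculation, using `statistics.quantiles` from the Python standard library. The `heights` values are hypothetical. Statistical packages use slightly different quartile conventions, so RExcel or SPSS may report marginally different values for the same data; the `method="inclusive"` choice here is just one of them.

```python
import statistics

# Hypothetical heights; values chosen only for illustration.
heights = [150, 155, 160, 162, 165, 168, 170, 172, 175, 180, 185]

# n=4 splits the ranked data into four groups and returns the three cut points.
q1, q2, q3 = statistics.quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
print(f"IQR = Q3 - Q1 = {iqr}")
print(f"Min = {min(heights)}, Max = {max(heights)}")
```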
Numerical EDA
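• Numerical summaries can help identify potential problems with the data
Example: Suppose the heights of 1,496 individuals randomly sampled from the population produce the following summary: Q1 = 172, Q3 = 188, Min = 0, Max = 201
IQR = Q3 – Q1 = 188 – 172 = 16
Range = Max – Min = 201 – 0 = 201
A minimum height of 0 is not plausible, so the range flags a problem with the data that the IQR alone would hide.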
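The check in this example can be sketched as a simple range test. The snippet assumes heights are recorded in centimetres, and both the sample records and the plausibility limits (50 to 250) are hypothetical choices made for illustration; suitable limits depend on the population being studied.

```python
def flag_implausible(values, low, high):
    """Return (index, value) pairs that fall outside the range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


# Hypothetical records; the 0 mimics the minimum reported in the summary.
heights_cm = [172, 0, 188, 201, 165, 179]

# Assumed plausibility limits for adult height in cm (illustrative only).
flags = flag_implausible(heights_cm, low=50, high=250)

# Summary arithmetic from the slide.
iqr = 188 - 172
data_range = 201 - 0

print(f"IQR = {iqr}, Range = {data_range}")
print(f"Records to check: {flags}")
```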
Correlation
• Two numerical variables: height and weight
Questions
• Is there any relationship between these variables?
• If there is, how do we quantify this relationship?
• Covariance and correlation measure the degree of association between two numerical variables.
Covariance and Correlation
• Covariance is scale-dependent, and correlation is unit-free.
• More intuitive to interpret correlation than covariance.
Example: The covariance of height and weight is 2.4 when measured in metres and kilograms, but 240,000 when measured in centimetres and grams (2.4 × 100 × 1,000). The correlation is 0.83 in both scenarios.
• Correlation is unit-free, and always bounded between -1 and 1 inclusive.
• Useful for investigating relationships between variables, (e.g. weight and height)
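The unit dependence of covariance can be seen directly in a small calculation. In this sketch the height and weight values are hypothetical, and `covariance` and `correlation` are hand-written helpers using the usual sample formulas (divisor n - 1), not calls to any particular package.

```python
def covariance(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation: covariance divided by the two standard deviations."""
    return covariance(x, y) / (covariance(x, x) ** 0.5 * covariance(y, y) ** 0.5)


# Hypothetical measurements in metres and kilograms.
height_m = [1.55, 1.62, 1.70, 1.78, 1.85]
weight_kg = [50.0, 58.0, 63.0, 72.0, 80.0]

# The same measurements in centimetres and grams.
height_cm = [h * 100 for h in height_m]
weight_g = [w * 1000 for w in weight_kg]

cov_m_kg = covariance(height_m, weight_kg)
cov_cm_g = covariance(height_cm, weight_g)
r_m_kg = correlation(height_m, weight_kg)
r_cm_g = correlation(height_cm, weight_g)

print(f"covariance ratio: {cov_cm_g / cov_m_kg:.1f}")  # 100 x 1000
print(f"correlation (m, kg): {r_m_kg:.4f}")
print(f"correlation (cm, g): {r_cm_g:.4f}")
```

Changing units rescales the covariance by the product of the two conversion factors, while the correlation is unchanged.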
Example
Graphical EDA
• Visual summaries of the data
• Flagging outliers, spotting obvious relationships, checking distributions
Boxplots
• Univariate boxplot: for 1 numerical variable
Ends of box: Q1 and Q3
Length of box: IQR
White line: Sample median
Whiskers: extend to the most extreme observations within 1.5 times the IQR from the ends of the box
Lines outside whiskers: Outliers
Circles: Extreme outliers
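The boxplot rules above can be sketched as fences computed from the quartiles. The 1.5 × IQR fence follows the slide; treating points beyond 3 × IQR as "extreme" is an assumption based on a common software convention, and the data values are hypothetical.

```python
import statistics


def fences(values, k):
    """Lower and upper fences placed k * IQR beyond the ends of the box."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify_outliers(values):
    """Split values into outliers and extreme outliers.

    Assumption: extreme outliers lie beyond 3 * IQR from the box.
    """
    low, high = fences(values, 1.5)
    far_low, far_high = fences(values, 3.0)
    mild = [v for v in values if far_low <= v < low or high < v <= far_high]
    extreme = [v for v in values if v < far_low or v > far_high]
    return mild, extreme


# Hypothetical data with two unusually large values.
data = [150, 155, 160, 162, 165, 168, 170, 172, 175, 180, 185, 210, 260]
mild, extreme = classify_outliers(data)
print(f"fences at 1.5 x IQR: {fences(data, 1.5)}")
print(f"outliers: {mild}, extreme outliers: {extreme}")
```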
Boxplots
• Multivariate boxplots: for 1 numerical variable across different levels of a categorical variable
• Graphical comparison
Scatterplots
• Graphical representation for 2 numerical variables
[Figure: scatterplot of Weight against Height, points marked by gender (Male, Female); Height axis 140 to 200, Weight axis 40 to 100]
Scatterplots
[Figure: four scatterplots of y against x, titled Perfect Positive Correlation, Perfect Negative Correlation, Correlation = 0.8 and Correlation = -0.3]
Scatterplots
[Figure: scatterplot of y against x, titled Correlation = 0.0]
Exploratory Data Analysis in RExcel and SPSS
Comparing height of children
• Height data for 30 children, from 3 groups
• The interest is in comparing the height of children between groups
• Useful (and not useful!) data exploration
Coding numerical variables as factors
Retain the numbers as category labels, or define new names for the categories
Note the deliberate mistake here! Always know your variables well!
Stratified analysis by group
Click on this to define the variable that contains the grouping information for stratification
Boxplots
Choose this to produce separate boxplots for the three groups (stratified analysis)
[Figure: annotated boxplot marking the minimum, 1st quartile, median, 3rd quartile, maximum and interquartile range, with two 25% portions of the data labelled]
An excellent way to observe graphical/preliminary evidence of any differences between the groups!
No comments can be made if the boxes overlap. Only when two boxes (or more) do not overlap can we say there is graphical evidence of a difference between the two (or more) groups
What about SPSS?
Never choose this when plotting a histogram to get a gauge of the distribution of the dataset
To perform a stratified analysis, place the grouping variable under Factor List
Default is Stem-and-leaf, remember to change it to Histogram
Check this to perform a quantitative test for normality
Numerical summaries
Tests of Normality (height)
              Kolmogorov-Smirnov(a)          Shapiro-Wilk
group     Statistic    df    Sig.        Statistic    df    Sig.
1.00        .182       10    .200*         .899       10    .211
2.00        .176       10    .200*         .948       10    .648
3.00        .134       10    .200*         .955       10    .730
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
Statistical test for departure from normality
Statistical evidence, known as significance levels or P-values
For the time being:
- If values are > 0.05, interpret as the normality assumption being valid;
- If values are < 0.05, interpret as the normality assumption not being valid: the variable does not follow a normal distribution.
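The rule of thumb above can be applied mechanically to the Shapiro-Wilk Sig. values from the Tests of Normality output. This is a sketch of the interpretation step only: the function name and the wording of the verdicts are illustrative, and the test statistics themselves are taken as given rather than recomputed.

```python
ALPHA = 0.05  # threshold used in the rule of thumb

# Shapiro-Wilk Sig. values for height, by group, from the SPSS output.
shapiro_wilk_sig = {"group 1": 0.211, "group 2": 0.648, "group 3": 0.730}


def normality_verdict(p_value, alpha=ALPHA):
    """Apply the rule of thumb: small P-values count against normality."""
    if p_value < alpha:
        return "evidence against normality"
    return "no evidence against normality"


verdicts = {group: normality_verdict(p) for group, p in shapiro_wilk_sig.items()}
for group, verdict in verdicts.items():
    print(f"{group}: Sig. = {shapiro_wilk_sig[group]:.3f} -> {verdict}")
```

All three groups have Sig. values above 0.05, so under this rule the normality assumption is treated as acceptable for each group.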
There really isn’t much difference between RExcel and SPSS
Data checking
Omega 3 consumption and mathematical abilities
• Students from 3 schools participated in a study to assess the effects of omega 3 on mathematical abilities
• For each student, there is information on:
• school
• gender
• marks before
• marks after
• daily omega 3 consumption (mg)
Zeroes are important to take note of, but how do we decide whether they are plausible values or problematic values?
For flagging outliers in boxplots
Now we need to exclude these datapoints
Students should be able to
• realise that data exploration prior to formal statistical analysis is important;
• know what to look out for in data checking of categorical variables
• know what to look out for in data checking of numerical variables
• understand the use of frequencies (percentages) for categorical data summary
• understand which location and variability metrics to use for numerical data
• understand the use and interpretation of histograms
• interpret boxplots, for variable summary and for graphical comparisons
• know the usage and interpretation of scatterplots
• perform data entry in RExcel and SPSS
• perform exploratory data analysis in RExcel and SPSS
• identify and remove problematic data in RExcel and SPSS
• generate useful summary tables and figures for a dataset in investigating research hypotheses
• interpret generated summary tables and figures of a dataset for investigating research hypotheses