7/29/2019 EDA in the Data Analysis Process
SADC Course in Statistics
Exploratory Data Analysis (EDA) in the data analysis process
Module B2 Session 13
Learning Objectives
Students should be able to:
Construct a dot plot for a numeric variable split by a categorical variable
Apply EDA concepts to a large dataset
Explain the use of Excel's pivot tables and filters in the EDA process
Explain the importance of EDA for data checking and at the start of the analysis
Relate EDA to the principles of official statistics
EDA with small and large data sets
Session 12:
Stressed the importance of EDA
Introduced 2 new tools (dot plots and stem and leaf plots)
Practiced with small data sets
In this session we scale up and look at large data sets:
The tools do not scale up easily
But the concepts do scale up, and EDA becomes even more crucial
Most data sets are large, at least compared with teaching examples!
The essence of a stem and leaf plot
[Figure: a stem and leaf plot alongside a stacked dot plot of the same data]
The leaf shows the next digit. This can be useful in the exploration phase.
Data: 5.3, 5.4, 6.0, .., 11.1, 11.9
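
The idea can be sketched in a few lines of Python. This is only an illustration: the values are hypothetical (the slide shows just part of its data), `stem_and_leaf` is a made-up helper name, and the sketch assumes non-negative values recorded to one decimal place.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative values with one decimal place."""
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # e.g. 5.3 -> 53
        stem, leaf = divmod(tenths, 10)   # stem 5, leaf 3
        rows[stem].append(str(leaf))
    lo, hi = min(rows), max(rows)
    # Empty stems are kept so that gaps in the data stay visible.
    return [f"{stem:>3} | {''.join(rows.get(stem, []))}" for stem in range(lo, hi + 1)]

# Hypothetical data, for illustration only.
data = [5.3, 5.4, 6.0, 6.2, 6.2, 7.1, 7.5, 7.8, 9.4, 11.1, 11.9]
plot = stem_and_leaf(data)
print("\n".join(plot))
```

Every leaf is an actual digit from the data, so the individual values can still be read off the plot.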
What are the key points?
We look at individual data points, not summaries, at this stage. This is general for EDA.
The stem and leaf plot in particular keeps the actual numbers as far as possible. This can be important.
An example uses the Tanzania survey
Tanzania agriculture survey
This is the variable we wish to explore.
It is a value between 0 and 100
The data in Excel
The variable to explore before analysis
How to explore this value
Can we do a stem and leaf plot? It could be done by hand in Excel, but there are 16628 values!
Even if automated, that is too many!
The essence of a stem and leaf plot is to look at all the possible values
Try a pivot table, a powerful feature in Excel used previously on categorical data
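
Outside Excel, the same idea is a frequency table of every distinct value. The sketch below uses Python's `collections.Counter`; the list `percent` is hypothetical and merely stands in for the survey column.

```python
from collections import Counter

# Hypothetical percentages (0-100), standing in for the survey column.
percent = [0, 0, 2, 5, 10, 10, 10, 25, 25, 33, 50, 50, 50, 50, 75, 100, 100]

# Count how often each distinct value occurs, like a pivot table of counts.
counts = Counter(percent)

# One row per distinct value: the value, its count, and a bar of stars.
for value in sorted(counts):
    print(f"{value:>3}%  {counts[value]:>5}  {'*' * counts[value]}")
```

With thousands of records the table still has at most 101 rows, so all the possible values can be inspected at once.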
The pivot table
Some results
What do you deduce?
There are oddities in rounding, perhaps due to enumerator differences. Can this question be answered to 1%?
So what should be done before analysis?
First look further at the data
Excel can help: it can drill down to examine individual records
The concept: use the table to look for oddities, then examine them in more detail
Drilling down: an example
Make the cell containing the count 6, which corresponds to 2%, the active cell
Then double-click to give the detail
4 of these values are from the same village, so the same enumerator
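
A drill-down is simply a filter back to the individual records behind one cell of the table. The sketch below is hypothetical throughout: the records, the village names and the helper `drill_down` are invented for illustration, and were constructed to mirror the counts on this slide.

```python
from collections import Counter

# Hypothetical records: (household id, village, percent). Not the survey data.
records = [
    (101, "Village A", 2), (102, "Village A", 2), (103, "Village A", 2),
    (104, "Village A", 2), (205, "Village B", 2), (310, "Village C", 2),
    (106, "Village A", 10), (207, "Village B", 50), (311, "Village C", 25),
]

def drill_down(rows, value):
    """Return the individual records behind one cell of the frequency table."""
    return [r for r in rows if r[2] == value]

detail = drill_down(records, 2)

# Tabulate the selected records by village to see whether they cluster.
by_village = Counter(village for _, village, _ in detail)

for row in detail:
    print(row)
print(by_village.most_common())
```

Tabulating the selected records by a second variable (here the village) is what reveals that most of them share a source.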
What do you conclude: technique/results
Technique
Stem and leaf plots when looking at small datasets
Pivot tables when datasets are large
But the principle is general
Numbers must be looked at carefully!
The principle can be adapted for the data
and explored effectively in Excel
Results
Did enumerators have different interpretations of the precision required in the percentages?
This needs further exploration
and the analysis needs to take account of this
Another new element in this session
Exploratory analysis includes looking for oddities in the data
Unexplained oddities cause variation that can make it difficult to detect the pattern
because they add unnecessary noise to the data
How do you tame the variation?
One way is to examine related variables
This is important in the analysis. The next slide is a repeat from Session 3.
It is also a key weapon in data exploration and is covered in the practical
Slide from Module B2 Session 3
To do good statistics you must
fight the curse of variation
Two main strategies to overcome variation
1. Take enough observations
In the Tanzania survey there were 3223 households just from this one region
2. Measure characteristics that explain variation
Variation itself is not necessarily the problem. Variation you do not understand is the problem.
Here we start understanding variation at the exploration stage
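
The second strategy can be illustrated with a small sketch: once a characteristic is recorded, the variation left over after allowing for it (the residuals) is much smaller than the total. The yields and variety names below are hypothetical, chosen only to make the point.

```python
from statistics import mean, pvariance

# Hypothetical yields with the variety recorded for each plot.
yields = {
    "local":    [1.1, 1.3, 1.2, 1.4, 1.0],
    "improved": [2.4, 2.6, 2.5, 2.3, 2.7],
}

# Total variation, ignoring variety.
all_values = [y for ys in yields.values() for y in ys]
total_var = pvariance(all_values)

# Residuals: each value minus the mean of its own variety.
residuals = [y - mean(ys) for ys in yields.values() for y in ys]
unexplained_var = pvariance(residuals)

explained_share = 1 - unexplained_var / total_var
print(f"Total variance:             {total_var:.4f}")
print(f"Unexplained variance:       {unexplained_var:.4f}")
print(f"Share explained by variety: {explained_share:.0%}")
```

Most of the apparent variation here is understood as soon as variety is taken into account; what remains is the part still to be explained.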
Practical: three parts
Tanzania data: practice what has been done in these slides
Dot plots split by a factor: demonstration and practice
Swaziland data: apply the concepts, checking factors as well as numeric columns
Then the key points are reviewed
Points for review after the practical
Looking for individual problems and surprising patterns
Exploratory graphics need to help the analyst and data checker (see dot plots on next slide)
Tables are also useful, especially with the facility to drill down
Look at individual variables and at records as a whole
Trust your common sense. It is useful to estimate results, and to question the computer if they are very different.
Dot plots - yield by variety
Outliers (typing errors) are clear, but only because of the 2nd variable
They are not outliers overall
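
A rough text version of a dot plot split by a factor shows the same effect. Everything here is an assumption made for illustration: the yields are hypothetical, `dot_row` and `flag_outliers` are made-up helpers, and the rule used (more than 3 median absolute deviations from the group median) is just one simple choice, not the method used in the course.

```python
from collections import Counter
from statistics import median

# Hypothetical yields by variety; the 2.5 under "local" mimics a typing error.
yields = {
    "local":    [1.0, 1.1, 1.2, 1.2, 1.3, 1.4, 2.5],
    "improved": [2.3, 2.4, 2.5, 2.5, 2.6, 2.7, 2.7],
}

def dot_row(values, lo, hi, step=0.1):
    """One text row: a digit counts the values at each grid position."""
    n = round((hi - lo) / step) + 1
    hits = Counter(round((v - lo) / step) for v in values)
    return "".join(str(hits[i]) if hits[i] else "." for i in range(n))

def flag_outliers(values, k=3):
    """Values more than k median absolute deviations from the group median."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    return [v for v in values if abs(v - m) > k * mad]

# Use one common scale so the rows can be compared directly.
pooled = [v for vs in yields.values() for v in vs]
lo, hi = min(pooled), max(pooled)

rows = {name: dot_row(vs, lo, hi) for name, vs in yields.items()}
flagged = {name: flag_outliers(vs) for name, vs in yields.items()}

for name in yields:
    print(f"{name:>8} {rows[name]}  flagged: {flagged[name]}")
```

The value 2.5 lies well inside the overall range, so it would pass unnoticed in a single plot of all yields. It stands out only within its own variety.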
EDA is a continuous process
EDA is effectively a continuation of the data checking process
The example on the previous slide shows how some oddities only become clear once the analysis is undertaken
This continues into the formal analysis, where it involves looking at the residuals
They are the unexplained variation, as discussed in Session 3!
So analysis is not just a set of rules. It is a thoughtful process, where you become the data detective!
Swaziland data was for checking
Investigating the column called Presence
What does 0 mean?
Why are there blanks?
Next steps:
1. Look at the questionnaire
2. Select these records
You are becoming detectives!
Codes for the column
Seems clear enough, but the zeros and blanks are still a puzzle
Selecting the blank records
i.e. serious problems with the whole record
Missing also
Too young, and all the same
Crop code not recognised
Areas too large
Dot plot of area by Presence
Odd crop areas were ALL associated with odd codes for the column PRESENCE
It was found to be a data transfer problem
with one byte missing in these records
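
The cross-check between two columns can be sketched as follows. The records, field names, set of valid codes and area threshold are all assumptions made for this illustration and do not reflect the real file layout.

```python
# Hypothetical records; field names and codes are illustrative only.
records = [
    {"id": 1, "presence": "1", "area": 0.8},
    {"id": 2, "presence": "2", "area": 1.5},
    {"id": 3, "presence": "",  "area": 96.0},
    {"id": 4, "presence": "1", "area": 1.1},
    {"id": 5, "presence": "0", "area": 120.0},
    {"id": 6, "presence": "3", "area": 0.6},
]

VALID_CODES = {"1", "2", "3"}   # assumed set of documented codes
MAX_PLAUSIBLE_AREA = 20.0       # assumed threshold for this sketch

# Records whose code is blank or not in the code list.
odd_code = {r["id"] for r in records if r["presence"] not in VALID_CODES}

# Records whose area is implausibly large.
odd_area = {r["id"] for r in records if r["area"] > MAX_PLAUSIBLE_AREA}

print("Odd codes:", sorted(odd_code))
print("Odd areas:", sorted(odd_area))
print("All odd areas have odd codes:", odd_area <= odd_code)
```

When the oddities in one column line up with the oddities in another, the problem usually lies with the whole record rather than with a single value.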
Checking data quality and EDA
Where               Why                                    How                      By Whom
Before data entry   To ensure complete data set received   Manual check             Supervisor
During data entry   To highlight anomalies                 Filter, dot plots etc.   Supervisor and helpers
Before analysis     Double check                           As above                 Analyst/statistician
During analysis     Remain critical                        Residuals                Analyst/statistician
Importance: principles of official statistics
Principle 2: Professional standards
It is unprofessional to analyse the data and report results without exploring critically at all stages
Principle 4: Prevention of misuse
We risk misusing the data unless we explore the data critically
Principle 5: Sources of statistics
Includes a requirement to avoid undue burden on respondents
We must process the data fully and effectively. This needs EDA.
Otherwise the burden imposed on respondents is to some extent wasted
Can you now:
Apply EDA concepts to a large dataset
Explain the importance of EDA for data checking and at the start of the analysis
Relate EDA to the principles of official statistics
Now you can organise the data for analysis
And then do an exploratory analysis
We show next how the analysis is easy IF your objectives are clear