Exploratory Data Analysis
M. Srinath
2
Exploratory Data Analysis: Introduction
• Exploratory data analysis was promoted by John Tukey in 1977 to encourage statisticians to examine their data sets visually and to formulate hypotheses that could be tested on data sets
• Exploratory data analysis (EDA) is an approach for analysing data to summarize the main characteristics of variables in an easy-to-understand form, often with visual graphs, without using a statistical model or having formulated a hypothesis
• EDA techniques are generally graphical. They include scatter plots, stem-and-leaf plots, box plots, histograms, quantile plots, residual plots, and mean plots
• EDA methods are generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate)
3
Exploratory Data Analysis
• EDA offers several techniques to comprehend data
• But EDA is more than a library of data analysis techniques
• EDA is an approach to data analysis
• EDA involves inspecting data without any assumptions
– Mostly using information graphics
4
Exploratory Data Analysis: Univariate non-graphical EDA
Categorical data
• The only useful univariate non-graphical technique for categorical variables is some form of tabulation of the frequencies, usually along with calculation of the fraction (or percent) of data that falls in each category
Quantitative data
• Univariate non-graphical EDA generally focuses on measures of central tendency (mean, median and mode), quartiles, spread (variance, SD and IQR), skewness and kurtosis
• These descriptive statistics quantitatively describe the main features of the data
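A minimal sketch of both cases using only the Python standard library. The function names `tabulate` and `describe` and the sample values are hypothetical, chosen for illustration. Skewness and kurtosis are computed here in their plain moment form without bias correction, so statistical packages that apply a correction may report slightly different values.

```python
import statistics
from collections import Counter


def tabulate(values):
    """Categorical data: return {category: (count, percent)}."""
    counts = Counter(values)
    n = len(values)
    return {cat: (c, 100.0 * c / n) for cat, c in counts.items()}


def describe(values):
    """Quantitative data: return common univariate descriptives."""
    n = len(values)
    mean = statistics.fmean(values)
    # Quartiles with the module's default ('exclusive') method;
    # other packages may interpolate differently.
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Central moments (population form, no bias correction).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "sd": statistics.stdev(values),           # sample SD (n - 1)
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,           # excess kurtosis
    }


# Illustrative (made-up) data
print(tabulate(["a", "b", "a", "a"]))
print(describe([2, 4, 4, 5, 7, 9, 11]))
```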
5
Univariate non-graphical EDA
A typical output of descriptive statistics
Variable: Di (Development index)

Statistic                 Value
N (valid)                 1201
N (missing)               0
Mean                      0.260333
Median                    0.261697
Mode                      0.214959
Std. Deviation            0.086778
Skewness                  0.086567
Std. Error of Skewness    0.070593
Kurtosis                  -0.88541
Std. Error of Kurtosis    0.14107
25th percentile           0.186396
50th percentile           0.261697
75th percentile           0.330004

• When data has outliers, the median is more robust than the mean
• When the data distribution is skewed, the median is more meaningful than the mean
• IQR = 0.330004 - 0.186396 = 0.143608
• IQR is also a robust measure of spread
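The robustness of the median and the IQR can be illustrated with a small sketch. The two samples are made up for illustration, and the helper name `iqr` is hypothetical. One value is replaced by an extreme one, and the four summaries are compared before and after.

```python
import statistics


def iqr(values):
    """Interquartile range using the statistics module's default quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


clean = [10, 11, 12, 13, 14, 15, 16]
with_outlier = [10, 11, 12, 13, 14, 15, 160]  # largest value replaced by an outlier

for name, sample in (("clean", clean), ("with outlier", with_outlier)):
    print(
        name,
        "mean:", round(statistics.fmean(sample), 2),
        "median:", statistics.median(sample),
        "sd:", round(statistics.stdev(sample), 2),
        "iqr:", iqr(sample),
    )
```

In this sample the mean and standard deviation change substantially once the outlier is introduced, while the median and IQR stay the same.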
6
Univariate graphical EDA: Histogram
• Graphical display of a frequency distribution
– Counts of data falling in various ranges (bins)
– Histograms are for numeric data
• Bin size selection is important
– Too small: may show spurious patterns
– Too large: may hide important patterns
• Several variations are possible
– Plot relative frequencies instead of raw frequencies
– Make the height of each bar equal to the ‘relative frequency/width’
– With this scaling the area under the histogram is 1
• When observations come from a continuous scale, histograms can be approximated by continuous curves
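A minimal sketch of the 'relative frequency/width' scaling, assuming equal-width bins. The function name `density_histogram` is hypothetical, and the sample reuses the 20 values listed on the stem-and-leaf slide. Changing `n_bins` shows how the bin size alters the picture.

```python
def density_histogram(values, n_bins):
    """Equal-width bins; bar height = relative frequency / bin width.

    Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in values:
        # The maximum value goes into the last bin rather than a new one.
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    n = len(values)
    heights = [c / n / width for c in counts]
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts, heights


data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

edges, counts, heights = density_histogram(data, n_bins=5)
width = edges[1] - edges[0]
for left, c, h in zip(edges, counts, heights):
    print(f"[{left:5.1f}, {left + width:5.1f})  count={c:2d}  height={h:.4f}")
print("area:", sum(h * width for h in heights))
```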
7
Stem and Leaf Plot
• This plot organizes data for easy visual inspection
– Min and max values
– Data distribution
• Unlike descriptive statistics, this plot shows all the data
– No information loss
– Individual values can be inspected
• Structure of the plot
– Stem: the digits in the largest place (e.g. tens place)
– Leaves: the digits in the smallest place (e.g. ones place)
– Leaves are listed to the right of the stem, separated by ‘|’
• Possible to place leaves from another data set to the left of the stem for comparing two data distributions
Data
29, 44, 12, 53, 21, 34, 39, 25, 48, 23, 17, 24, 27, 32, 34, 15, 42, 21, 28, 37
Stem and Leaf Plot
1 | 2 7 5
2 | 9 1 5 3 4 7 1 8
3 | 4 9 2 4 7
4 | 4 8 2
5 | 3
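A minimal sketch of how such a plot can be built for two-digit non-negative integers, with the tens digit as stem and the ones digit as leaf. The function name `stem_and_leaf` is hypothetical. Unlike the slide, which lists leaves in order of appearance, this sketch sorts the leaves within each stem, which is the more common convention.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = ones)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    lines = []
    # Include every stem between the smallest and largest, even if empty.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

for line in stem_and_leaf(data):
    print(line)
```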
8
Stem and leaf plot
Di Stem-and-Leaf Plot

Frequency   Stem & Leaf
     1.00     0 . &
    10.00     0 . 999&
    32.00     1 . 0000001111
    66.00     1 . 2222222222223333333333
    59.00     1 . 4444444445555555555
   104.00     1 . 66666666666666666777777777777777777
    81.00     1 . 888888888888899999999999999
    76.00     2 . 00000000000000011111111111
    82.00     2 . 2222222222222333333333333333
    82.00     2 . 444444444444444445555555555
    96.00     2 . 66666666666667777777777777777777
    91.00     2 . 888888888888888899999999999999
    79.00     3 . 00000000000000111111111111
    90.00     3 . 222222222222223333333333333333
    82.00     3 . 4444444444444555555555555555
    67.00     3 . 6666666666667777777777
    38.00     3 . 888888889999
    33.00     4 . 00000011111
    18.00     4 . 222233
     9.00     4 . 445
     4.00     4 . 6&
     1.00     4 . &

Stem width: .1000000
Each leaf: 3 case(s)
& denotes fractional leaves.
9
Box Plot
• A five value summary plot of data
– Minimum, maximum
– Median
– 1st and 3rd quartiles
• Often used in conjunction with a histogram in EDA
• Structure of the plot
– Box represents the IQR (the middle 50% of values)
– The horizontal line in the box shows the median
– Vertical lines extend above and below the box
– Ends of the vertical lines, called whiskers, indicate the max and min values if these fall within 1.5*IQR of the box
– Otherwise each whisker ends at the most extreme value within 1.5*IQR of the box, and values beyond the whiskers are shown as outliers
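The numbers behind a box plot can be sketched as follows, assuming the common 1.5*IQR rule for whiskers and outliers. The function name `box_plot_stats` and the sample are hypothetical, and quartiles use the standard library's default method, so plotting packages may place the box edges slightly differently.

```python
import statistics


def box_plot_stats(values):
    """Return the quantities a box plot draws, using the 1.5*IQR whisker rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme values that are not outliers.
        "lower_whisker": inside[0],
        "upper_whisker": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


print(box_plot_stats([10, 11, 12, 13, 14, 15, 160]))
```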
10
Quantile-Normal plot
• Used to see how well a particular sample follows a particular theoretical distribution (here, the normal distribution)
• Many statistical tests assume that the outcome for any set of values of the explanatory variables is approximately normally distributed. That is why QN plots are useful: if the assumption is grossly violated, the p-values and confidence intervals of those tests are wrong
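A minimal sketch of the points that a quantile-normal plot displays. The function name `quantile_normal_points` and the sample are hypothetical, and the plotting positions (i - 0.5)/n are only one of several conventions in use. Each sorted observation is paired with the quantile expected from a normal distribution that has the sample's mean and standard deviation; points lying close to the line y = x suggest approximate normality.

```python
from statistics import NormalDist, fmean, stdev


def quantile_normal_points(values):
    """Return (theoretical normal quantile, observed value) pairs."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(fmean(data), stdev(data))
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    return [
        (fitted.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(data, start=1)
    ]


for expected, observed in quantile_normal_points([1, 2, 3, 4, 5]):
    print(f"expected={expected:6.3f}  observed={observed}")
```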
11
Scatter Plot
• Scatter plots are two dimensional graphs with
– Explanatory attribute plotted on the x-axis
– Response attribute plotted on the y-axis
• Useful for understanding the relationship between two attributes
• Features of the relationship
– Strength
– Shape (linear or curved)
– Direction
– Outliers
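As a numeric companion to the plot, the strength and direction of a linear relationship are often summarized by Pearson's correlation coefficient. This is an addition for illustration, not a technique named on the slide; the function name `pearson_r` and the data are hypothetical. The coefficient says nothing about shape or outliers, which still need to be read from the plot itself.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfectly increasing linear relation
print(pearson_r(x, [10, 8, 6, 4, 2]))   # perfectly decreasing linear relation
print(pearson_r(x, [3, 1, 4, 1, 5]))    # weaker relation
```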
12
Scatter Plot Matrix
• When multiple attributes need to be visualized all at once
– Scatter plots are drawn for every pair of attributes and arranged into a 2D matrix
• Useful for spotting relationships among attributes
– Similar to a scatter plot
– Attributes are shown on the diagonal
13
Cross tabulation
• For categorical data (and quantitative data with only a few different values) an extension of tabulation called cross-tabulation is very useful.
• For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels.
• The two variables might be both explanatory, both outcome, or one of each. Depending on the goals, row percentages (which add to 100% for each row), column percentages (which add to 100% for each column) and/or cell percentages (which add to 100% over all cells) are also useful.
• Cross-tabulation can be extended to three (and sometimes more) variables by making separate two-way tables for two variables at each level of a third variable. Cross-tabulation is the basic bivariate non-graphical EDA technique.
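A minimal two-way cross-tabulation can be sketched with the standard library. The function name `crosstab` and the six subjects are made up for illustration. For every pair of levels it returns the count together with the row, column and cell percentages described above.

```python
from collections import Counter


def crosstab(row_var, col_var):
    """Return {(row level, col level): count and percentages} for two variables."""
    counts = Counter(zip(row_var, col_var))
    row_totals = Counter(row_var)
    col_totals = Counter(col_var)
    n = len(row_var)
    return {
        (r, c): {
            "count": k,
            "row_pct": 100.0 * k / row_totals[r],   # adds to 100% within a row
            "col_pct": 100.0 * k / col_totals[c],   # adds to 100% within a column
            "cell_pct": 100.0 * k / n,              # adds to 100% over all cells
        }
        for (r, c), k in counts.items()
    }


# Hypothetical data: one entry per subject
occupation = ["Labour", "Labour", "Farming", "Farming", "Farming", "Service"]
group = ["A", "B", "A", "A", "B", "B"]

for cell, stats in sorted(crosstab(occupation, group).items()):
    print(cell, stats)
```

Pairs of levels that never occur together are simply absent from the result; a full table would show them with a count of 0.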
14
Cross tabulation
MainOccupation * Castehierarchy Crosstabulation

MainOccupation  Statistic                 SC/ST  Backward  OBC    Upper caste  Total
Labour          Count                     148    33        35     12           228
                % within MainOccupation   64.9   14.5      15.4   5.3          100.0
                % within Castehierarchy   42.2   26.2      15.0   2.4          19.0
                % of Total                12.3   2.7       2.9    1.0          19.0
Business        Count                     20     6         26     26           78
                % within MainOccupation   25.6   7.7       33.3   33.3         100.0
                % within Castehierarchy   5.7    4.8       11.1   5.3          6.5
                % of Total                1.7    0.5       2.2    2.2          6.5
Service         Count                     21     5         4      37           67
                % within MainOccupation   31.3   7.5       6.0    55.2         100.0
                % within Castehierarchy   6.0    4.0       1.7    7.6          5.6
                % of Total                1.7    0.4       0.3    3.1          5.6
Farming         Count                     162    82        169    415          828
                % within MainOccupation   19.6   9.9       20.4   50.1         100.0
                % within Castehierarchy   46.2   65.1      72.2   84.7         68.9
                % of Total                13.5   6.8       14.1   34.6         68.9
Total           Count                     351    126       234    490          1201
                % within MainOccupation   29.2   10.5      19.5   40.8         100.0
                % within Castehierarchy   100.0  100.0     100.0  100.0        100.0
                % of Total                29.2   10.5      19.5   40.8         100.0
15
Univariate statistics by category
• For one categorical variable (usually explanatory) and one quantitative variable (usually outcome), it is common to produce some of the standard univariate non-graphical statistics for the quantitative variables separately for each level of the categorical variable, and then compare the statistics across levels of the categorical variable
Univariate statistics of Di by category

Statecode       Mean    SD      Median  Min     Max     Skewness  Kurtosis
Andhra Pradesh  0.1901  0.0592  0.1787  0.0947  0.3947  0.6399    0.1279
Assam           0.2080  0.0569  0.1970  0.0878  0.3664  0.2354    -0.5341
Haryana         0.2706  0.0684  0.2853  0.1135  0.3997  -0.4030   -0.7584
HP              0.3319  0.0617  0.3353  0.1559  0.4862  -0.1965   0.0755
Karnataka       0.1782  0.0586  0.1716  0.0781  0.4674  1.5890    4.5150
Maharashtra     0.2537  0.0778  0.2434  0.0975  0.4318  0.1728    -0.6878
Punjab          0.3342  0.0676  0.3346  0.1623  0.4694  -0.1837   -0.5428
Uttrakhand      0.3144  0.0552  0.3216  0.1864  0.4416  -0.3060   -0.6048
Total           0.2603  0.0868  0.2617  0.0781  0.4862  0.0866    -0.8854
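A table of this kind can be produced by grouping the quantitative values by level and summarizing each group separately. The sketch below uses only the standard library; the function name `describe_by_category` and the data are hypothetical and are not the survey data behind the table.

```python
import statistics
from collections import defaultdict


def describe_by_category(categories, values):
    """Summarize a quantitative variable separately for each category level."""
    groups = defaultdict(list)
    for cat, v in zip(categories, values):
        groups[cat].append(v)
    return {
        cat: {
            "n": len(vs),
            "mean": statistics.fmean(vs),
            "sd": statistics.stdev(vs),   # needs at least two values per level
            "median": statistics.median(vs),
            "min": min(vs),
            "max": max(vs),
        }
        for cat, vs in groups.items()
    }


# Hypothetical data: one (state, value) pair per subject
state = ["X", "X", "X", "Y", "Y", "Y"]
value = [1, 2, 3, 10, 20, 30]

for level, stats in describe_by_category(state, value).items():
    print(level, stats)
```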
16
Univariate graph by category: Bar plot
17
Univariate graph by category: Box plot
18
EDA summary
• All the techniques presented so far are tools useful for EDA
• But without an understanding built from the EDA, effective use of tools is not possible
• EDA helps to answer a lot of questions
– What is a typical value?
– What is the uncertainty of a typical value?
– What is a good distributional fit for the data?
– What are the relationships between two attributes?
– etc.
19
The greatest value of a picture is when it forces us to notice what we never expected to see.
— John W. Tukey
The best thing about being a statistician is that you get to play in everyone’s backyard. - John W. Tukey
The obvious is that which is never seen until someone expresses it simply.
Kahlil Gibran