Eda sri

Exploratory Data Analysis

M. Srinath

2

Exploratory Data Analysis

Introduction

• Exploratory data analysis was promoted by John Tukey in 1977 to encourage statisticians to examine their data sets visually and to formulate hypotheses that could be tested on data sets

• Exploratory data analysis (EDA) is an approach for analysing data to summarize the main characteristics of variables in easy-to-understand form, often with visual graphs, without using a statistical model or having formulated a hypothesis

• EDA techniques are generally graphical. They include scatter plots, stem-and-leaf plots, box plots, histograms, quantile plots, residual plots, and mean plots

• EDA methods are generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate)


3

Exploratory Data Analysis

• EDA offers several techniques to comprehend data

• But EDA is more than a library of data analysis techniques

• EDA is an approach to data analysis

• EDA involves inspecting data without any assumptions

– Mostly using information graphics

4

Exploratory Data Analysis

Univariate non-graphical EDA

Categorical data

The only useful univariate non-graphical technique for categorical variables is some form of tabulation of the frequencies, usually along with calculation of the fraction (or percent) of data that falls in each category

Quantitative data

Univariate non-graphical EDA generally focuses on measures of central tendency (mean, median and mode), quartiles, spread (variance, SD and IQR), skewness and kurtosis

These descriptive statistics quantitatively describe the main features of the data
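As a rough illustration of the two cases above, the sketch below tabulates a categorical variable and computes the usual descriptives for a quantitative one using only the Python standard library. The data values and the helper names `tabulate` and `describe` are made up for this example. Skewness and kurtosis are left out because the standard library has no ready-made function for them, and the quartile values depend on the interpolation convention used by `statistics.quantiles`.

```python
from collections import Counter
import statistics


def tabulate(values):
    """Return {category: (count, percent)} for a categorical variable."""
    counts = Counter(values)
    n = len(values)
    return {cat: (c, 100.0 * c / n) for cat, c in counts.items()}


def describe(values):
    """Return common univariate descriptives for a quantitative variable."""
    # Quartile cut points (default 'exclusive' interpolation method)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance
        "sd": statistics.stdev(values),           # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


# Hypothetical data, for illustration only
occupation = ["Farming", "Labour", "Farming", "Service", "Farming"]
scores = [2, 4, 4, 5, 7, 9, 11]

freq = tabulate(occupation)
stats = describe(scores)
```

Here `freq` maps each category to its count and percentage, and `stats` holds the measures of central tendency and spread listed above.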

5

Univariate non-graphical EDA

A typical output of Descriptive Statistics

Variable: Di (Development index)

N Valid                   1201
N Missing                 0
Mean                      0.260333
Median                    0.261697
Mode                      0.214959
Std. Deviation            0.086778
Skewness                  0.086567
Std. Error of Skewness    0.070593
Kurtosis                  -0.88541
Std. Error of Kurtosis    0.14107
Percentiles 25            0.186396
Percentiles 50            0.261697
Percentiles 75            0.330004

• When the data have outliers, the median is more robust than the mean

• When the data distribution is skewed, the median is more meaningful than the mean

• IQR = 0.330004 - 0.186396 = 0.143608

• IQR is also a robust measure of spread
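The robustness claim can be checked with a small experiment. The sketch below uses a made-up sample, replaces its largest value with an extreme one, and compares how far the mean and the median move. It also recomputes the IQR from the 25th and 75th percentiles reported for Di above.

```python
import statistics

# Hypothetical sample, for illustration only
clean = [10, 12, 13, 15, 16, 18, 20]
# Same sample with the largest value replaced by an extreme outlier
with_outlier = clean[:-1] + [200]

mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

# IQR from the quartiles reported in the descriptive statistics for Di
q1, q3 = 0.186396, 0.330004
iqr = q3 - q1
```

The single outlier pulls the mean up by more than 25 units, while the median does not move at all.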

6

Univariate graphical EDA - Histogram

• Graphical display of frequency distribution

– Counts of data falling in various ranges (bins)

– Histogram for numeric data

• Bin size selection is important

– Too small – may show spurious patterns

– Too large – may hide important patterns

• Several variations possible

– Plot relative frequencies instead of raw frequencies

– Make the height of the histogram equal to the ‘relative frequency/width’

• With the ‘relative frequency/width’ scaling, the area under the histogram is 1

• When observations come from a continuous scale, histograms can be approximated by continuous curves
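The variations listed above can be sketched in a few lines. The example below assumes equal-width bins and a randomly generated sample, and the helper name `histogram` is made up for this illustration. It computes raw counts, relative frequencies, and heights scaled as relative frequency divided by bin width, then confirms that the scaled histogram has total area 1.

```python
import random


def histogram(values, n_bins):
    """Return (edges, counts) for equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return edges, counts


rng = random.Random(0)
data = [rng.gauss(0.26, 0.09) for _ in range(1000)]  # hypothetical sample

edges, counts = histogram(data, n_bins=20)           # raw frequencies
width = edges[1] - edges[0]
rel_freq = [c / len(data) for c in counts]           # relative frequencies
density = [r / width for r in rel_freq]              # relative frequency / width
area = sum(d * width for d in density)               # total area under the bars
```

Re-running with a much smaller or much larger `n_bins` is a quick way to see the effect of bin size described above.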

7

Stem and Leaf Plot

• This plot organizes data for easy visual inspection

– Min and max values

– Data distribution

• Unlike descriptive statistics, this plot shows all the data

– No information loss

– Individual values can be inspected

• Structure of the plot

– Stem – the digits in the largest place (e.g. tens place)

– Leaves – the digits in the smallest place (e.g. ones place)

– Leaves are listed to the right of the stem, separated by ‘|’

• Possible to place leaves from another data set to the left of the stem for comparing two data distributions

Data

29, 44, 12, 53, 21, 34, 39, 25, 48, 23, 17, 24, 27, 32, 34, 15, 42, 21, 28, 37

Stem and Leaf Plot

1 | 2 7 5
2 | 9 1 5 3 4 7 1 8
3 | 4 9 2 4 7
4 | 4 8 2
5 | 3
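The small example above can be reproduced with a short function. This is a minimal sketch that assumes non-negative two-digit integers, so the tens digit is the stem and the ones digit is the leaf. The function name `stem_and_leaf` is made up for this example, and leaves are kept in input order to match the slide rather than sorted.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group integers by tens digit (stem), keeping ones digits (leaves) in input order."""
    plot = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(sorted(plot.items()))


data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

plot = stem_and_leaf(data)
lines = [f"{stem} | {' '.join(map(str, leaves))}" for stem, leaves in plot.items()]
print("\n".join(lines))
```

Sorting each list of leaves before printing gives the ordered variant that many packages produce.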

8

Stem and leaf plot

Di Stem-and-Leaf Plot

Frequency    Stem &  Leaf

     1.00        0 .  &
    10.00        0 .  999&
    32.00        1 .  0000001111
    66.00        1 .  2222222222223333333333
    59.00        1 .  4444444445555555555
   104.00        1 .  66666666666666666777777777777777777
    81.00        1 .  888888888888899999999999999
    76.00        2 .  00000000000000011111111111
    82.00        2 .  2222222222222333333333333333
    82.00        2 .  444444444444444445555555555
    96.00        2 .  66666666666667777777777777777777
    91.00        2 .  888888888888888899999999999999
    79.00        3 .  00000000000000111111111111
    90.00        3 .  222222222222223333333333333333
    82.00        3 .  4444444444444555555555555555
    67.00        3 .  6666666666667777777777
    38.00        3 .  888888889999
    33.00        4 .  00000011111
    18.00        4 .  222233
     9.00        4 .  445
     4.00        4 .  6&
     1.00        4 .  &

Stem width: .1000000
Each leaf: 3 case(s)
& denotes fractional leaves.

9

Box Plot

• A five-number summary plot of data

– Minimum, maximum

– Median

– 1st and 3rd quartiles

• Often used in conjunction with a histogram in EDA

• Structure of the plot

– Box represents the IQR (the middle 50% of values)

– The horizontal line in the box shows the median

– Vertical lines extend above and below the box

– Ends of the vertical lines, called whiskers, indicate the max and min values if these fall within 1.5*IQR of the box

– Otherwise, values beyond 1.5*IQR are shown as outliers above/below the whiskers
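The quantities a box plot displays can be computed directly. The sketch below assumes the common 1.5*IQR rule measured from the quartiles. The helper name `box_plot_summary` is made up for this example, and the quartiles use the "inclusive" interpolation of `statistics.quantiles`, which may differ slightly from what a given statistics package reports. The data are the 20 values from the stem-and-leaf example plus one hypothetical extreme value (95) added to trigger an outlier.

```python
import statistics


def box_plot_summary(values):
    """Five-number summary plus whisker ends and outliers under the 1.5*IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        # Whiskers end at the most extreme values that are still inside the fences
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37,
        95]  # 95 is a hypothetical value added for illustration

summary = box_plot_summary(data)
```

With these values the upper whisker stops at 53 and the value 95 is flagged as an outlier, while the lower whisker reaches the minimum because it lies inside the fence.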

10

Quantile-Normal plot

• Used to see how well a particular sample follows a particular theoretical distribution

• Many statistical tests assume that the outcome for any set of values of the explanatory variables is approximately normally distributed, which is why QN plots are useful: if the assumption is grossly violated, the p-values and confidence intervals of those tests are wrong
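The points of a QN plot can be computed without any plotting library. The sketch below pairs each sorted observation with the standard normal quantile at plotting position (i + 0.5)/n, which is one of several conventions in use. As a rough numerical stand-in for "how straight is the plot", it also computes the correlation between the two sets of quantiles. That summary is an addition for this example, not part of the slide, and the samples and the helper names `qn_points` and `pearson_r` are made up.

```python
import random
from statistics import NormalDist, mean


def qn_points(sample):
    """Return (theoretical normal quantiles, sorted sample) for a QN plot."""
    data = sorted(sample)
    n = len(data)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, data


def pearson_r(x, y):
    """Pearson correlation, used here only to summarize straightness."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
normal_sample = [rng.gauss(0, 1) for _ in range(500)]        # roughly normal
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]   # strongly right-skewed

r_normal = pearson_r(*qn_points(normal_sample))
r_skewed = pearson_r(*qn_points(skewed_sample))
```

Plotting the two lists returned by `qn_points` against each other gives the QN plot itself. A normal sample falls close to a straight line, while the skewed sample bends away from it in the upper tail.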

11

Scatter Plot

• Scatter plots are two-dimensional graphs with

– Explanatory attribute plotted on the x-axis

– Response attribute plotted on the y-axis

• Useful for understanding the relationship between two attributes

• Features of the relationship

– Strength

– Shape (linear or curve)

– Direction

– Outliers

12

Scatter Plot Matrix

• When multiple attributes need to be visualized all at once

– Scatter plots are drawn for every pair of attributes and arranged into a 2D matrix

• Useful for spotting relationships among attributes

– Similar to a scatter plot

– Attributes are shown on the diagonal

13

Cross tabulation

• For categorical data (and quantitative data with only a few different values) an extension of tabulation called cross-tabulation is very useful.

• For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels.

• The two variables might be both explanatory, both outcome, or one of each. Depending on the goals, row percentages (which add to 100% for each row), column percentages (which add to 100% for each column) and/or cell percentages (which add to 100% over all cells) are also useful.

• Cross-tabulation can be extended to three (and sometimes more) variables by making separate two-way tables for two variables at each level of a third variable. Cross-tabulation is the basic bivariate non-graphical EDA technique.

14

Cross tabulation

MainOccupation * Castehierarchy Crosstabulation

                                        Castehierarchy
MainOccupation                          SC/ST   Backward   OBC     Upper caste   Total
Labour     Count                        148     33         35      12            228
           % within MainOccupation      64.9    14.5       15.4    5.3           100.0
           % within Castehierarchy      42.2    26.2       15.0    2.4           19.0
           % of Total                   12.3    2.7        2.9     1.0           19.0
Business   Count                        20      6          26      26            78
           % within MainOccupation      25.6    7.7        33.3    33.3          100.0
           % within Castehierarchy      5.7     4.8        11.1    5.3           6.5
           % of Total                   1.7     0.5        2.2     2.2           6.5
Service    Count                        21      5          4       37            67
           % within MainOccupation      31.3    7.5        6.0     55.2          100.0
           % within Castehierarchy      6.0     4.0        1.7     7.6           5.6
           % of Total                   1.7     0.4        0.3     3.1           5.6
Farming    Count                        162     82         169     415           828
           % within MainOccupation      19.6    9.9        20.4    50.1          100.0
           % within Castehierarchy      46.2    65.1       72.2    84.7          68.9
           % of Total                   13.5    6.8        14.1    34.6          68.9
Total      Count                        351     126        234     490           1201
           % within MainOccupation      29.2    10.5       19.5    40.8          100.0
           % within Castehierarchy      100.0   100.0      100.0   100.0         100.0
           % of Total                   29.2    10.5       19.5    40.8          100.0
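The three kinds of percentages in the crosstabulation can be recomputed from the counts alone. The sketch below takes the cell counts from the table above and derives row percentages (% within MainOccupation), column percentages (% within Castehierarchy) and cell percentages (% of Total). The variable names and the `pct` helper are chosen for this example, and the reading of the column headings as SC/ST, Backward, OBC and Upper caste is an assumption based on the four count columns.

```python
rows = ["Labour", "Business", "Service", "Farming"]
cols = ["SC/ST", "Backward", "OBC", "Upper caste"]
counts = [
    [148, 33, 35, 12],     # Labour
    [20, 6, 26, 26],       # Business
    [21, 5, 4, 37],        # Service
    [162, 82, 169, 415],   # Farming
]

row_totals = [sum(r) for r in counts]
col_totals = [sum(c) for c in zip(*counts)]
grand_total = sum(row_totals)


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100.0 * part / whole, 1)


# Row percentages: each row adds to 100%
row_pct = [[pct(v, row_totals[i]) for v in r] for i, r in enumerate(counts)]
# Column percentages: each column adds to 100%
col_pct = [[pct(v, col_totals[j]) for j, v in enumerate(r)] for r in counts]
# Cell percentages: all cells together add to 100%
cell_pct = [[pct(v, grand_total) for v in r] for r in counts]
```

For the Labour row this reproduces the 64.9, 42.2 and 12.3 shown for SC/ST in the table.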

15

Univariate statistics by category

• For one categorical variable (usually explanatory) and one quantitative variable (usually outcome), it is common to produce some of the standard univariate non-graphical statistics for the quantitative variable separately for each level of the categorical variable, and then compare the statistics across levels of the categorical variable

Univariate statistics of Di by category

Statecode        Mean     SD       Median   Min      Max      Skewness   Kurtosis
Andhra Pradesh   0.1901   0.0592   0.1787   0.0947   0.3947   0.6399     0.1279
Assam            0.2080   0.0569   0.1970   0.0878   0.3664   0.2354     -0.5341
Haryana          0.2706   0.0684   0.2853   0.1135   0.3997   -0.4030    -0.7584
HP               0.3319   0.0617   0.3353   0.1559   0.4862   -0.1965    0.0755
Karnataka        0.1782   0.0586   0.1716   0.0781   0.4674   1.5890     4.5150
Maharashtra      0.2537   0.0778   0.2434   0.0975   0.4318   0.1728     -0.6878
Punjab           0.3342   0.0676   0.3346   0.1623   0.4694   -0.1837    -0.5428
Uttrakhand       0.3144   0.0552   0.3216   0.1864   0.4416   -0.3060    -0.6048
Total            0.2603   0.0868   0.2617   0.0781   0.4862   0.0866     -0.8854
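A table of this kind comes from grouping the quantitative values by category and summarizing each group. The sketch below shows that grouping step with the standard library. The records are hypothetical and are not taken from the survey behind the table, and the function name `stats_by_category` is made up for this example.

```python
from collections import defaultdict
import statistics


def stats_by_category(records):
    """Summarize values per category; records is an iterable of (category, value)."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return {
        category: {
            "n": len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),      # sample standard deviation
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
        for category, values in groups.items()
    }


# Hypothetical (state, Di) records, for illustration only
records = [
    ("A", 0.10), ("A", 0.20), ("A", 0.30),
    ("B", 0.25), ("B", 0.35), ("B", 0.45),
]

summary = stats_by_category(records)
```

Comparing `summary["A"]` with `summary["B"]` is the same kind of across-level comparison the table supports for the states.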

16

Univariate graph by category Bar plot

17

Univariate graph by category Box plot

18

EDA summary

• All the techniques presented so far are the tools useful for EDA

• But without an understanding built from the EDA, effective use of tools is not possible

• EDA helps to answer a lot of questions

– What is a typical value?

– What is the uncertainty of a typical value?

– What is a good distributional fit for the data?

– What are the relationships between two attributes?

– etc.

19

The greatest value of a picture is when it forces us to notice what we never expected to see.

— John W. Tukey

The best thing about being a statistician is that you get to play in everyone’s backyard. - John W. Tukey

The obvious is that which is never seen until someone expresses it simply.

Kahlil Gibran

