Exploratory data analysis v1.0

(C) The School of Continuous Improvement v1.0


Exploratory Data Analysis

Disclaimer



This module on Exploratory Data Analysis is offered free of charge to individuals who wish to learn more about using these tools to understand their datasets better.

We recommend using these tools with the help of a mentor. Please contact us at [email protected] should you need mentoring on Exploratory Data Analysis.

Reproducing, distributing or selling this module for financial benefit will invite stringent action under the applicable law by the institution facilitating this module.


Body of knowledge



1. Stem and Leaf Plot

2. Box Plot

3. Median Polish

4. Resistant Line

5. Resistant Smooth

6. Rootogram

Introduction to Exploratory Data Analysis



Exploratory Data Analysis is an approach made up of a set of techniques that help you understand the data better without the need for significance or confidence level testing.

Uses of Exploratory Data Analysis are as below:

1. Get detailed insight into your dataset.

2. Understand some critical impact variables that influence the dataset.

3. Detect if any outliers are present in the dataset.

4. Test the underlying assumptions of the dataset.

Exploratory Data Analysis can be done in a matter of 3 minutes using Minitab or any other statistical software package.

Be surprised, though: we will use Microsoft Excel® to build these tools.



Stem and Leaf Plot


A contact center quality team evaluates 100 calls. The Quality Manager decides to review the quality scores of the operations floor. Let us draw a stem and leaf plot to understand the data. A snapshot of the data sheet is attached here. The data sheet can be found in the file EDA.xls.


Step 1 – Sort the data in ascending order.

Step 2 – Find the minimum and maximum values using the MIN and MAX functions.

Step 3 – Find the range using the formula MAX – MIN.

Step 4 – Construct the stems starting from 0 and ending with 8.

Rule for constructing stems – If you have a data set with 3-digit values, the stems would need to be constructed according to the hundreds place.

Steam and Leaf Plot


8

Step 5 – We need to write the formula to compute the leaves. For example, take Stem 3, highlighted with a yellow background. We need to count how many values fall in the 30s (30 to 39). Let us first write the formula that counts the values equal to 30.

Press Enter. See how the Leaf shows up as 0. This means we have one value of 30.

For the sake of simulation, let us change the first value of the dataset to 30. As we see here, you now have two values of 30. So, the formula works!


Step 6 – Let us now build the formula that counts all the other numbers in the 30s, i.e. 31, 32, 33, 34, 35, and so on up to 39.

Huh! That formula seems to never end, doesn't it? Well, do it just once and the rest is easy. Yes, it is some pain, but it is worth it!


Step 7 – The Stem and Leaf Plot is as shown here.

Step 8 – Let us use the LEN and SUBSTITUTE functions together to add the interpretation.
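If you would like to cross-check the Excel result, the same steps can be sketched in a few lines of Python. This is only a minimal sketch: it assumes non-negative whole-number scores below 100, so the stem is the tens digit and the leaf is the units digit. The function name stem_and_leaf and the sample scores are made up for illustration and are not the data in EDA.xls.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: "leaves"} where stem = tens part and leaf = units digit."""
    leaves = defaultdict(list)
    for v in sorted(values):           # Step 1: sort ascending
        stem, leaf = divmod(v, 10)     # e.g. 35 -> stem 3, leaf 5
        leaves[stem].append(leaf)
    lo, hi = min(leaves), max(leaves)  # smallest and largest stem present
    # Keep empty stems too, so gaps in the data stay visible
    return {s: "".join(str(d) for d in leaves.get(s, [])) for s in range(lo, hi + 1)}

# Hypothetical quality scores, for illustration only
scores = [30, 31, 35, 35, 42, 47, 58, 60, 61, 61, 79, 83]
plot = stem_and_leaf(scores)
for stem, leaf_string in plot.items():
    # len(leaf_string) is the number of values that fall under this stem
    print(f"{stem} | {leaf_string}  (n={len(leaf_string)})")
```

The count printed next to each stem is just the length of the leaf string, which is one way to read the row at a glance.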


1. You have an easier option to run a macro to generate the Stem and Leaf Plot, but VBA coding is not everyone’s cup of tea.

2. You could use some statistical software but that may turn out to be slightly expensive.

3. Using a few simple Excel formulas, you have built tool 1, which shows the granularity of the information in the dataset.

4. That is the Stem and Leaf Plot for you.



Box Plot


The granularity provided by the Stem and Leaf Plot is good, but at times you need a graph that shows the shape of the data, its distribution and its spread. That’s where we use the Box Plot. Let us draw a Box Plot to understand the data. 5 teams of a factory produce homogeneous units. The sampled cycle times are shown below.


Step 1 – Let us set up the table as seen here. We know how to calculate the Minimum and Maximum values.

Step 2 – Calculate the Median and the Quartile values using the formulas below.

Median: =MEDIAN(Data range)

1st Quartile: =PERCENTILE(Data range, 25%)

3rd Quartile: =PERCENTILE(Data range, 75%)
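For readers who want to verify the summary table outside Excel, here is a rough Python equivalent. It is a sketch under one assumption worth stating: Excel's PERCENTILE interpolates between data points including both ends, which corresponds to method="inclusive" in Python's statistics.quantiles. The function name five_number_summary and the cycle times for team_a are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Minimum, 1st quartile, median, 3rd quartile and maximum of a sample."""
    # method="inclusive" interpolates the way Excel's PERCENTILE does
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": min(data), "q1": q1, "median": median, "q3": q3, "max": max(data)}

# Hypothetical cycle times for one team, for illustration only
team_a = [12, 15, 18, 20, 22, 25, 30, 41, 55]
summary = five_number_summary(team_a)
for name, value in summary.items():
    print(f"{name}: {value}")
```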


Step 3 – Although you have prepared the basic data needed, we aren’t ready to draw the Box Plot yet. We need to prepare another table, the one shown here.

Step 4 – In the row titled Series 1, fetch the minimum values for the Teams. In the row titled Series 2, subtract the Minimum value from the 1st Quartile value, both taken from the Summary Range table.


Step 5 – In the row titled Series 3, subtract the 1st Quartile value from the Median value. In the row titled Series 4, subtract the Median from the 3rd Quartile value. In the row titled Series 5, subtract the 3rd Quartile value from the Maximum value.

Let us now try to draw the Box Plot.
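The five series are simply successive differences of the summary values, so that the stacked columns add back up to the maximum. A small sketch of that arithmetic follows; the function name box_plot_series and the summary numbers are assumptions for illustration, not values from the module's data.

```python
def box_plot_series(summary):
    """Turn a five-number summary into the stacked-column series from Steps 4 and 5."""
    return {
        "Series 1": summary["min"],                    # hidden base column
        "Series 2": summary["q1"] - summary["min"],    # becomes the lower whisker
        "Series 3": summary["median"] - summary["q1"], # lower half of the box
        "Series 4": summary["q3"] - summary["median"], # upper half of the box
        "Series 5": summary["max"] - summary["q3"],    # length of the upper whisker
    }

# Hypothetical summary values for one team
summary_a = {"min": 12, "q1": 18, "median": 22, "q3": 30, "max": 55}
series = box_plot_series(summary_a)
print(series)

# Sanity check: stacking all five pieces recovers the maximum
print(sum(series.values()) == summary_a["max"])
```

The sanity check at the end is a quick way to catch a subtraction done in the wrong direction.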


Step 6 – Select the data from Series 1 to Series 4. Don’t select Series 5 yet. We will do it later. Select 2D Column – Stacked Column Chart.


Step 7 – Obviously the chart is not a completed Box Plot. We need to work around a few things in Excel. Let us first hide Series 1 in the graph generated. To do this, right-click on Series 1 on the graph. Click on Format Data Series. Click on Fill. Select No fill. Click on Border Color. Select No color. See how the blue bars for Series 1 go away.


Step 8 – Repeat the steps from Step 7 for Series 2, but leave Series 2 selected on the graph.

Step 9 – We need to define the Whiskers. To do that, click on Layout, click on Error Bars and click on More Error Bars Options.

Step 10 – In the dialog box that opens, select Minus for Direction and change the Percentage to 100.


Step 11 – After Steps 9 and 10, the graph changes shape to what is seen here. Take a look at the graph.

Step 12 – Repeat Steps 9 and 10 for Series 4, with one small change: in the More Error Bars Options, set the Direction to Plus. You will see how the lower and upper whiskers are now defined.


Step 13 – Oops, something went wrong with the graph here. We have not defined the Maximum values yet.

Step 14 – Click on the lines at the top. Click on Layout, click on More Error Bars Options and, in the window that opens, select Custom and specify the values. Select the maximum values from the data-for-chart table, i.e. Series 5.


Step 15 – The Box Plot is ready now. We can now start interpreting. Obviously we spent some time making this Box Plot, but it is a one-time effort. Once you are able to construct this, you can use it as a Box Plot template.

Box Plot Interpretation

1. The Median cycle time for Team C seems the lowest at approximately 20 minutes.

2. Team A shows the greatest spread in data.

3. Data for Team A is also heavily skewed.

4. Team E seems to have a good % of population in the lower end of the cycle time.


1. Box Plot doesn’t confirm anything. It is thus not a confirmatory data analysis tool.

2. Because a Box Plot tells you about the central tendency, spread and shape of the data, you can use this EDA tool pretty much everywhere you have stratified data.

3. You can also use this tool where you just have one sample of data and you wish to study properties of that sample.



Median Polish


In inferential statistics, Analysis of Variance is a hypothesis-testing method that fits an additive model to a 2-way design and identifies data patterns not explained by the Row and Column variable effects. Median Polish does a similar thing, except that it uses medians instead of means. A company wishes to conduct a Median Polish on the percentage scores achieved by students in each course of an IT institution.

Table 1


Step 1 – First find the median of each course's scores and subtract it from the individual performance scores of that course. This is known as the 1st sweep.

Step 2 – Now do the 2nd sweep. In the second sweep, subtract the column median from table 2 (last row) and the row median from table 2 (last column), both highlighted, from the values of table 1. For the column median, take the difference between the 2nd sweep value of any cell and the corresponding cell in the 1st sweep.

Table 2

Table 3


Step 3 – Let’s do the 3rd sweep now. Subtract the row values obtained in table 3 from the row medians. Identify the new column medians in the 3rd sweep itself. The new row medians = Change Median – Median from table 3.

Table 4

Step 4 – Time for the 4th sweep. Subtract all the row values in table 4 from the 3rd sweep column median. This will give you the row values for the new table which we will be constructing. Also add the Column Median value to the 3rd Sweep Column Median.


Table 5


Table 6 – Final Residual Table
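The sweeps are tedious by hand, so it may help to see them as a loop. The sketch below follows the standard Tukey median polish (alternate row and column sweeps, keeping track of an overall value, row effects and column effects); the bookkeeping may differ slightly from the Excel tables in this module. The function name median_polish and the score table are hypothetical, with courses as rows and attendance levels as columns.

```python
import statistics

def median_polish(table, max_sweeps=10, tol=1e-9):
    """Split table[i][j] into overall + row effect + column effect + residual."""
    resid = [[float(v) for v in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_sweeps):
        # Row sweep: remove each row's median from its residuals
        row_med = [statistics.median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = statistics.median(col_eff)   # move the common part to 'overall'
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Column sweep: remove each column's median from its residuals
        col_med = [statistics.median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = statistics.median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop once a full pass no longer changes anything
        if max(abs(m) for m in row_med + col_med) <= tol:
            break
    return overall, row_eff, col_eff, resid

# Hypothetical percentage scores: rows = courses, columns = attendance levels
scores = [[37, 40, 44],
          [47, 50, 54],
          [52, 55, 59]]
overall, row_eff, col_eff, resid = median_polish(scores)
print("overall:", overall)
print("row effects:", row_eff)
print("column effects:", col_eff)
print("residuals:", resid)
```

At every stage the original value equals overall + row effect + column effect + residual, which is a handy check when comparing against the final residual table.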


Interpretations

1. The average test score performance across all the courses was 44.25%.

2. People who do JAVA programs alone score approximately 13 points less than those who do .NET.

3. Oh yes, look at the Column effects from the Residual table. Students with 90% attendance outscore the ones with 70% attendance by 5 points.


Final Notes

1. The tediousness of the calculations shouldn’t scare you away from this wonderful tool.

2. In a 2-way design where one of the factors may be categorical, Median Polish comes in very handy in establishing relationships.

3. With the residuals calculated by the Median Polish tool, you can also predict what could happen in the future.



Histogram


Histogram is another important EDA tool, which you can use when you wish to check the shape of the data. Importantly, a histogram will outline issues in the data like:

1. Modality issues

2. Skew issues

3. Mixed distribution issues

Let us go back to the cycle time data and try to plot the histogram with the help of Excel.


Step 1 – Let us first calculate the descriptive statistics measures for all the teams. As you can see from the table shown here, most of the formulas are basic except for the ones shaded with a light amber background.

IQR = 3rd Quartile – 1st Quartile

Bin width = 2 * IQR / Count^(1/3)

Number of bins = (Maximum – Minimum) / Bin width


Step 2 – Let us now define the bins. Start with the minimum value. For example, for Team A the first bin would be 0.32. The next bin would be 0.32 + Bin width (7.26). The third bin would be 7.53 + 7.26, and so on. Continue until you reach 7 bins.
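The bin calculations can be sketched in Python as a cross-check. Two assumptions are made here and should be treated as such: the bin width on the slide is read as the Freedman-Diaconis rule, 2 * IQR / Count^(1/3), and the number of bins is rounded up to a whole number (the slide does not say how it is rounded). The function name histogram_bins and the sample data are hypothetical.

```python
import math
import statistics

def histogram_bins(data):
    """Bin width, number of bins and bin edges, starting from the minimum."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                                  # IQR = 3rd Quartile - 1st Quartile
    width = 2 * iqr / len(data) ** (1 / 3)         # assumed Freedman-Diaconis rule
    n_bins = math.ceil((max(data) - min(data)) / width)  # rounding up is an assumption
    # First edge is the minimum; each following edge adds one bin width
    edges = [min(data) + k * width for k in range(n_bins + 1)]
    return width, n_bins, edges

# Hypothetical cycle times, for illustration only
cycle_times = [2, 4, 4, 5, 7, 9, 10, 12]
width, n_bins, edges = histogram_bins(cycle_times)
print("bin width:", width)
print("number of bins:", n_bins)
print("bin edges:", edges)
```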


Step 3 – Let us first draw the Histogram for one team’s metric performance, e.g. Team A.

Steps to draw a Histogram

1. Click on Data. Click on Data Analysis (if this option is not available, please install the Data Analysis Add-in).

2. From the Data Analysis dialog window, choose Histogram.

3. In the section showing Input variable, select the data corresponding to Team A.

4. In the section showing Bin range, select the Bin range corresponding to Team A.

5. Put a tick on Chart Output and click OK.


We achieved this nice-looking Histogram by reducing the Gap Width to 0% on the graph.


Interpretations

1. Bi-modality observed at 7.53 and 56. Is this due to an external issue?

2. If the bi-modality is resolved, we’d get close to a perfect distribution, but what is the reason for this bi-modality?

3. It could be a difference in suppliers, a difference in changeovers, a difference in raw materials. Anything?

Rootogram



Interpretations

1. Introduction of a new tool here. Instead of plotting the frequencies on the vertical axis, you plot the square root of each frequency, and what you have is known as the Rootogram.

2. The x-axis is the response variable instead of bins used in a Histogram.
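To make the idea concrete, here is a minimal Python sketch that counts how often each response value occurs and takes the square root of each count. The function name rootogram_heights and the sample values are made up for illustration.

```python
import math
from collections import Counter

def rootogram_heights(values):
    """Map each distinct response value to the square root of its frequency."""
    counts = Counter(values)
    return {x: math.sqrt(c) for x, c in sorted(counts.items())}

# Hypothetical response values: 1 occurs 4 times, 2 occurs 9 times, 3 occurs once
responses = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3]
heights = rootogram_heights(responses)
for value, height in heights.items():
    print(value, "#" * round(height * 4), height)   # crude text bar per value
```

Taking the square root shrinks tall bars more than short ones, so small frequencies are easier to see next to large ones.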

Histogram



Based on the 4 Histograms drawn for each of the teams, what can you infer? Which team’s data distribution is close to being a normal distribution?


Scatter Plot


In most projects we come across the fact that x impacts y. In other words, y = f(x). Using scatter plots, you can visually check whether there is a relationship between x and y. Let us use data for two variables, machine downtime and production capacity for a factory, to understand how a scatter plot works. Downtime is expressed in % and Production Capacity is expressed in tons.


Step 1 – Select the data, click on Insert, click on Scatter and click on Scatter with only Markers.

Step 2 – Voila, you are done. There you have the scatter chart as seen here.


Step 3 – Modification to a Regression equation

This is where you can use an EDA tool as an inferential statistics tool. Right-click on any point in the graph and click on Add Trendline. Select Linear, Display equation and Display R-Square.


Step 4 – Interpretation

The scatter graph itself visually revealed the absence of any strong correlation between downtime and production capacity; the regression statistics merely confirm it. The R-Square value needs to be > 0.64 for us to conclude strong correlation.
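The equation and R-Square that the linear trendline displays come from an ordinary least-squares fit, which can be sketched by hand in Python. For a straight-line fit with one x variable, R-Square is the square of the correlation coefficient r, so the 0.64 threshold corresponds to |r| > 0.8. The function name linear_fit and the downtime and capacity figures below are hypothetical, not the module's data.

```python
def linear_fit(x, y):
    """Least-squares slope, intercept and R-Square for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)                      # spread of x
    syy = sum((b - mean_y) ** 2 for b in y)                      # spread of y
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) # co-variation
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)                           # equals r squared
    return slope, intercept, r_squared

# Hypothetical data: downtime in %, production capacity in tons
downtime = [2, 4, 6, 8, 10]
capacity = [95, 90, 92, 85, 83]
slope, intercept, r_squared = linear_fit(downtime, capacity)
print(f"y = {slope:.2f}x + {intercept:.2f}, R-Square = {r_squared:.3f}")
print("strong correlation" if r_squared > 0.64 else "no strong correlation")
```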

Final Notes



1. This module covers most of the tools used in Exploratory data analysis.

2. Some other tools are:

a. Parallel Coordinates

b. Run Charts

c. Odds Ratio

d. Principal Components Analysis

e. Ordination

Please write to us at [email protected] if you have questions about using EDA tools, or follow The School of Continuous Improvement on LinkedIn.


https://www.linkedin.com/company/6426519?trk=tyah&trkInfo=clickedVertical%3Acompany%2Cidx%3A1-1-1%2CtarId%3A1429506009411%2Ctas%3AThe%20School%20of%20Continuous%20Improvement


Thank you….
