Exploratory data analysis v1.0

Transcript of Exploratory data analysis v1.0

Page 1: Exploratory data analysis v1.0

(C) The School of Continuous Improvement v1.0

Exploratory Data Analysis

Page 2: Exploratory data analysis v1.0

Disclaimer

This module on Exploratory Data Analysis is offered free of charge to individuals who wish to learn how to use these tools to understand their datasets better.

Usage of these tools is recommended with the help of a mentor. Please speak to us at [email protected], should you need our mentoring on Exploratory Data Analysis.

Reproducing this module or distributing or selling it to achieve financial benefits will invite stringent action under the concerned law of jurisdiction by the institution facilitating this module.

Page 3: Exploratory data analysis v1.0

Body of knowledge

1. Stem and Leaf Plot

2. Box Plot

3. Median Polish

4. Resistant Line

5. Resistant Smooth

6. Rootogram

Page 4: Exploratory data analysis v1.0

Introduction to Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach comprising a set of techniques that help you understand your data better without resorting to significance or confidence level testing.

Uses of Exploratory Data Analysis are as below:

1. Get detailed insight into your dataset.

2. Understand some critical impact variables that influence the dataset.

3. Detect if any outliers are present in the dataset.

4. Test the underlying assumptions of the dataset.

Exploratory Data Analysis can be done in a matter of 3 minutes using Minitab or any other statistical software package.

Be surprised, though: we will use Microsoft Excel® to work through these tools.

Page 5: Exploratory data analysis v1.0

Stem and Leaf Plot

Page 6: Exploratory data analysis v1.0

Stem and Leaf Plot

A contact center quality team evaluates 100 calls in the contact center. The Quality Manager decides to review the quality scores of the operations floor. Let us draw a stem and leaf plot to understand the data. A snapshot of the data sheet is attached here. This data sheet can be found in the file EDA.xls.

Page 7: Exploratory data analysis v1.0

Stem and Leaf Plot

Step 1 – Sort the data in ascending order.

Step 2 – Find the minimum and maximum values using the MIN and MAX functions.

Step 3 – Find the range using the formula MAX – MIN.

Step 4 – Construct the stems, starting from 0 and ending with 8.

Rule for constructing stems – If you have a data set with 3-digit values, the stems need to be constructed according to the hundreds place.

Page 8: Exploratory data analysis v1.0

Stem and Leaf Plot

Step 5 – We need to write the formula to compute the leaves. For example, take Stem 3, highlighted with a yellow background. We need to count how many values fall at or above 30 and below 40. Let us first write the formula to count the values that are exactly 30.

Press Enter. See how the Leaf shows up as 0. Now, this means we have one value of 30.

Let us change the first value of the dataset to 30, for the sake of simulation. As we see here, you now have two values of 30. So, the formula works!

Page 9: Exploratory data analysis v1.0

Stem and Leaf Plot

Step 6 – Let us now build the formula which will count all the numbers in the series of 30-40, i.e. 31, 32, 33, 34, 35, and so on.

Huh! That formula seems to never end, does it? Well, do it just once and it becomes easy from then on. Yes, it is some pain, but it is worth it!

Page 10: Exploratory data analysis v1.0

Stem and Leaf Plot

Step 7 – The Stem and Leaf Plot is as shown here.

Step 8 – Let us use the LEN and SUBSTITUTE functions together to add the interpretation.
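For readers who want to cross-check the Excel result, the same idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are two-digit whole numbers, the stem is the tens digit and the leaf is the units digit. The function name stem_and_leaf and the sample scores are hypothetical and are not taken from EDA.xls.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative whole numbers by stem (tens) and leaf (units)."""
    plot = defaultdict(str)
    for v in sorted(values):              # Step 1: sort in ascending order
        stem, leaf = divmod(v, 10)        # stem = tens part, leaf = units digit
        plot[stem] += str(leaf)
    lowest, highest = min(values) // 10, max(values) // 10
    # Keep empty stems too, so that gaps in the data stay visible.
    return {s: plot.get(s, "") for s in range(lowest, highest + 1)}


# Hypothetical quality scores, for illustration only.
scores = [30, 31, 35, 42, 47, 47, 58, 63, 30]
for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {leaves}")
```

Each printed row is one stem followed by its leaves, which is the same picture that the Excel formulas build up cell by cell.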

Page 11: Exploratory data analysis v1.0

Stem and Leaf Plot

1. You have an easier option to run a macro to generate the Stem and Leaf Plot, but VBA coding is not everyone’s cup of tea.

2. You could use some statistical software but that may turn out to be slightly expensive.

3. With the use of some simple Excel formulas, you have discovered tool 1 which is used to show granularity in information in the dataset.

4. That is the Stem and Leaf Plot for you.

Page 12: Exploratory data analysis v1.0

Box Plot

Page 13: Exploratory data analysis v1.0

Box Plot

Granularity as provided by the Stem and Leaf Plot is good, but at times you need a graph that shows the shape of the data, its distribution and its spread. That's where we use the Box Plot. Let us draw a Box Plot to understand the data. 5 teams of a factory produce homogeneous units. The sampled cycle times are shown below.

Page 14: Exploratory data analysis v1.0

Box Plot

Step 1 – Let us set up the table as seen here. We know how to calculate the Minimum and Maximum values.

Step 2 – Calculate the Median and the Quartile values using the formulas below.

Median: =MEDIAN(Data range)

1st Quartile: =PERCENTILE(Data range, 25%)

3rd Quartile: =PERCENTILE(Data range, 75%)

Page 15: Exploratory data analysis v1.0

Box Plot

Step 3 – Although you have prepared the basic data needed, we aren't ready to draw the Box Plot yet. We need to prepare another table, the one shown here.

Step 4 – In the row titled Series 1, fetch the minimum values for the Teams. In the row titled Series 2, subtract the Minimum value from the 1st Quartile value, both taken from the Summary Range table.

Page 16: Exploratory data analysis v1.0

Box Plot

Step 5 – In the row titled Series 3, subtract the 1st Quartile value from the Median value. In the row titled Series 4, subtract the Median from the 3rd Quartile value. In the row titled Series 5, subtract the 3rd Quartile value from the Maximum value. Let us now try to draw the Box Plot.
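The summary table and the five chart series from Steps 2 to 5 can be sketched in Python as a cross-check. This is a minimal illustration, not the Excel workbook itself: the function name box_plot_series and the sample cycle times are hypothetical, and the "inclusive" quantile method is used on the assumption that it interpolates the same way as Excel's PERCENTILE.

```python
import statistics


def box_plot_series(data):
    """Return the five stacked-chart series described in Steps 4 and 5."""
    # Quartiles with inclusive linear interpolation (assumed to match PERCENTILE).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    lowest, highest = min(data), max(data)
    return {
        "Series 1": lowest,           # hidden base of the stacked column
        "Series 2": q1 - lowest,      # carries the lower whisker
        "Series 3": median - q1,      # lower half of the box
        "Series 4": q3 - median,      # upper half of the box
        "Series 5": highest - q3,     # length of the upper whisker
    }


# Hypothetical cycle times for one team, for illustration only.
team_a = [12, 15, 20, 22, 30]
for name, value in box_plot_series(team_a).items():
    print(name, value)
```

Because each series is a difference between neighbouring summary values, the five series add up to the maximum, which is why stacking them reproduces the box and whisker positions.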

Page 17: Exploratory data analysis v1.0

Box Plot

Step 6 – Select the data from Series 1 to Series 4. Don't select Series 5 yet; we will do that later. Select 2D Column – Stacked Column Chart.

Page 18: Exploratory data analysis v1.0

Box Plot

Step 7 – Obviously, the chart is not yet a complete Box Plot. We need to work around a few things in Excel. Let us first hide Series 1 in the graph generated. To do this, right-click on Series 1 on the graph. Click on Format Data Series. Click on Fill. Select No fill. Click on Border Color. Select No color. See how the blue bars for Series 1 go away.

Page 19: Exploratory data analysis v1.0

Box Plot

Step 8 – Repeat the steps from Step 7 on the previous slide for Series 2, but leave Series 2 selected on the chart.

Step 9 – We need to define the Whiskers. To do that, click on Layout, click on Error Bars and click on More Error Bar options.

Step 10 – In the dialog window that opens, select Minus for Direction and change the percentage to 100.

Page 20: Exploratory data analysis v1.0

Box Plot

Step 11 – After Step 9 and Step 10, the graph changes shape to what is seen here. Take a look at the graph.

Step 12 – Repeat Steps 9 and 10 for Series 4, with one small change: in the More Error Bar options, set the Direction to Plus. You will see that the lower and upper whiskers are now defined.

Page 22: Exploratory data analysis v1.0

Box Plot

Step 13 – Oops, something went wrong with the graph here: we have not defined the Maximum values.

Step 14 – Click on the lines at the top. Click on Layout, click on More Error Bar options and, in the window that opens, select Custom and specify the values. Select the maximum values from the data-for-chart table, i.e. Series 5.

Page 23: Exploratory data analysis v1.0

Box Plot

Step 15 – The Box Plot is ready now. We can now start interpreting. Obviously we spent some time making this Box Plot, but it is a one-time effort. Once you are able to construct this, you can use it as a Box Plot template.

Box Plot Interpretation

1. The Median cycle time for Team C seems the lowest at approximately 20 minutes.

2. Team A shows the greatest spread in data.

3. Data for Team A is also heavily skewed.

4. Team E seems to have a good % of population in the lower end of the cycle time.

Page 24: Exploratory data analysis v1.0

Box Plot

1. Box Plot doesn’t confirm anything. It is thus not a confirmatory data analysis tool.

2. Given the fact that a Box Plot is able to tell you information about central tendency, spread and shape of the data, you can use this EDA tool pretty much everywhere you have stratified data.

3. You can also use this tool where you just have one sample of data and you wish to study properties of that sample.

Page 25: Exploratory data analysis v1.0

Median Polish

Page 26: Exploratory data analysis v1.0

Median Polish

In Inferential statistics, Analysis of Variance is a Hypothesis testing measure that fits an additive model to a 2-way design and identifies data patterns not explained by Row and Column variable effects. Median Polish does a similar thing except that Median Polish will use Medians. A company wishes to conduct a Median Polish on the percentage scores achieved by students in each course of an IT institution.

Table 1

Page 27: Exploratory data analysis v1.0

Median Polish

Step 1 – First find the median of each course's scores and subtract that median from each individual mean performance score of the course. This is known as the 1st sweep.

Step 2 – Now, do the 2nd sweep. In the second sweep, subtract the column median from Table 2 (last row) and the row median from Table 2 (last column), both highlighted, from the values of Table 1. For the column median, take the difference between the 2nd sweep value of any cell and the corresponding cell in the 1st sweep.

Table 2

Table 3

Page 28: Exploratory data analysis v1.0

Median Polish

Step 3 – Let’s do the 3rd sweep now. Subtract the row values obtained in table 3 from the row medians. Identify the new column medians in the 3rd sweep itself. The new row medians = Change Median – Median from table 3.

Table 4

Step 4 – Time for the 4th sweep. Subtract all the row values in Table 4 from the 3rd sweep column median. This will give you the row values for the new table that we will construct. Also add the Column Median value to the 3rd Sweep Column Median.

Page 29: Exploratory data analysis v1.0

Median Polish

Table 5

Page 30: Exploratory data analysis v1.0

Median Polish

Table 6 – Final Residual Table
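As a cross-check on the hand calculations, the sweeps can be sketched in Python. This is a minimal sketch of the textbook median polish (remove row medians, then column medians, and accumulate the removed amounts as effects); its bookkeeping may differ in detail from the tables in this module. The function name median_polish, the iterations parameter and the sample scores are hypothetical.

```python
import statistics


def median_polish(table, iterations=2):
    """Fit value = overall + row effect + column effect + residual using medians."""
    rows, cols = len(table), len(table[0])
    resid = [[float(x) for x in row] for row in table]
    overall = 0.0
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols
    for _ in range(iterations):           # each pass = one row sweep + one column sweep
        # Row sweep: move each row's median out of the residuals.
        for i in range(rows):
            m = statistics.median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        m = statistics.median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Column sweep: move each column's median out of the residuals.
        for j in range(cols):
            m = statistics.median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
        m = statistics.median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


# Hypothetical scores: rows are courses, columns are attendance bands.
scores = [[50, 55, 60], [40, 45, 50], [60, 65, 82]]
overall, row_eff, col_eff, resid = median_polish(scores)
print("Overall:", overall)
print("Row effects:", row_eff)
print("Column effects:", col_eff)
print("Residuals:", resid)
```

With the default of two passes the sketch performs four sweeps in total. At every stage each original value equals the overall value plus its row effect, its column effect and its residual, which is a useful check on a spreadsheet version too.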

Page 31: Exploratory data analysis v1.0

Median Polish

Interpretations

1. The average test score performance across all the courses was 44.25%.

2. People who do JAVA programs alone score approximately 13 points less than those who do .NET.

3. Look at the Column effects in the Residual table: students with 90% attendance outscore those with 70% attendance by 5 points.

Page 32: Exploratory data analysis v1.0

Median Polish

Final Notes

1. The tediousness of the calculations shouldn't scare you away from this wonderful tool.

2. In a 2*2 design where there is a possibility that one of them is categorical, Median polish comes in very handy in establishing relationships.

3. With the power of calculating residuals with the Median Polish tool, you can also predict what could happen in the future.

Page 33: Exploratory data analysis v1.0

Histogram

Page 34: Exploratory data analysis v1.0

Histogram

Histogram is another important EDA tool, which you can use when you wish to check the shape of the data. Importantly, a histogram will highlight issues in the data such as:

1. Modality issues

2. Skew issues

3. Mixed distribution issues

Let us go back to the cycle time data and try to plot the histogram with the help of Excel.

Page 35: Exploratory data analysis v1.0

Histogram

Step 1 – Let us first calculate the descriptive statistics measures for all the teams. As you can see from the table shown here, most of the formulas are basic except for the ones shaded with a light amber background.

IQR = 3rd Quartile – 1st Quartile

Bin width = 2 * Count^(1/3)

Number of bins = (Maximum – Minimum) / Bin width

Page 36: Exploratory data analysis v1.0

Histogram

Step 2 – Let us now define the bins. Start with the minimum value. For example, for Team A the first bin would be 0.32. The next bin will be = 0.32 + Bin Size (7.26). The third bin would be 7.53 + 7.26 and so on. Continue this until you reach 7 bins.
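The bin calculations from Steps 1 and 2 can be sketched in Python as follows. This is an illustration only. It takes the bin width rule exactly as it is written in Step 1, rounds the number of bins up to a whole number (an assumption of this sketch), and adds one extra boundary so that the maximum value is covered. The function name histogram_bins and the sample data are hypothetical.

```python
import math
import statistics


def histogram_bins(data):
    """Return IQR, bin width, number of bins and bin boundaries for one team."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                                  # IQR = 3rd Quartile - 1st Quartile
    width = 2 * len(data) ** (1 / 3)               # bin width rule as written in Step 1
    n_bins = math.ceil((max(data) - min(data)) / width)   # rounded up (assumption)
    # Start at the minimum and keep adding the bin width.
    edges = [min(data) + k * width for k in range(n_bins + 1)]
    return iqr, width, n_bins, edges


# Hypothetical cycle times, for illustration only.
cycle_times = [2, 3, 5, 7, 9, 11, 13, 20]
iqr, width, n_bins, edges = histogram_bins(cycle_times)
print("IQR:", iqr)
print("Bin width:", width)
print("Number of bins:", n_bins)
print("Bin boundaries:", edges)
```

The list of boundaries plays the role of the Bin range that Excel's Histogram tool asks for.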

Page 37: Exploratory data analysis v1.0

Histogram

Step 3 – Let us first draw the Histogram for one team's metric performance, e.g. Team A.

Steps to draw a Histogram

1. Click on Data. Click on Data Analysis (if this option is not available, please enable the Data Analysis add-in, the Analysis ToolPak).

2. From the Data Analysis dialog window, choose Histogram.

3. In the section showing Input variable, select the data corresponding to Team A.

4. In the section showing Bin range, select the Bin range corresponding to Team A.

5. Put a tick on Chart Output and click OK.

Page 38: Exploratory data analysis v1.0

Histogram

We achieved this nice-looking Histogram by reducing the Gap Width to 0% on the graph.

Page 39: Exploratory data analysis v1.0

Histogram

Interpretations

1. Bi-modality observed at 7.53 and 56. Is this due to an external issue?

2. If the bi-modality is resolved, we'd get close to a perfect distribution, but what is the reason for this bi-modality?

3. It could be a difference in suppliers, a difference in changeovers, or a difference in raw materials. Anything?

Page 40: Exploratory data analysis v1.0

Rootogram

Interpretations

1. This is a new tool. Instead of plotting the frequencies on the vertical axis, you plot the square root of each frequency, and what you have is known as the Rootogram.

2. The x-axis is the response variable instead of the bins used in a Histogram.
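The two points above can be sketched in Python. This is a minimal illustration that counts how often each value of the response variable occurs and takes the square root of each count for the vertical axis. The function name rootogram_heights and the sample values are hypothetical.

```python
import math
from collections import Counter


def rootogram_heights(values):
    """Map each observed response value to the square root of its frequency."""
    counts = Counter(values)
    return {x: math.sqrt(c) for x, c in sorted(counts.items())}


# Hypothetical response values, for illustration only.
responses = [1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3]
for value, height in rootogram_heights(responses).items():
    print(value, round(height, 2))
```

Taking the square root compresses the tall bars more than the short ones, so small frequencies remain visible next to large ones.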

Page 41: Exploratory data analysis v1.0

Histogram

Based on the 4 Histograms drawn for each of the teams, what can you infer? Which team’s data distribution is close to being a normal distribution?

Page 43: Exploratory data analysis v1.0

Scatter Plot

Page 44: Exploratory data analysis v1.0

Scatter Plot

Most times in projects we stumble upon the fact that x impacts y. In other words, y = f(x). Using scatter plots, you can visually understand whether there is a relationship between x and y. Let us use data for two variables, machine downtime and production capacity for a factory, to understand how a scatter plot works. Downtime is expressed in % and Production Capacity is expressed in tons.

Page 45: Exploratory data analysis v1.0

Scatter Plot

Step 1 – Select the data, click on Insert, click on Scatter and click on Scatter with only markers.

Step 2 – Voila, you are done. There you have the scatter chart as seen here.

Page 46: Exploratory data analysis v1.0

Scatter Plot

Step 3 – Modification to a Regression equation

This is where you can use an EDA tool as an Inferential statistics tool. Right-click on any point in the graph and click on Add Trendline. Select Linear, Display Equation on chart and Display R-squared value on chart.

Page 47: Exploratory data analysis v1.0

Scatter Plot

Step 4 – Interpretation

While the scatter graph itself visually revealed the absence of any strong correlation between downtime and production capacity, the regression statistics merely confirm it. The R-Square value needs to be > 0.64 for us to conclude that there is a strong correlation.
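The trendline equation and the R-Square value can also be computed by hand, which helps when checking what Excel displays. The sketch below fits an ordinary least squares line and applies the 0.64 threshold mentioned above. The function name linear_fit and the downtime and capacity figures are hypothetical and are not the data used in this module.

```python
def linear_fit(xs, ys):
    """Return slope, intercept and R-Square of the least squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)     # square of the correlation coefficient
    return slope, intercept, r_squared


# Hypothetical downtime (%) and production capacity (tons), for illustration only.
downtime = [2, 4, 5, 7, 9, 12]
capacity = [98, 91, 95, 85, 88, 80]
slope, intercept, r_squared = linear_fit(downtime, capacity)
print(f"y = {slope:.2f}x + {intercept:.2f}, R-Square = {r_squared:.2f}")
print("Strong correlation" if r_squared > 0.64 else "No strong correlation")
```

For a single x variable, R-Square is simply the square of the correlation coefficient, so the 0.64 threshold corresponds to a correlation of 0.8 in either direction.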

Page 48: Exploratory data analysis v1.0

Final Notes

1. This module covers most of the tools used in Exploratory data analysis.

2. Some other tools are:

a. Parallel Coordinates

b. Run Charts

c. Odds Ratio

d. Principal Components Analysis

e. Ordination

Please write to us at [email protected] if you have doubts about the usage of EDA tools, or follow us on LinkedIn at The School of Continuous Improvement.

Page 49: Exploratory data analysis v1.0

Thank you.