PEOPLE ANALYTICS TUTORIAL: Employee Churn

In this tutorial, I will demonstrate how to build, evaluate, and deploy a predictive model using R. To start, I will briefly introduce my view of employee churn and summarize the data science process.

(Note: we assume you're familiar with R)

You may be asking: what is employee churn? In a word, "turnover": it is when employees leave the organization. In another word, "terminates", whether voluntary or involuntary. In the widest sense, churn/turnover concerns both the calculation of the rates at which people leave the organization and the individual terminates themselves.

Most of the focus in the past has been on the rates, not on the individual terminates. We calculate past turnover rates in an attempt to predict future turnover rates, and it is important to do that and to continue to do so. Data warehousing tools are very powerful in this regard: they slice and dice this data efficiently over different time periods and at different levels of granularity. BUT it is only half the picture.

These rates only show the impact of churn/turnover in the aggregate. In addition, you might be interested in predicting exactly which employees may be at high risk of leaving the organization, hence the interest in the individual records as well as the aggregate. Statistically speaking, churn is churn regardless of context: it is when a member of a population leaves that population.

INTRODUCTION


One of the examples you will see in Microsoft Azure ML and in many data science textbooks is "customer" churn, from the marketing context. In many businesses, such as cell phone companies, it is far harder to generate and attract new customers than it is to keep existing ones, so businesses want to do what they can to retain them. When a customer leaves, that is customer churn for that particular company.

This kind of thinking and mindset applies to Human Resources in an organization as well. It is far less expensive to keep good employees once you have them than to attract and train new ones.

So we have a marketing principle that applies to the management of human resources, and a set of data science algorithms that can help determine whether there are patterns of churn in our data that could help predict future churn.

HR truly needs to start thinking outside its traditional methodologies to powerfully address the HR challenges and issues of the future. On a personal level, I like to think of People Analytics as the data science process applied to HR information. For that reason, I would like to revisit what that process is and use it as the framework to guide the rest of the example illustrated in this tutorial.


THE DATA SCIENCE PROCESS REVISITED

1. DEFINE A GOAL
As mentioned above, this means first identifying what HR management business problem you are trying to solve. Without a problem or issue we don't have a goal.

3. BUILD THE MODEL
After you have defined the HR business problem or goal you are trying to achieve, you pick a data mining approach/tool designed to address that type of problem. With employee churn you are trying to predict who might leave, as contrasted with those who stay. The business problem/goal determines the appropriate data mining tools to consider. A non-exhaustive list of common data mining approaches used in modelling: classification, regression, anomaly detection, time series, clustering, and association analysis. These approaches take information/data as inputs, run them through statistical algorithms, and produce output.



2. COLLECT AND MANAGE DATA
At its simplest, you want a dataset of information perceived to be relevant to the problem. The collection and management of data could be a simple extract from the corporate Human Resource Information System, or output from an elaborate data warehousing/business intelligence tool used on HR information. For the purposes of this tutorial we will use a simple CSV file. This step also involves exploring the data, both for data quality issues and for an initial look at what the data may be telling you.




4. EVALUATE AND CRITIQUE THE MODEL
Each data mining approach can have many different statistical algorithms to bring to bear on the data. The evaluation asks both which algorithms provide the most consistently accurate predictions on new data, and whether we have all the relevant data or need more types of data to increase the model's predictive accuracy on new data. This is necessarily a repetitive, circular activity over time to improve the model.

6. DEPLOY THE MODEL
The whole purpose of building the model (which is built on existing data) is to:

* use the model on future data when it becomes available, to predict or prevent something before it occurs, or;

* better understand our existing business problem in order to tailor more specific responses.



5. PRESENT RESULTS AND DOCUMENT
When we have gotten our model to an acceptable, useful predictive level, we document our activity and present the results. The definition of acceptable and useful is relative to the organization, but in all cases means the results show improvement over what would have happened otherwise. The principle behind data "science", like any science, is that with the same data, other people should be able to reproduce our findings/results.



STEP 1 DEFINE GOAL

Our hypothetical company found that its previous application of People Analytics (applying the data science process to organizational absenteeism) yielded some valuable insights that are now shaping how it will address absenteeism in the future.

It now wants to apply these same data science principles and steps to another HR issue: employee churn. It realizes that when good people leave, it costs far more to replace them than to provide incentives to keep them. So it would like to be data-driven in the HR decisions it makes with respect to employee retention.

The following questions are among the ones it would like answered:

1. What proportion of our staff are leaving?

2. Where is it occurring?

3. How do age and length of service affect termination?

4. What, if anything, else contributes to it?

5. Can we predict future terminations?

6. If so, how well can we predict?


STEP 2 COLLECT AND MANAGE THE DATA

Often the data to analyze the problem starts with what is currently readily available. After some initial prototyping of predictive models, ideas surface for additional data collection to further refine the model. Since this is a first stab, the organization uses only what is readily available.

After consulting with their HRIS staff, they found that they have access to the following information:

Employee ID, Record Date, Birth Date, Original Hire Date, Termination Date (if terminated), Age, Length of Service, City, Department, Job Title, Store Name, Gender, Termination Reason, Termination Type (voluntary or involuntary), Status Year (year of data), Status (ACTIVE or TERMINATED during the status year), Business Unit (Stores or Head Office).

The company found that it has 10 years of good data, from 2006 to 2015. It wants to use 2006-2014 as training data and 2015 as the data to test on. The data consists of:

* a snapshot of all active employees at the end of each of those years, combined with
* terminations that occurred during each of those years.

Therefore, each year will have records with a status of either "ACTIVE" or "TERMINATED". Of the information items listed above, STATUS is the dependent variable, the category to be predicted. Many of the others are independent variables: potential predictors.


LET'S LOOK AT THE DATA
Let's load in the data. (By the way, the data below is totally contrived.) You can find the datasets for this company "here". Feel free to use these datasets for the tutorial.

Step 1: Load in the data "Click here"

Step 2: View summary of data: "Click here"

(A cursory look at the summary doesn't turn up anything that jumps out as a data quality issue.)
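The two steps above can be sketched in R as follows. This is a minimal sketch only: the filename is hypothetical, and a few invented rows stand in for the real dataset (linked above) so the snippet is self-contained.

```r
# Minimal sketch of Steps 1-2. With the real file you would use read.csv();
# synthetic rows stand in here so the snippet runs on its own.
# (Column names follow the list given earlier; all values are invented.)
emp <- data.frame(
  EmployeeID    = 1:6,
  Age           = c(25, 34, 61, 45, 29, 58),
  LengthService = c(1, 5, 15, 20, 2, 25),
  BusinessUnit  = c("STORES", "STORES", "HEADOFFICE",
                    "STORES", "STORES", "STORES"),
  STATUS        = c("ACTIVE", "ACTIVE", "TERMINATED",
                    "ACTIVE", "TERMINATED", "ACTIVE"),
  STATUS_YEAR   = c(2014, 2014, 2014, 2015, 2015, 2015)
)
# emp <- read.csv("employee_data.csv")  # hypothetical filename for real data

str(emp)      # column names, types, sample values
summary(emp)  # per-column summaries; scan for NAs and odd ranges
```

With the real data, the summary() output is where data quality issues (missing values, impossible ages, odd factor levels) would first show up.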


Earlier we indicated that, for each of the 10 years from 2006 to 2015, we have both active records at end of year and terminates during the year. To have a population to model from (to differentiate ACTIVES from TERMINATES) we have to include both status types. It's useful, then, to get a baseline of what percent/proportion the terminates are of the entire population. It also answers our first question. Let's look at that next.

Step 3: Calculate the percentage of our staff that is leaving "Click here"

It looks like it ranges from 1.97% to 4.85%, with an average of 2.99%.
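The Step 3 calculation can be sketched in base R like this, with a tiny invented frame standing in for the real data:

```r
# Percent of records terminated per status year (Step 3 sketch).
emp <- data.frame(
  STATUS      = c("ACTIVE", "ACTIVE", "TERMINATED",
                  "ACTIVE", "TERMINATED", "ACTIVE"),
  STATUS_YEAR = c(2014, 2014, 2014, 2015, 2015, 2015)
)
term_rate <- aggregate(STATUS ~ STATUS_YEAR, data = emp,
                       FUN = function(s) round(100 * mean(s == "TERMINATED"), 2))
names(term_rate)[2] <- "PctTerminated"
term_rate                      # one termination rate per year
mean(term_rate$PctTerminated)  # average rate across years
```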

Where are the terminations occurring? Let's look at some charts:

Step 4: Produce chart per business unit "Click here"
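A chart like the one in Step 4 can be sketched as below, assuming the ggplot2 package is installed (rattle's generated code may differ); the rows are invented so the snippet runs on its own.

```r
# Status-by-business-unit bar chart sketch (Step 4), using ggplot2.
library(ggplot2)
emp <- data.frame(
  BusinessUnit = c("STORES", "STORES", "HEADOFFICE",
                   "STORES", "STORES", "STORES"),
  STATUS       = c("ACTIVE", "ACTIVE", "TERMINATED",
                   "ACTIVE", "TERMINATED", "ACTIVE")
)
p <- ggplot(emp, aes(x = BusinessUnit, fill = STATUS)) +
  geom_bar(position = "dodge") +                 # side-by-side bars per status
  labs(title = "Employee status by business unit", y = "Count")
print(p)
```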


Step 5: Produce chart of (only) terminates by termination type andstatus year "Click here"

Generally, most terminations seem to be voluntary year by year, except in the most recent years, where there are some involuntary terminates. By the way, these kinds of insights are often included in a workforce dashboard.

It looks like terminates in the last 10 years have predominantly occurred in the STORES business unit. There is only 1 terminate in HR Technology, which is in the head office.

Let's explore just the terminates for a few moments.


Step 6: Produce chart of (only) terminates by status year andtermination Reason "Click here"

It seems that there were layoffs in 2014 and 2015, which accounts for the involuntary terminates.

Let's take this a bit further...


When we look at the terminates by department, a few things stick out. Customer Service has a much larger proportion of resignations compared to other departments, and retirement in general is high in a number of departments.

Step 7: Produce chart of terminates By terminationreason and department "Click here"


Step 8: Now let's find out how age and length ofservice affect termination "Click here"

The density plots show some interesting things. For terminates, age shows some elevation from 20 to 30 and a spike at 60. For length of service there are five spikes: one around 1 year, another around 5 years, a big one around 15 years, and a couple at 20 and 25 years.
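A density plot of this kind can be sketched as follows, again assuming ggplot2 is installed and using invented ages:

```r
# Density-plot sketch for age by status (Step 8), using ggplot2.
library(ggplot2)
emp <- data.frame(
  Age    = c(22, 25, 29, 34, 40, 45, 58, 60, 61),
  STATUS = c("TERMINATED", "ACTIVE", "TERMINATED", "ACTIVE", "ACTIVE",
             "ACTIVE", "ACTIVE", "TERMINATED", "TERMINATED")
)
p <- ggplot(emp, aes(x = Age, fill = STATUS)) +
  geom_density(alpha = 0.5) +   # translucent, so the two curves overlap visibly
  labs(title = "Age distribution by status")
print(p)
```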


Step 9: The next plot we're going to produce revolvesaround age and length of service distributions bystatus "Click here"

The boxplots show a higher average age for terminates compared to actives. Length of service shows not much difference between active and terminated.
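The boxplot version of the same comparison, as a sketch with ggplot2 assumed installed and invented data:

```r
# Boxplot sketch of age by status (Step 9), using ggplot2.
library(ggplot2)
emp <- data.frame(
  Age    = c(22, 25, 29, 34, 40, 45, 58, 60, 61),
  STATUS = c("TERMINATED", "ACTIVE", "TERMINATED", "ACTIVE", "ACTIVE",
             "ACTIVE", "ACTIVE", "TERMINATED", "TERMINATED")
)
p <- ggplot(emp, aes(x = STATUS, y = Age)) +
  geom_boxplot() +              # median, quartiles, and outliers per status
  labs(title = "Age by employment status")
print(p)
```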

That’s a brief general look at some of what the data is telling us.Our next step of course is model building.


STEP 3 BUILD THE MODEL

It should be mentioned again that for building models, we never want to use all our data to build the model. That can lead to overfitting, where the model might predict well on the current data it was built on, but may not predict well on data it hasn't seen.

We have 10 years of historical data. We will use the first 9 to train the model, and the 10th year to test it. Moreover, we will use 10-fold cross-validation on the training data as well.

So before we actually try out a variety of modelling algorithms, we need to partition the data into training and testing datasets.

Let's go ahead and partition the data!


Step 1: Partition the data "Click here"
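The year-based partition described above can be sketched in base R; the rows here are invented stand-ins for the real data.

```r
# Partition sketch: 2006-2014 rows become training data, 2015 rows become
# the held-out test data, matching the scheme described above.
emp <- data.frame(
  STATUS_YEAR = c(2006, 2010, 2014, 2015, 2015),
  STATUS      = c("ACTIVE", "TERMINATED", "ACTIVE", "ACTIVE", "TERMINATED")
)
train <- subset(emp, STATUS_YEAR <= 2014)
test  <- subset(emp, STATUS_YEAR == 2015)
nrow(train)  # rows available for model building
nrow(test)   # held-out rows for evaluation
```

Note this is a split by time rather than a random split, which suits the goal of predicting a future year from past years.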

One of the things that characterizes R is the huge number of functions and packages available, so there are often many ways of doing things. Two of the best R packages designed for data science are caret and rattle.

In this tutorial I will use rattle. What is noteworthy about rattle is that it provides a GUI front end and records the generated code in a log on the back end, so you can generate models quickly.

I won't be illustrating how to use the rattle GUI in this tutorial; rather, I will show the code it generated along with the statistical results and graphs. Please don't get hung up on, or turned off by, the code presented: the GUI front end generated all the code below, and I simply made cosmetic changes to it. Concentrate on the flow of the data science process as one example of how it can be done. Using the rattle GUI, I was able to generate all the output below in about 15 minutes of effort. One tutorial on the rattle GUI can be found here:

"RATTLE"

And here is a book on rattle: "BOOK"


We should step back for a moment and review what we are doing here and what our opening questions were. We want to predict who might terminate in the future. That is a binary result or category: a person is either ACTIVE or TERMINATED. Since it is a category to be predicted, we will choose among models/algorithms that can predict categories.

The models we will look at in rattle are:

Decision Trees (rpart)

Boosted Models (adaboost)

Random Forests (rf)

Support Vector Machines (svm)

Linear Models (glm)


Step 2: Create decision tree "Click here"

Let's first take a look at a decision tree model. This is always useful because you get a visual tree that shows, in an easy-to-understand way, some idea of how the prediction occurs.
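A decision-tree fit of this kind can be sketched with rpart (a recommended package that ships with R). This is not the rattle-generated code from the tutorial: the data below is synthetic, with an invented age rule so the tree has a signal to find.

```r
# Decision-tree sketch with rpart, on synthetic data.
library(rpart)
set.seed(42)
n <- 200
train <- data.frame(
  Age           = sample(20:65, n, replace = TRUE),
  LengthService = sample(1:25,  n, replace = TRUE),
  Gender        = factor(sample(c("M", "F"), n, replace = TRUE))
)
# Invented rule: older employees are more likely terminated (noisy threshold).
train$STATUS <- factor(ifelse(train$Age + rnorm(n, sd = 5) > 55,
                              "TERMINATED", "ACTIVE"))
tree <- rpart(STATUS ~ Age + LengthService + Gender,
              data = train, method = "class")
printcp(tree)             # complexity table for the fitted splits
# plot(tree); text(tree)  # quick visual of the tree
```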

Now for the other models...

We can now answer our next question from above:

What, if anything, else contributes to it?

Even from the graphical tree output, it looks like gender and status year also affect it.

Step 3: Random forests "Click here"

Step 4: Adaboost "Click here"

Step 5: Support Vector Machines "Click here"

Step 6: Linear Models "Click here"

These were simply vanilla runs of these models. In evaluating the models, we have the means to compare their results on a common basis.
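Of the five model families listed above, the linear (logistic) model can be sketched in base R alone; the others need the randomForest, ada, and e1071 packages, which are assumptions about your setup. A base-R sketch on the same kind of synthetic data:

```r
# Logistic regression sketch (the "linear model" family), base R only.
set.seed(42)
n <- 200
train <- data.frame(
  Age           = sample(20:65, n, replace = TRUE),
  LengthService = sample(1:25,  n, replace = TRUE)
)
# Same invented rule as before: age drives termination, with noise.
train$STATUS <- factor(ifelse(train$Age + rnorm(n, sd = 5) > 55,
                              "TERMINATED", "ACTIVE"))
logit <- glm(STATUS ~ Age + LengthService, data = train, family = binomial)
summary(logit)  # coefficients; Age should carry most of the signal here
```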


In the evaluating models step, we are able to answer our 2 original questions stated at the beginning:

Can we predict? In a word ‘yes’.

How well can we predict? In two words, 'fairly well'.

When it comes to evaluating models for predicting categories, we define accuracy as how often the model predicted the actual value. So we are interested in a number of things.

The first of these is the error matrix (also called a confusion matrix). In an error matrix you cross-tabulate the actual results with the predicted results. If prediction were a perfect 100%, every prediction would match the actual; this almost never happens. The higher the correct prediction rate and the lower the error rate, the better.
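An error matrix needs nothing beyond base R; the actual/predicted vectors below are invented for illustration:

```r
# Error (confusion) matrix in base R: cross-tabulate actual vs predicted.
actual    <- factor(c("ACTIVE", "ACTIVE", "TERMINATED",
                      "ACTIVE", "TERMINATED", "ACTIVE"))
predicted <- factor(c("ACTIVE", "TERMINATED", "TERMINATED",
                      "ACTIVE", "ACTIVE", "ACTIVE"))
cm <- table(Actual = actual, Predicted = predicted)
cm
accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
accuracy
```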

Let's take a look at the predictions of the aforementionedmodels...


Step 7: Error matrix: Decision tree "Click here"

Well that was interesting!

Summarizing the confusion matrices shows that decision trees, random forests, and adaboost all predicted similarly, BUT support vector machines and the linear models did worse on this data.

Step 8: Error matrix: Random forest "Click here"

Step 9: Error matrix: Adaboost "Click here"

Step 10: Error matrix: Support vector machine "Click here"

Step 11: Error matrix: Linear model "Click here"


Step 12: AUC decision tree "Click here"

Another way to evaluate the models is the AUC (area under the ROC curve). The higher the AUC, the better. The code below generates the information necessary to produce the graphs.
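An AUC calculation can be sketched as below, assuming the pROC package is installed (rattle's own code may use a different ROC package); the labels and scores are invented.

```r
# AUC sketch with pROC, on invented labels and scores.
library(pROC)
actual <- c(0, 0, 1, 0, 1, 1)              # 1 = TERMINATED
scores <- c(0.2, 0.4, 0.8, 0.1, 0.9, 0.3)  # model's predicted probabilities
roc_obj <- roc(actual, scores, quiet = TRUE)
auc(roc_obj)     # 1.0 is perfect; 0.5 is no better than chance
# plot(roc_obj)  # the ROC curve itself
```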


Step 13: AUC Random forests "Click here"


Step 15: AUC Support Vector Machines "Click here"


Step 16: AUC Linear Models "Click here"

A couple of things to notice:

* It turns out that the adaboost model produces the highest AUC, so we will use it to predict the 2016 terminates shortly.
* The linear model was worst.


Step 17: Deploy final prediction model "Click here"

Let’s predict the 2016 Terminates.

In real life you would take a snapshot of the data at the end of 2015 for active employees. For the purposes of this exercise we will do that, but also alter the data to make both the age and years-of-service information one year greater for 2016.
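The deployment step can be sketched as follows. A decision tree stands in for the tutorial's adaboost model so the sketch runs with base packages only, and the data is synthetic; the point is the pattern: take the actives, age them one year, and score them.

```r
# Deployment sketch: score a 2016 snapshot built from the 2015 actives
# with Age and LengthService incremented by one.
library(rpart)
set.seed(42)
n <- 200
history <- data.frame(
  Age           = sample(20:65, n, replace = TRUE),
  LengthService = sample(1:25,  n, replace = TRUE)
)
history$STATUS <- factor(ifelse(history$Age + rnorm(n, sd = 5) > 55,
                                "TERMINATED", "ACTIVE"))
model <- rpart(STATUS ~ Age + LengthService, data = history, method = "class")

actives_2015 <- subset(history, STATUS == "ACTIVE")
actives_2016 <- transform(actives_2015,            # roll the snapshot forward
                          Age = Age + 1, LengthService = LengthService + 1)
pred <- predict(model, actives_2016, type = "class")
sum(pred == "TERMINATED")  # count of predicted 2016 terminates
```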

There were 93 predicted terminates for 2016!


CONGRATS! YOU MADE IT

THE INTENT OF THIS TUTORIAL WAS:

* To once again demonstrate a People Analytics example using R.
* To demonstrate a meaningful example from the HR context.
* Not to be necessarily a best-practices example, but rather an illustrative one.
* To motivate the HR community to make much more extensive use of 'data-driven' HR decision making.
* To encourage those interested in using R in data science to delve more deeply into R's tools in this area.

From a practical perspective, if this were real data from a real organization, the onus would be on the organization to make decisions about what the data is telling them. An inability to do so is called analysis paralysis.

Again, enjoy the People Analytics journey...

If you want to learn more about doing HR analytics in R, including churn analytics, engagement analytics, data visualization, and more, check out our course on predictive analytics in the HR Analytics Academy: "CLICK HERE".