Wherever there’s data, there’s a story to be told. I was browsing Kaggle for a dataset to analyze for a webinar (you can watch it here if you’re a member of WITI), and I came across this HR Attrition dataset. I’m used to the world of credit risk and lending scores, so the prospect of analyzing employee data piqued my interest as something completely different. I decided to take the scoring concepts and analytic models I typically use to help banks make lending decisions and apply them to helping managers make decisions about their employees.
Specifically, I wanted to see if I could use employee data to predict attrition. Anytime you do analytics, it’s important to understand the decision you want to improve. In this case, I want to help HR inform managers about how likely their employees are to leave so they can take action before it’s too late. While a good manager will be in touch with each employee to proactively detect changes in behavior or events that might affect attrition, I want to use analytics to help inform managers’ discussions with their employees and suggest actions they might take to improve retention.
I’ve outlined my methods and findings below. As a note: the data used in this project is not real employee data. I got it from Kaggle, so you can check it out for yourself and join the discussion.
Investigate the Data
You’ve all heard the saying “garbage in, garbage out.” It’s just as true of data analysis. With this employee data, we care about whether the attrition performance is representative of typical employees. So before actually beginning the analysis, we must make sure the data is clean and reliable. To do so, we check the distributions and values of each field in the data.
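If you want to try this kind of check yourself, here is a minimal Python sketch of what “checking distributions and values” can look like for a single field. It is not the tool I used. The helper name `profile_column`, the column names, and the sample rows are all made up for illustration, so swap in the real records and field names from the Kaggle file.

```python
from collections import Counter

def profile_column(rows, column):
    """Summarize one column: row count, missing count, and value frequencies."""
    values = [row.get(column) for row in rows]

    def is_missing(v):
        # Treat None and blank strings as missing.
        return v is None or str(v).strip() == ""

    missing = sum(1 for v in values if is_missing(v))
    counts = Counter(v for v in values if not is_missing(v))
    return {"rows": len(values), "missing": missing, "counts": counts}

# Made-up rows for illustration only; the column names are assumptions.
sample_rows = [
    {"OverTime": "Yes", "Attrition": "Yes"},
    {"OverTime": "No", "Attrition": "No"},
    {"OverTime": "No", "Attrition": "No"},
    {"OverTime": "", "Attrition": "No"},
]

summary = profile_column(sample_rows, "OverTime")
print(summary["rows"], summary["missing"], dict(summary["counts"]))
```

Running something like this over every field is a quick way to spot blanks, unexpected codes, or a category that dominates the sample before you start modeling.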
In real life, we would know more about the data and how representative our sample might be. For instance, it is important to account for historical change, like an acquisition or opening a new office in a different location, to ensure the population you draw from in your dataset is representative of the future records you will use for decisioning.
This looks like a good, clean dataset on the surface. Let’s get started.
Once you’re sure you have a clean, robust set of data that is representative of the future, you can get to work relating data points to a “target”. In this case, the target is whether an employee stays with the company or leaves.
Binning the Variables
Binning is the process of taking the candidate predictor variables in a dataset and grouping their values into ranges so you can see how predictive each one is. With this, you can review the information value of each variable, which provides a measure of how “related” the variable is to employee attrition. The list below is ranked from most to least predictive of attrition. In my webinar, I polled the audience to check their “mental model” of these variables: over 90% thought “job satisfaction” would be the most predictive variable of employee attrition. Take a look at the image below to see how that mental model compares to the actual results.
I used FICO® Analytics Workbench™ to bin the data; that’s a peek at my screen. We see that the results do not match what most people expected. Overtime is at the top of the list, which means it is very strongly related to attrition. Job satisfaction falls near the middle, meaning it is not nearly as strongly related. So, the mental model of the audience was not correct, according to this data. This may be an indication of bias in the data, since a variable like “job satisfaction” must come from a survey, and the employees who answer surveys are likely not representative of the whole employee population.
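For readers who have not met information value before, here is a minimal Python sketch of the textbook weight-of-evidence calculation commonly used in credit scoring. This is not a description of how FICO® Analytics Workbench™ computes it. The function name, the `smoothing` parameter, and the toy records are assumptions I made up for illustration, and the numbers have nothing to do with the Kaggle results.

```python
import math
from collections import defaultdict

def information_value(bins, targets, smoothing=0.5):
    """Information value of a binned variable against a binary target.

    bins:      bin label per record (e.g. "Yes"/"No" for overtime)
    targets:   1 if the employee left, 0 if they stayed
    smoothing: small count added to every cell so an empty cell does not
               cause a division by zero (an assumption of this sketch)
    """
    left = defaultdict(float)
    stayed = defaultdict(float)
    for label, target in zip(bins, targets):
        if target == 1:
            left[label] += 1
        else:
            stayed[label] += 1

    labels = set(left) | set(stayed)
    total_left = sum(left[b] + smoothing for b in labels)
    total_stayed = sum(stayed[b] + smoothing for b in labels)

    iv = 0.0
    for b in labels:
        pct_left = (left[b] + smoothing) / total_left
        pct_stayed = (stayed[b] + smoothing) / total_stayed
        woe = math.log(pct_stayed / pct_left)  # weight of evidence for this bin
        iv += (pct_stayed - pct_left) * woe
    return iv

# Toy records, invented for illustration only.
attrition = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
overtime = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "Yes"]
badge = ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"]  # unrelated field

iv_overtime = information_value(overtime, attrition)
iv_badge = information_value(badge, attrition)
print(iv_overtime > iv_badge)
```

The idea is that a variable whose bins split leavers and stayers very differently gets a high information value, while a variable whose bins look the same for both groups scores near zero. Ranking variables by this number gives a list like the one in the screenshot.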
Right away, I start thinking about the decisions I might make using this information. For instance, managers can effect change on “monthly income” or “overtime”, but cannot change “years at company” or “marital status” (unless the company opens a dating service!).
This list gives a nice overview of how each variable relates to the target, but I wanted to dig deeper. I went on to examine how each variable affects attrition and then created a single Attrition Score. More details on that in my next blog, so stay tuned.