In my last blog post, I explained how I organized the data in a Kaggle HR Attrition dataset to figure out how I could use analytics to improve employee retention. So far, I have made sure the data was clean and representative, and then binned the data to investigate how each variable relates to my target of attrition. With a ranked list of each variable’s information value in hand (see below), I moved on to create a visual representation of the attrition likelihood.
Gain Insight with Distribution Analysis
Just as I expected, working overtime is associated with a higher chance of attrition, as you can see by the large blue bar in the chart below. Stock options, however, have the opposite effect: employees with stock options are more likely to stay put. The number of companies someone has worked for gives an interesting result. Employees who have worked at 2-4 companies are at low risk of attrition, but having worked at 5 or more companies is associated with a spike in attrition.
So, what can we learn from all that? If someone is working loads of overtime, doesn’t have stock options, and on top of that has worked for 5 or more companies, they are very likely to leave the company.
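To make the idea behind the distribution chart concrete, here is a minimal sketch of how an attrition rate per bin could be computed. The records and the function name `attrition_rate_by_bin` are hypothetical and made up for illustration; they are not taken from the Kaggle dataset or from the tooling used for the charts.

```python
from collections import defaultdict

# Hypothetical toy records: (overtime bin, did the employee leave?).
# These values are invented for illustration, not real data.
records = [
    ("Yes", True), ("Yes", True), ("Yes", False),
    ("No", False), ("No", False), ("No", False), ("No", True),
]

def attrition_rate_by_bin(rows):
    """Return {bin_label: share of employees in that bin who left}."""
    totals = defaultdict(int)
    leavers = defaultdict(int)
    for label, left in rows:
        totals[label] += 1
        if left:
            leavers[label] += 1
    return {label: leavers[label] / totals[label] for label in totals}

rates = attrition_rate_by_bin(records)
for label, rate in sorted(rates.items()):
    print(f"overtime={label}: {rate:.0%} attrition")
```

A bin whose rate sits well above the overall rate is the kind of "large bar" described above; the same function could be run once per variable.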
Build an Attrition Score
Now that we understand how each individual variable is likely to affect attrition, we can combine the variables from our dataset in the best way possible to predict how likely an employee is to leave. Below is a visual representation of the variables that contribute to the score I created. The size of each bubble indicates how predictive the variable is, but our scoring algorithms also take into account correlation between variables, so the amount of unique predictive information is shown by the “non-overlapping” amount of each bubble.
This scorecard was built purely for illustrative purposes, but I think it makes enough sense to extrapolate throughout this analysis. Examining the scorecard, I can gain information about the relationships between variables and my attrition target. For example, monthly income is more predictive than distance from home, but more correlated with overtime.
From this scorecard, I built a single Attrition Score. Employees with a high score have a high likelihood of attrition, while those with a low score have a low likelihood. Now that we have a score, we can rank-order our employees by likelihood of attrition.
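As a rough sketch of how a scorecard turns into a single score, the example below sums the points for the bin each variable falls into and then rank-orders employees. The point values, bin labels, variable names, and employees are all hypothetical. It is also a plain additive scorecard, so unlike the scoring algorithms described above it does not adjust for correlation between variables.

```python
# Hypothetical scorecard: points per bin for each variable.
# The weights are invented for illustration only.
SCORECARD = {
    "overtime": {"Yes": 30, "No": 0},
    "stock_options": {"None": 20, "Some": 0},
    "num_companies": {"0-1": 10, "2-4": 0, "5+": 25},
}

def attrition_score(employee):
    """Sum the points of the bin that each variable falls into."""
    return sum(SCORECARD[var][employee[var]] for var in SCORECARD)

# Hypothetical employees, already binned.
employees = {
    "A": {"overtime": "Yes", "stock_options": "None", "num_companies": "5+"},
    "B": {"overtime": "No", "stock_options": "Some", "num_companies": "2-4"},
    "C": {"overtime": "Yes", "stock_options": "Some", "num_companies": "0-1"},
}

scores = {name: attrition_score(e) for name, e in employees.items()}

# Rank-order from highest to lowest likelihood of attrition.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Keeping the points in a lookup table makes it easy to see which bin of which variable contributed to a given employee's score.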
Graphing the score distribution allows us to identify which individuals are likely to attrite and draw a line to represent a score cutoff. After all that analysis, we know that employees who score high enough to fall to the right of our cutoff are likely to leave. Now what?
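Applying a cutoff can be sketched in a few lines. The scores and the cutoff value of 50 below are hypothetical; in practice the cutoff would be chosen by looking at the score distribution, as described above.

```python
# Hypothetical attrition scores per employee (invented for illustration).
example_scores = {"A": 75, "B": 0, "C": 40, "D": 55}

def flag_at_risk(scores, cutoff):
    """Return the names of employees whose score is at or above the cutoff."""
    return sorted(name for name, score in scores.items() if score >= cutoff)

CUTOFF = 50  # assumed value, picked only for this example
at_risk = flag_at_risk(example_scores, CUTOFF)
share_flagged = len(at_risk) / len(example_scores)
print(at_risk, f"{share_flagged:.0%} of employees flagged")
```

Moving the cutoff left flags more employees at the cost of more false alarms; moving it right does the opposite.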
I started this project with the intention of helping managers make decisions. So, if you’re a manager and it’s end-of-year review time, what do you do if your employee’s score indicates a high likelihood of attrition? That’s where decision trees and a retention strategy come in. Stay tuned to the FICO Community Blog for the rest of my analysis, plus an explanation of how my results can actually be built into a company’s processes.