Predictive models are tools created to support future business decisions. A predictive model is built by collecting and analyzing a set of past experiences. In business scenarios where the required set of experiences does not yet exist, the necessary information must be collected over time. Consider lenders who are launching a new product: they would like to predict which consumers will be most responsive to it. A response model cannot be created immediately, since the response behavior of the yet-to-be-solicited prospects lies in the future. To solve this problem, the lenders will collect data over time and use it to develop their response model. In other cases, performance such as response behavior is readily available from previous solicitations and can be used immediately to create a retrospective dataset for modeling purposes.
This retrospective approach requires careful dataset design in order to produce accurate modeling results. In every modeling scenario, dataset design is wholly dependent on the business decision being made, and the timing of that decision. So, it is important to understand how the model will be implemented before dataset design can begin. Figure 1 below presents the implementation of a predictive model on a timeline with respect to the present day. It demonstrates how the development performance data simulates that implementation timeframe.
FIGURE 1: Retrospective Observation Date simulates model implementation date
Implementation of the model will occur at the point on the timeline labeled TODAY. The model’s purpose is to provide a prediction for an event that has yet to occur. A set of records where the outcome is already known is required to create this model. The development timeframe slides TODAY back in time to a point labeled “Observation Date.” Known performance information is then collected from the observation date up to the current date. The known outcome can take the form of a continuous measurement, such as a dollar figure, or a 0/1 indicator depicting whether an event, such as a response, did or did not occur.
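The labeling step described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the function name `label_outcomes`, the record layout, and the choice to treat both ends of the performance window as inclusive are all assumptions made for the example.

```python
from datetime import date

def label_outcomes(response_dates, observation_date, performance_end):
    """Assign a 0/1 outcome to each solicited record.

    response_dates maps a record id to the date the consumer responded,
    or None if no response was recorded (hypothetical layout).
    A record is labeled 1 only if its response falls inside the
    performance window [observation_date, performance_end].
    """
    labels = {}
    for record_id, responded_on in response_dates.items():
        in_window = (
            responded_on is not None
            and observation_date <= responded_on <= performance_end
        )
        labels[record_id] = 1 if in_window else 0
    return labels

# Illustrative records only; the ids and dates are made up.
example = label_outcomes(
    {"A": date(2020, 2, 10), "B": None, "C": date(2020, 9, 1)},
    observation_date=date(2020, 1, 1),
    performance_end=date(2020, 6, 30),
)
```

Record C responded, but only after the performance window closed, so under these assumptions it is labeled 0 alongside the non-responder B.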
Now, consider the same timelines with respect to the predictive information as used for model development. This is shown in Figure 2 below.
FIGURE 2: Information available at model implementation time dictates candidate model predictors
As the timeline slides back to capture the known outcome, the timeframe for candidate predictors slides back with it. Here too, model implementation dictates what predictive information is appropriate for inclusion in the model. At the point of implementing the model and making a prediction for an event that has yet to occur, no information exists past today. This places a necessary constraint on the candidate predictors included in the model development dataset: they cannot contain any information that could only be captured after the observation date. If the predictive model relied on factors that occurred after this date, implementing it would require the ability to see into the future and pull that information backward in time.
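One way to enforce this constraint mechanically is to record a capture date for every candidate predictor and compare it with the observation date. The sketch below assumes such metadata exists; the function name `predictors_after_observation` and the predictor names are hypothetical.

```python
from datetime import date

def predictors_after_observation(capture_dates, observation_date):
    """Return the names of predictors captured after the observation date.

    capture_dates maps a predictor name to the date its value was
    captured (hypothetical metadata the analyst would have to supply).
    """
    return sorted(
        name
        for name, captured_on in capture_dates.items()
        if captured_on > observation_date
    )

# Illustrative metadata only; names and dates are made up.
suspects = predictors_after_observation(
    {
        "bureau_score": date(2019, 12, 31),
        "new_trades_3m": date(2020, 3, 31),
    },
    observation_date=date(2020, 1, 1),
)
```

Any predictor returned by this check would need to be re-sourced as of the observation date or dropped from the candidate list.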
Performance leakage occurs when future data is pulled into the predictive model, thereby violating the above constraint. For projects involving a binary outcome, leakage can often be spotted during model development by examining the predictors’ Information Values in descending order. There is no absolute threshold to look for, but analysts should investigate for timing issues if they see predictors with very high values that stand out as clear outliers compared to the other predictors.
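As a rough sketch of this screening step, the code below computes Information Value from binned event and non-event counts using the usual formula, the sum over bins of (share of events minus share of non-events) times the log of their ratio, and then ranks predictors in descending order. Since no absolute threshold exists, the ratio-to-median rule in `flag_iv_outliers` is purely an illustrative assumption, as are the function names.

```python
import math
from statistics import median

def information_value(bins):
    """Information Value from per-bin (event, non_event) counts.

    Assumes every bin has at least one event and one non-event;
    empty cells would need smoothing or re-binning first.
    """
    total_events = sum(e for e, _ in bins)
    total_non_events = sum(n for _, n in bins)
    iv = 0.0
    for events, non_events in bins:
        if events == 0 or non_events == 0:
            raise ValueError("each bin needs nonzero event and non-event counts")
        p_event = events / total_events
        p_non_event = non_events / total_non_events
        iv += (p_event - p_non_event) * math.log(p_event / p_non_event)
    return iv

def flag_iv_outliers(iv_by_predictor, ratio=5.0):
    """Rank predictors by IV (descending) and flag relative outliers.

    The ratio-to-median rule is an arbitrary illustrative heuristic,
    not a standard threshold.
    """
    ranked = sorted(iv_by_predictor.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = ratio * median(iv_by_predictor.values())
    return [(name, iv, iv > cutoff) for name, iv in ranked]
```

A flagged predictor is not proof of leakage; it is a prompt to check the timing of that predictor's source data against the observation date.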
Analysts should be especially aware of performance leakage with model development projects that rely on archived bureau data as a source for predictors. If the date of the archive is not correctly aligned with the observation date, it could prove disastrous for the project. Consider an example where the performance being predicted is whether a consumer did or did not accept an offer of credit. A retrospective dataset can be assembled by establishing an observation date, and flagging all solicited records with a 1 or 0 according to whether they did or did not accept the offer. This set of records can then be matched to a bureau archive, and a model can be developed using credit bureau attributes and scores as predictors for future acceptance.
Consider what the attribute “Number of new tradelines established in the last 3 months” would look like if the archive were pulled after the observation date.
FIGURE 3: Predictive attributes coming from credit bureau archive past observation date
When the Weight of Evidence pattern is examined for this attribute, it will show an artificial negative spike for the bin “0 new trades established,” and inflated positive values for the bins capturing 1 or more new trades. This is because, for some portion of the “accept” records, the newly established trade is being included in the recent trade counter. The interpretation of the WoE pattern will be to falsely conclude that the best candidates for accepting the offer are those who recently established new credit with someone else.
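The distortion can be reproduced with a small numeric sketch. The counts below are invented for illustration, and the sign convention (log of the share of accepts over the share of non-accepts, so positive values mean accepts are over-represented) is an assumption chosen to match the pattern described above; some practitioners use the opposite sign.

```python
import math

def weight_of_evidence(bins):
    """WoE per bin from {bin_label: (accepts, non_accepts)} counts.

    Sign convention assumed here: ln(share of accepts / share of
    non-accepts), so positive values mean accepts are over-represented.
    """
    total_accepts = sum(a for a, _ in bins.values())
    total_non_accepts = sum(n for _, n in bins.values())
    return {
        label: math.log((a / total_accepts) / (n / total_non_accepts))
        for label, (a, n) in bins.items()
    }

# Made-up counts. With a correctly aligned archive, the attribute
# carries no signal in this toy example.
aligned = {"0 new trades": (700, 7000), "1+ new trades": (300, 3000)}
# With a late archive, 300 accepts move into the 1+ bin because the
# account opened by accepting the offer is itself counted.
late = {"0 new trades": (400, 7000), "1+ new trades": (600, 3000)}

woe_aligned = weight_of_evidence(aligned)
woe_late = weight_of_evidence(late)
```

In this toy setup the aligned archive gives a WoE of zero in both bins, while the late archive produces a negative value for “0 new trades” and a positive value for “1+ new trades” even though nothing about the consumers has changed.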
If this timing error is not caught, this seemingly miraculous predictor (with its unusually high Information Value) will almost certainly make it into the scorecard, where it will inappropriately dominate the overall predictive strength of the model. Even if the error is caught, it may not be straightforward to correct. It may be difficult to identify which records were affected and which were not, since the fix would require knowing whether each consumer’s response occurred before or after the archive date. In a case like this, “disastrous” is not an overstatement: performance leakage can lead to false predictions based on wrongly interpreted data.
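Where response dates do happen to be available, the affected records can at least be narrowed down. The sketch below is a simplified assumption-laden illustration: it treats any acceptance dated on or before the archive date as possibly contaminated and ignores bureau reporting lag. The function name and record layout are hypothetical.

```python
from datetime import date

def split_by_archive_timing(response_dates, archive_date):
    """Separate records by whether an acceptance preceded the archive.

    response_dates maps record id -> response date, or None for
    non-accepts (hypothetical layout). Accepts dated on or before the
    archive date may have the new account reflected in their bureau
    attributes; all other records are simply not flagged.
    """
    possibly_contaminated, not_flagged = [], []
    for record_id, responded_on in response_dates.items():
        if responded_on is not None and responded_on <= archive_date:
            possibly_contaminated.append(record_id)
        else:
            not_flagged.append(record_id)
    return possibly_contaminated, not_flagged

# Illustrative records only; the ids and dates are made up.
flagged, rest = split_by_archive_timing(
    {"A": date(2020, 2, 10), "B": None, "C": date(2020, 5, 1)},
    archive_date=date(2020, 3, 31),
)
```

Records that are not flagged still carry attributes drawn from a misaligned archive, so this split only isolates the target-related contamination; re-pulling the archive as of the observation date remains the cleaner fix where it is possible.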
Many other examples of performance leakage can be cited, but they all result from the same kind of timing misalignment. In many cases, it is worth spending as much time on the design of the dataset as on the development of the model itself. The well-known adage “garbage in, garbage out” exists for good reason, especially with respect to data-informed prediction.