Machine learning is the science of using principled, well-defined techniques to learn from data. While it is obvious that data enables learning, it is often underappreciated that the usefulness of data is capped by its richness and quality. Poor-quality data too easily leads to bad models, to overfitting, and ultimately to poor model performance down the line. Beyond the turnaround cost of correcting such models, this can disrupt the underlying business operations.
In machine learning, it is imperative to have supporting evidence that is unambiguous and rich enough to justify the conclusions drawn from it. With this in mind, two points are worth noting:
- Learning from randomness can be detrimental to model performance.
- An attempt to learn beyond the limits of your data is just as bad.
A target is influenced by a multitude of factors, many of which are observable and explainable. Other factors, typically unobservable ones, bring variability to the data; these random signals are referred to as stochastic noise. Nothing useful can be deduced from them, no matter how accommodating the learning capacity may be. The noisier the data, the less one can infer from it; requirements for data resources therefore go up with increasing noise levels.
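The cost of stochastic noise can be illustrated with a small simulation (a sketch with made-up numbers, not from the article): for the same sample size, the noisier the data, the more the learned parameters vary from sample to sample, and the less confidence we can place in them.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_spread(noise_sd, n=50, trials=200):
    """Repeatedly fit y = 2x + noise and return the spread (std)
    of the estimated slopes across resamples."""
    slopes = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = 2.0 * x + rng.normal(0, noise_sd, n)
        slope, _intercept = np.polyfit(x, y, 1)
        slopes.append(slope)
    return float(np.std(slopes))

# More stochastic noise -> noticeably more variable estimates
# from the same amount of data.
print(slope_spread(0.1))
print(slope_spread(1.0))
```

To recover the same confidence at the higher noise level, one would need correspondingly more data points.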
When the data provided is contradictory, it’s hard for the interpreter to take the correct action.
Noise need not originate at the source. In many cases the learner itself has bounds, and anything beyond that boundary is effectively noise (often referred to as deterministic noise): a model of limited capacity cannot explain the full complexity inherent in the data. An attempt to interpret deterministic noise can therefore lead to overfitting.
Overfitting occurs when a model is trained without regard for its out-of-sample performance. While most associate overfitting with randomness in the data, it is astounding how often it is self-inflicted through a needlessly complex model specification.
Too much complexity in an explanation of data can be a bad thing…especially if it cannot be understood.
In machine learning, a rule of thumb helps us understand this bound. For a dataset with N data points, the most complex function one can hope to learn with reasonable confidence has a VC dimension of roughly N/20. Model complexity is thus capped by data resources. To build accurate, meaningful and robust models, we need to acknowledge and treat randomness, and ultimately control model complexity.
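The rule of thumb is trivial to apply (a sketch; the function name and the use of parameter count as a proxy for VC dimension are illustrative):

```python
def max_vc_dimension(n_points, factor=20):
    """Rule of thumb from the text: the learnable VC dimension
    is roughly N / 20 for N data points."""
    return n_points // factor

# Many model families have VC dimension on the order of their number of
# free parameters, so 10,000 rows support a model with roughly 500
# effective parameters under this heuristic.
print(max_vc_dimension(10_000))  # -> 500
```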
Treat stochastic noise and control for model complexity with a Binning Library
Variables, whether continuous or categorical, can be transformed into mutually exclusive and exhaustive bins; each bin then acts as a predictive feature. Binning a variable is a tradeoff between accuracy and degree of confidence (precision), much like any bias-variance tradeoff.
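Mutually exclusive and exhaustive binning of a continuous variable can be sketched as follows (hypothetical variable name and cut points; this is generic pandas, not FICO Model Builder's API):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
years = pd.Series(rng.uniform(0, 30, 1_000), name="years_of_history")

# Mutually exclusive, exhaustive bins: every value falls in exactly one
# interval, and together the intervals cover the variable's full range.
edges = [0, 2, 5, 10, 20, 30]
binned = pd.cut(years, bins=edges, right=False)  # [0,2), [2,5), ...

print(binned.value_counts().sort_index())
```

Each of the resulting interval categories can then enter the model as its own feature, typically via its Weight of Evidence.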
Bias and accuracy are inversely related: lower bias means higher accuracy. Variance of performance across different data samples reflects the degree of confidence in what has been learned: higher variance indicates lower robustness, and hence lower confidence.
In binning, thicker bins bring higher robustness (lower variance, higher bias), while thinner bins do better on accuracy (lower bias, higher variance). It is therefore critical to create balanced bins, which do well on both counts. In this pursuit, FICO® Model Builder’s interactive binning library helps via its autobinner functionality, with multiple specifications and out-of-sample Weight of Evidence validation.
In Figure 1 below, we analyze the target's relationship to years of credit history, and see that applicants with shorter histories tend to default more often (shown as lower Weights of Evidence, or WoE) than applicants with longer histories. The label MV01 represents the sparse sub-population with no recorded history at the credit bureau. The Interactive Binning Library allows this bin to be grouped or weight-constrained so that it receives the same (or perhaps an even lower) weight in the model as its neighbor, label [-,2).
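To make the WoE values in Figure 1 concrete, here is one common formulation (a sketch on made-up data; column names are illustrative, and sign conventions for WoE vary across tools — this one makes lower WoE correspond to more defaults, as in the figure):

```python
import numpy as np
import pandas as pd

def weight_of_evidence(df, bin_col, target_col):
    """WoE per bin: log(share of goods in the bin / share of bads in
    the bin), where target_col is 1 = default (bad), 0 = good."""
    grouped = df.groupby(bin_col)[target_col].agg(["count", "sum"])
    bads = grouped["sum"]
    goods = grouped["count"] - grouped["sum"]
    return np.log((goods / goods.sum()) / (bads / bads.sum()))

# Toy data: shorter histories default more often -> lower (negative) WoE.
df = pd.DataFrame({
    "history_bin": ["[0,2)"] * 4 + ["[2,10)"] * 4,
    "default":     [1, 1, 1, 0,    0, 0, 0, 1],
})
print(weight_of_evidence(df, "history_bin", "default"))
```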
Figure 1: Interactive Binning Library in FICO® Model Builder
As shown in the transition from Figure 2 to Figure 3 below, binning averages out stochastic noise, simplifies learning and fortifies it with a higher degree of confidence (vis-à-vis a continuous or highly granular noisy variable). It helps tremendously in escaping the traps set by noise in the data.
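The averaging effect is easy to see in a simulation (a sketch with hypothetical numbers): within each bin, the mean of the noisy target has a standard error of roughly sigma divided by the square root of the bin count, so the underlying signal re-emerges even when individual points are dominated by noise.

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear signal buried under heavy stochastic noise.
x = rng.uniform(0, 10, 2_000)
y = 0.5 * x + rng.normal(0, 2.0, 2_000)

# Five equal-width bins; averaging within each bin suppresses the noise.
edges = np.linspace(0, 10, 6)
idx = np.digitize(x, edges[1:-1])          # bin index 0..4 per point
bin_means = np.array([y[idx == b].mean() for b in range(5)])

print(bin_means)  # rises steadily: the noise is largely averaged out
```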
After binning, it is helpful to impose constraints on the weight patterns of the bins. Typically monotonic, or sometimes U-shaped as business logic dictates, these constraints prevent models from meandering off into needlessly noisy shapes. Such meandering is highly complex and comes at the cost of a lower degree of confidence in learning.
Constraints guide pattern discovery and keep the complexity of learning in check. Figure 4 below shows how applying monotonicity constraints to the bin weights of Figure 3, which had a volatile pattern to begin with, yields a smoother, monotonic weight pattern that is likely to align better with business expectations.
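One standard way to project a volatile bin-weight pattern onto a monotone one is the pool-adjacent-violators algorithm, i.e. isotonic regression (a sketch of the general technique on hypothetical weights; this is not FICO Model Builder's implementation):

```python
import numpy as np

def pava_increasing(values, weights=None):
    """Pool Adjacent Violators: least-squares projection of a sequence
    onto the set of non-decreasing sequences."""
    v = np.asarray(values, dtype=float)
    w = np.ones_like(v) if weights is None else np.asarray(weights, float)
    merged = []  # blocks of [mean, total_weight, length]
    for val, wt in zip(v, w):
        merged.append([val, wt, 1])
        # Pool any adjacent blocks that violate monotonicity.
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, m1 = merged.pop(), merged.pop()
            tot = m1[1] + m2[1]
            mean = (m1[0] * m1[1] + m2[0] * m2[1]) / tot
            merged.append([mean, tot, m1[2] + m2[2]])
    out = []
    for mean, _, length in merged:
        out.extend([mean] * length)
    return np.array(out)

# A volatile bin-weight pattern smoothed into a monotone one.
noisy = [-1.2, -0.6, -0.9, 0.1, -0.1, 0.8]
print(pava_increasing(noisy))
```

Violating neighbors are pooled into a shared average, so the constrained pattern stays as close as possible to the unconstrained weights while respecting monotonicity; bin counts can be passed as `weights` so that well-populated bins move less.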
Sound binning and constraining make for simple, reliable and parsimonious learning. I hope it is no longer a mystery why FICO® Model Builder and the upcoming offerings in FICO® Analytic Workbench™ have such comprehensive, interactive binning libraries.
Has binning helped you make more robust models in a recent project? Want to learn more about binning in FICO’s analytic tools? Join our discussions about binning, machine learning, and related topics in the Analytics Community.