

Binning provides the ability to address possible operational and regulatory constraints, palatability issues, and the computation of reason codes, where required.

 

Binning allows for applying constraints across levels:

practical issues 1.png

Pairwise constraints can be applied to any predictor bins, including those containing special or missing values. This allows for model coefficients to adhere to the monotonically increasing or decreasing pattern that is expected, or to maintain any Weight of Evidence pattern inherent in the data. Individually selected bins can also be constrained to receive a “neutral” coefficient.
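As a minimal illustration of the monotonicity idea (a hedged sketch, not FICO's constraint machinery; the bin ordering and coefficient values below are hypothetical), one can verify that a set of ordered bin coefficients respects an increasing or decreasing pattern before or after constraints are applied:

# Illustrative sketch only: check whether bin coefficients follow a
# monotonic pattern, as a pairwise constraint would require.
# Bin ordering and coefficient values are hypothetical.
def is_monotonic(coeffs, increasing=True):
    """Return True if the sequence never moves against the required direction (ties allowed)."""
    pairs = zip(coeffs, coeffs[1:])
    if increasing:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)

# Ordered bins for a continuous predictor; a special/missing bin would be
# handled separately, or pinned to a neutral (0) weight.
bin_coeffs = [-0.42, -0.15, 0.08, 0.31, 0.57]
print(is_monotonic(bin_coeffs, increasing=True))   # True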

 

The computation of Reason Codes helps to explain resulting model predictions at the individual observation level. Binned predictors allow for a comparison between the maximum contribution each predictor could add to the final score, and the actual contribution based on the observation’s predictor value. Predictors with the maximum difference can be cited as those most responsible for an observation’s below-average score.
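The mechanics can be sketched in a few lines of Python. This is a simplified illustration of the idea just described (maximum possible contribution minus actual contribution, with the largest gaps cited first), not the exact reason-code algorithm of any particular product; the coefficient tables and the scored observation are hypothetical.

# Hedged sketch: rank reason codes by the gap between each predictor's
# maximum possible contribution and its actual contribution.
bin_coefficients = {
    "utilization":    {"low": 0.60, "medium": 0.10, "high": -0.50},
    "months_on_book": {"<12": -0.30, "12-36": 0.05, ">36": 0.40},
}
observation_bins = {"utilization": "high", "months_on_book": "12-36"}

gaps = {}
for predictor, coeffs in bin_coefficients.items():
    max_contribution = max(coeffs.values())
    actual = coeffs[observation_bins[predictor]]
    gaps[predictor] = max_contribution - actual

# Predictors with the largest gap are cited first as reason codes.
reason_codes = sorted(gaps, key=gaps.get, reverse=True)
print(reason_codes)  # ['utilization', 'months_on_book']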

 

Binning makes it easy to explain the model to non-modelers.

Predictor binning is a precursory step to building classed models. Once model development is complete, scorecard coefficients can be scaled and presented as integer values. The result provides an at-a-glance understanding of the relationship between each predictor and its target.
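A minimal sketch of that scaling step, assuming a simple multiply-and-round scheme (the factor of 100 and the coefficients are hypothetical; production scalings are usually chosen to hit a target score range):

# Hedged sketch: scale raw bin coefficients into integer scorecard weights.
raw_coeffs = {"low": 0.60, "medium": 0.10, "high": -0.50}
factor = 100  # hypothetical scaling factor

scaled_weights = {bin_label: round(coef * factor)
                  for bin_label, coef in raw_coeffs.items()}
print(scaled_weights)  # {'low': 60, 'medium': 10, 'high': -50}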

 

Classed models are easy to interpret and understand:

practical issues 2.png

 

The scaled Weight assigned to each bin preserves the underlying relationship between predictor and target. Above average values generally suggest a higher propensity to be a “1”, and below average values generally suggest a higher propensity to be a “0”.  The average (or neutral) weight for each predictor is assigned to the bin labeled “otherwise”, which captures observations with missing or unknown information. This makes the model highly transparent and easy to interpret.

 

Not only do weight assignments depict positive versus negative traits when compared to the neutral value, but they also show the relative magnitude of the predictive content that each variable contributes to the model. Predictors with the most extreme weight values and the widest range around the average can have the most influence on the final score.

 

Most existing deployment systems support classed models.

Classed models can be coded in a variety of programming languages, which makes them compatible with most deployment systems.

 

For more details, watch the recording of my webinar: FICO Webinar: Why Use Binned Variables in Predictive Models?

Binning allows the grouping of any outlier with its neighbors in order to minimize its impact.

data issues 1.png

For continuous predictors, an observation with an extremely high value will be treated as all other observations contained in the highest bin. Likewise, an observation with an extremely low value will be grouped into the lowest bin, and will be less likely to artificially influence model results.
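For illustration, a hedged pandas sketch with hypothetical values and bin edges shows how unbounded outer bins absorb extreme observations:

import numpy as np
import pandas as pd

# Hypothetical income values, including two extreme outliers.
income = pd.Series([28_000, 41_000, 55_000, 73_000, 2_500_000, -9_000])

# Unbounded outer edges group any extreme value with its nearest neighbors.
edges = [-np.inf, 30_000, 50_000, 80_000, np.inf]
bins = pd.cut(income, bins=edges,
              labels=["<30k", "30-50k", "50-80k", "80k+"])
print(bins.tolist())  # ['<30k', '30-50k', '50-80k', '50-80k', '80k+', '<30k']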

 

Binning extracts predictive signal from missing and/or special values.

data issues 2.png

Missing and/or special values can be placed in their own unique bin, and can be treated as any other level. There is often predictive signal associated with these two categories, indicating either a positive or negative trait in relation to the target. Binning helps incorporate this signal into the final model, rather than losing this information by assuming a neutral relationship.
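A hedged sketch of this idea with made-up data: give missing values their own bin and compute Weight of Evidence for every bin, including the missing one.

import numpy as np
import pandas as pd

# Hypothetical predictor with missing values, and a binary target.
df = pd.DataFrame({
    "x": [5, 12, np.nan, 30, np.nan, 45, 8, np.nan, 22, 50],
    "y": [0,  0,  1,      1,  1,      0,  1, 0,      1,  1],
})

# Bin the predictor and place missing values in their own bin.
df["bin"] = pd.cut(df["x"], bins=[0, 10, 25, 60]).cat.add_categories("missing")
df["bin"] = df["bin"].fillna("missing")

# Weight of Evidence per bin: ln( share of 1s / share of 0s ).
goods = df[df["y"] == 1].groupby("bin", observed=False).size()
bads  = df[df["y"] == 0].groupby("bin", observed=False).size()
woe = np.log((goods / goods.sum()) / (bads / bads.sum()))
print(woe)  # the "missing" bin gets its own WoE, just like any other level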

 

For more details, watch the recording of my webinar: FICO Webinar: Why Use Binned Variables in Predictive Models?

Binning provides a unifying framework for categorical and continuous predictors, as well as binary and continuous targets.

 

The binning process supports both continuous and categorical predictors. Continuous predictors can be put through an auto-binning algorithm that returns bin breaks optimized to a specific target. Unique values of categorical predictors can remain in their own individual bins, or can be combined into a coarser binning. In any case, the precursory steps for model development remain consistent across predictors, and also across projects.
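As a hedged illustration of this unified workflow (simple quantile breaks rather than the target-optimized auto-binning described above; the field names and values are hypothetical):

import pandas as pd

# Hypothetical predictors.
income = pd.Series([18_000, 32_000, 47_000, 51_000, 64_000, 89_000, 120_000])
region = pd.Series(["NE", "SE", "SE", "MW", "W", "W", "NE"])

# Continuous predictor: quantile breaks as a simple starting point
# (a true auto-binning would optimize the breaks against the target).
income_bins = pd.qcut(income, q=4)

# Categorical predictor: keep unique values as bins, or coarsen them.
coarse_map = {"NE": "East", "SE": "East", "MW": "Central", "W": "West"}
region_bins = region.map(coarse_map)

print(income_bins.value_counts().sort_index())
print(region_bins.value_counts())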

 

Binning provides generalization in terms of both predictors and targets:

process issues 1.png

 

Note that, for continuous targets, bin-level predictive assessment is based on Normalized Mean, and variable-level assessment is based on R2.

 

Binning supports predictive content measures that are invariant to the population odds (binary target) or to the population mean (continuous target).

 

Weight of Evidence derives its numeric value from the distribution of observations within each principal set.  As shown below, even when population odds are multiplied by a factor of 10, the relationship between predictor and binary target remains unchanged.
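In symbols, using a standard formulation consistent with that description, let g_k and b_k be the counts of 1s and 0s in bin k, and G and B the corresponding population totals:

$$\mathrm{WoE}_k = \ln\!\left(\frac{g_k / G}{b_k / B}\right)$$

Multiplying the population odds by a factor, for example scaling every count of 0s by 10, scales both b_k and B by that same factor, which cancels in the ratio b_k/B, so each bin's Weight of Evidence is unchanged.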

 

Weight of Evidence provides for normalization:

process issues 2.png
By its definition, the same holds true for Information Value.
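For reference, that definition (again in a standard formulation) sums each bin's Weight of Evidence weighted by the difference between the two distributions, so the same cancellation applies:

$$\mathrm{IV} = \sum_{k}\left(\frac{g_k}{G}-\frac{b_k}{B}\right)\ln\!\left(\frac{g_k / G}{b_k / B}\right)$$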

 

Information Value provides for normalization:

process issues 3.png

 

This normalization provides a consistent basis for making comparisons. The variable-level Information Value can be used to compare the predictive strength of variables within a project, and also across projects. Predictors with higher Information Values have greater predictive strength than those with lower values:

process issues 4.png

 

Similarly, for continuous targets, Normalized Mean and R2 measures are both invariant to the population mean. Predictors with higher R2 values have greater predictive strength than those with lower values. This provides projects based on a continuous target with a consistent basis for making comparisons as well.

 

For more details, watch the recording of my webinar: FICO Webinar: Why Use Binned Variables in Predictive Models?

Binning helps visualize the relationship between a predictor and the target variable.

An x-y scatter plot is helpful for depicting the relationship between a continuous predictor and its continuous target, but loses its effectiveness when the target is binary. As seen below, plotting a predictor against a binary target is inconclusive:

analytic issues 1.png

 

This issue can be addressed by introducing binning (or classing), which provides insight through Weight of Evidence:

analytic issues 2.png

The numeric and graphical interpretation of Weight of Evidence is as follows:

  • 0 :  level k is a neutral indicator
  • + :  level k is a positive indicator   (higher propensity to be a “1”)
  • -  :  level k is a negative indicator  (higher propensity to be a “0”)

 

Binning helps capture non-linear relationships between predictors and the target variable.

A piece-wise fitting methodology eliminates the need to transform predictor values in order to force a linear relationship between predictor and target. As a result, binning allows for understanding and preserving non-linear relationships:

 

analytic issues 3.png

 

A regression model would produce the linear fit depicted by the dotted line above.  But binning would produce the piece-wise constant fit depicted by the solid yellow line, which is closer to the underlying relationship observed in the data. In addition, binning ensures that the resulting model coefficients remain in the context of the original, non-transformed predictor values, making the model more transparent and easier to interpret. 
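A small numerical sketch of that contrast, using hypothetical data (a noisy sine curve standing in for a non-linear relationship): fit a single straight line, then compare it with per-bin means.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-linear relationship between predictor x and target y.
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.2, size=200)

# Linear regression fit: one slope and intercept for the whole range.
slope, intercept = np.polyfit(x, y, deg=1)

# Piece-wise constant fit: the mean of y within each bin of x.
edges = np.linspace(0, 10, 6)
bin_index = np.digitize(x, edges[1:-1])
bin_means = [y[bin_index == k].mean() for k in range(len(edges) - 1)]

print(round(slope, 3), round(intercept, 3))
print([round(m, 2) for m in bin_means])  # tracks the curve bin by bin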

 

Consider an example where a large retailer needs to build a model to predict which consumers are most likely to purchase diapers in one of their stores. Their objective is to offer a coupon for diapers with the goal of bringing new parents into the store to purchase additional baby products as well. To execute this promotion most effectively, the retailer needs to control the expense of sending the coupon by targeting an appropriate audience; they can perform a test mailing and plot response rates by various demographics, in order to get the required data.

 

Example results of the test mailing performed to measure response rates by demographics:

 

analytic issues 4.png

 

The expected heightened response to the diaper coupon is seen in the 25-44 age range, and also in the 55-64 age range. If the retailer were using a linear regression model, Age would need to be transformed to smooth out this bi-modal pattern in response, which would disguise the actual relationship between Age and response. But using a classed model that bins the values for Age, the resulting coefficient pattern tells a valuable story regarding who is buying diapers: the parents…AND the grandparents.

 

For more details, watch the recording of my webinar: FICO Webinar: Why Use Binned Variables in Predictive Models?

Binning is the process of creating mutually exclusive, collectively exhaustive categories for the values of each candidate model predictor. This includes a category, or bin, reserved for capturing missing information for each predictor. Classed models (such as Scorecards) calculate an optimized coefficient (β) for each model predictor bin, in addition to an intercept term (β0). The resulting model prediction for any “scored” observation is calculated by summing the appropriate coefficients, as determined by the predictor values for that observation. This produces an effective model that is highly transparent and easy to interpret.

 

Below is an example of a discrete additive model with a single binned predictor (X), transformed into indicator (dummy) variables for each user-defined range, or bin:

 

why bin 1.png

Here’s a more formal equation for scoring observation i, using a model with J predictors, and Kj bins per predictor j:

why bin 2.png
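For reference, a consistent way to write that scoring equation in symbols (matching the notation in the text, with x_ijk an indicator that observation i falls into bin k of predictor j, and β_jk the corresponding bin coefficient) is:

$$\hat{y}_i = \beta_0 + \sum_{j=1}^{J}\sum_{k=1}^{K_j} \beta_{jk}\, x_{ijk}$$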

There are several types of modeling issues that can be addressed by predictor binning, including:

  1. Analytic issues
  2. Process issues
  3. Data issues
  4. Practical issues

 

Each issue uniquely benefits from binning. I will explain each in depth in a series of corresponding posts. If you have any specific questions you'd like to see addressed, please comment here.

 

For more details, watch the recording of my webinar: FICO Webinar: Why Use Binned Variables in Predictive Models?

In Analytics Workbench 1.0.1, we added a custom button to the Zeppelin toolbar that will help you in three ways:

  1. End your notebook session
  2. Restart your notebook session
  3. Test your notebook for common errors

 

Before we jump into how it will help, let me just show you where the button is and what it does. It's at the right side of the toolbar, and looks like a recycle icon:

 

It's a new and obvious way to restart the back-end processes connected to your notebook. Previously, your only means of doing the same was to restart or rebind individual interpreters (which can still be useful when you need that finer level of control).

 

When clicked, the button shares a little more detail on what it's about to do, and asks you to confirm:

 

 

So when and where is this useful? I regularly use it in three different situations:

 

  1. Ending your notebook session.
    Suppose you've been working all day on your data science project, and you want to give everything a rest, so you can start fresh again in the morning (or maybe next week). It's a great idea to terminate your notebook session, and release all the processes and memory on the server and Spark cluster. It's a neighborly thing to do, for sure.

  2. Restarting your notebook session.
    If ever you find yourself in a weird state (maybe some interpreters are throwing ConnectException errors, or you've lost track of which variables you have or haven't set), this simple button will basically reset "everything". Your Python or Scala or R sessions are terminated and all their variables are flushed away, but your code is still there to cleanly recreate them in a fresh new session.

  3. Testing your notebook for common errors.
    Interactive notebooks are great tools, but are easily susceptible to a subtle coding error that will haunt you — or someone else — later. A short story might help:

    Suppose you make a new notebook, and create a data frame using everyone's favorite name, df. As you code deeper and deeper into the data problem, you realize you're going to need df1, df2, df3, and so on. So you go back and rename that first one df1, and try to remember to correct it everywhere, but you miss just one or two places. Because you've already run the paragraph that created it, that original df object is still around, and any code left expecting it will keep running, without any overt errors. But if you ran the notebook from a blank slate, there would be no such df variable, leading potentially to all kinds of hard-to-debug behaviors or errors.

    You can imagine how later, perhaps much later, this problem can have you pulling out your hair in search of a cause. ("Well, it ran fine yesterday, and I didn't change anything!") A good practice is to test your notebook by stopping the session, and then just running all the code again from top-to-bottom, to ensure it's all still self-consistent. It's always a good idea to perform this as a flight-check before sharing your notebook.

A few more words on the side-effects of the button

In all three situations, the button behaves the same: it shuts down any running jobs from your notebook, and it relinquishes all the server-side processes and Spark cluster sessions running on behalf of your notebook. The next time you ask a paragraph to run, fresh new processes will spring back to life, on demand.

 

Rest assured, the button will not delete or reset any textual or graphical output in your notebook, and it won't delete any datasets that you've written back to Analytics Workbench or other persistent storage areas, like S3.

 

It's an unglamorous but useful little button, and I hope you find it helpful. Happy notebooking!

Raise your hand if you knew Kalman filters underlie the rocket guidance systems that were used in the moon landings. It’s true: Kalman filtering makes it possible to estimate variables recursively.

 

This algorithm uses measurements observed over time to adaptively estimate variables on the fly and in real-time; this is illustrated in the formula below.

 

Kalman Filters.png

It is critical that we maintain ‘state’, which can be viewed as the most recent iterative estimate of the variables. Rather than storing the full transaction history, we store and adaptively update these estimates to get the current ‘read’, or state, of the variables defining an entity we monitor. A profile state is continually augmented by incoming streaming data, allowing real-time adaptation of the state of an entity; this is at the heart of behavioral analytics.
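As a generic, self-contained illustration (a textbook scalar Kalman filter, not FICO's profiling implementation), each new measurement updates a stored estimate and its uncertainty, with no need to keep the measurement history:

# Minimal scalar Kalman filter sketch: the "state" is just the latest
# estimate and its variance, updated recursively with each measurement.
def kalman_update(estimate, variance, measurement,
                  process_var=1e-3, measurement_var=0.25):
    # Predict step: carry the estimate forward, inflate its uncertainty.
    variance = variance + process_var
    # Update step: blend prediction and measurement via the Kalman gain.
    gain = variance / (variance + measurement_var)
    estimate = estimate + gain * (measurement - estimate)
    variance = (1 - gain) * variance
    return estimate, variance

estimate, variance = 0.0, 1.0            # initial state
for z in [1.2, 0.9, 1.1, 1.0, 0.95]:     # streaming measurements
    estimate, variance = kalman_update(estimate, variance, z)
print(round(estimate, 3))                # converges toward ~1.0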

 

The state contains many variables and can also be applied to our terrestrial lives, not just space travel. The result is an accurate real-time estimate of variables tracking behavior, where each variable is essentially a mini-model changing in real time. This provides an understanding of the trajectory of behaviors and how these trajectories are changing; one example is the behavior of a credit card customer. These mini-models are then fed into progressively more complex machine learning algorithms to generate final scores.

mini models.png

This method allows real-time reactive understanding of customers, financial accounts, computer intrusion, and marketing propensity (to name just a few), and yes, also rocket guidance. Have you taken advantage of streaming data in any of your analytic models?

Outlier detection leverages known and typical patterns to gain insights on the unknown. To do so, it uses unsupervised analytics. This isn’t just theoretical: outlier detection machine learning is actively used in anti-money laundering efforts.

 

When executing outlier analytics, countless customers’ banking transactions continually adjust the behavioral archetypes associated with client accounts. When we plot the archetypal distributions of customers, we see that many SARs (suspicious activity reports) are outliers from normal customers along certain archetypes. Deviations from the clusters indicate abnormal behaviors to be investigated. All of this analysis can be done without a history of SAR filings; this is powerful when historical SARs are not captured, or when you want to find new money launderers rather than just replicate known SAR patterns in the data.
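As a rough, generic sketch of the idea (not the actual AML model; the archetype loadings below are simulated), one way to score outliers is to measure how far each customer's archetype loadings sit from the bulk of the population, for example with a robust z-score, and queue the largest deviations for review:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical archetype loadings: rows are customers, columns are archetypes.
loadings = rng.dirichlet(alpha=[5, 3, 2], size=1000)
loadings[::250] = [0.05, 0.05, 0.90]   # inject a few unusual customers

# Robust z-score per archetype, then take the largest deviation per customer.
median = np.median(loadings, axis=0)
mad = np.median(np.abs(loadings - median), axis=0)
z = np.abs(loadings - median) / (1.4826 * mad)
outlier_score = z.max(axis=1)

# Send the top 0.5% most anomalous customers for review.
review_queue = np.argsort(outlier_score)[-5:]
print(review_queue)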

 

Archetype Distribution.png

Here are the (impressive) results of this unsupervised analytic model:

  • ~40% of SARs detected at 0.1% review rate

 

Outlier detection in unsupervised machine learning is applicable across many different industries. The applications will grow exponentially as we find that the speed and pace of data flowing at us exceeds the framework of historical data and outcome collection that we typically see in supervised analytics. Have you used outlier detection to solve any interesting problems?

On occasion, you may see that a paragraph refuses to run, and instead gives you this rather grudging error message:

 

java.net.ConnectException: Connection refused (Connection refused)

 

Generally, we see this happen when a notebook's interpreter (usually a Scala, Pyspark or SQL interpreter) has simply fallen over. The poor thing got all tuckered out and is basically taking a nap on you.

 

The good news is that you can restart the interpreter yourself, in just a few clicks. (I'll show you how.) The bad news is that the behind-the-scenes "memory" of your notebook will be lost, and you'll likely need to re-run many of your paragraphs (perhaps all of them), from top to bottom. Rest assured, the code in your notebook is safe, and any non-temporary files or datasets you wrote out to disk are still around. But all those ephemeral things -- Pyspark variables, Scala variables, Spark context, Zeppelin context, in-memory data frames, temporary SQL tables, etc. -- need to be recreated, by rerunning your paragraphs.

 

So, we need to restart (also called "rebind") the interpreters. How do we do that? It's pretty easy.

 

UPDATE (Dec 2017): With the release of AW 1.0.1, restarting your interpreters has become much easier, because Your Zeppelin notebook now has a Stop button. But if you prefer the more precise (or more tedious) process, keep reading here.

 

First, at the top of your notebook is the Interpreter Binding UI, hidden under a little gear icon in the upper right corner:

 

That opens a slide-down tray of interpreter settings, just above the first paragraph of your notebook.

 

Along the left side, you'll see an ordered list of interpreter groups, each shown as a blue button, and each with a little recycle-like icon next to it labeled "Restart". Find the interpreter group you need to restart (i.e., the one associated with the paragraph that is refusing to run correctly, and here's a hint: it's most likely the spark interpreter group), and just hit its restart button!

 

 

Now, click Save to dismiss the Interpreter Bindings UI, and start re-running your paragraphs. You should be back in action. If not, please contact our support line so we can help you through the problem.

 

NOTE: In AW 1.0.1, we added a new notebook toolbar button to reset all the active interpreters associated with your notebook. (That means we will shut them all down, and each interpreter will automatically restart itself when you run a paragraph calling for it.)

Predictive models are tools created to help make future business decisions. A predictive model is created by collecting and analyzing a set of past experiences. In business scenarios where a set of required experiences does not yet exist, the necessary information must be collected over time. Consider lenders who are launching a new product: they would like to predict which consumers will be most responsive to it. A response model cannot be immediately created, since the response behavior for the yet-to-be solicited prospects resides in the future. To solve this problem, they will collect data over time and use it to develop their response model. In other cases, performance such as response behavior is readily available from previous solicitations, and can be immediately used to create a retrospective dataset for modeling purposes.

 

This retrospective approach requires careful dataset design in order to produce accurate modeling results. In every modeling scenario, dataset design is wholly dependent on the business decision being made, and the timing of that decision. So, it is important to understand how the model will be implemented before dataset design can begin. Figure 1 below presents the implementation of a predictive model on a timeline with respect to the present day. It demonstrates how the development performance data simulates that implementation timeframe.

 

FIGURE1.png

FIGURE 1: Retrospective Observation Date simulates model implementation date

 

Implementation of the model will occur at the point on the timeline labeled TODAY. The model’s purpose is to provide a prediction for an event that has yet to occur. A set of records where the outcome is already known is required to create this model. The development timeframe slides TODAY back in time to a point labeled “Observation Date.” Known performance information is then collected from the observation date up to the current date. The known outcome can be in the form of a continuous measurement, such as a dollar figure, or as a 0/1 indicator depicting whether an event, such as response, did or did not occur.

 

Now, consider the same timelines with respect to the predictive information as used for model development. This is shown in Figure 2 below.

 

 

FIGURE2.png

FIGURE 2: Information available at model implementation time dictates candidate model predictors

 

As time slides back in order to capture the known outcome, it also slides back in terms of the timeframe for candidate predictors. Here too, model implementation dictates what predictive information is appropriate for model inclusion. At the point of implementing the model, and making a prediction for the event that has yet to occur, no information exists past today. This puts a necessary constraint on the candidate predictors included in the model development dataset. They cannot contain any information that can only be captured after the observation date. If the predictive model relied on factors that occurred past this date, implementation of the model would require the ability to see into the future, and pull this information backward in time.

 

Performance leakage occurs when future data is pulled into the predictive model, thereby violating the above constraint. For projects involving a binary outcome, it is easy to spot during the model development process by examining the predictors’ Information Values in descending order. There is no absolute threshold to look for, but analysts should be prompted to investigate for timing issues if they see predictors with very high values, clearly presenting themselves as outliers when compared to the other predictors.
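A minimal sketch of that screening step, using hypothetical Information Values: rank the predictors by IV and flag anything that stands far apart from the rest.

# Hedged sketch: flag suspiciously strong predictors for a leakage review.
# The Information Values below are hypothetical.
information_values = {
    "new_trades_last_3_months": 2.90,   # outlier: worth a timing check
    "utilization": 0.45,
    "months_on_book": 0.31,
    "inquiries_last_6_months": 0.22,
    "num_delinquencies": 0.18,
}

ranked = sorted(information_values.items(), key=lambda kv: kv[1], reverse=True)
for name, iv in ranked:
    print(f"{name:28s} {iv:5.2f}")

# There is no absolute threshold; a simple heuristic is to compare the top IV
# against the next-strongest predictor and review anything several times larger.
top_name, top_iv = ranked[0]
if top_iv > 3 * ranked[1][1]:
    print(f"Review {top_name} for possible performance leakage.")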

 

 

 

Analysts should be especially aware of performance leakage with model development projects that rely on archived bureau data as a source for predictors. If the date of the archive is not correctly aligned with the observation date, it could prove disastrous for the project. Consider an example where the performance being predicted is whether a consumer did or did not accept an offer of credit. A retrospective dataset can be assembled by establishing an observation date, and flagging all solicited records with a 1 or 0 according to whether they did or did not accept the offer. This set of records can then be matched to a bureau archive, and a model can be developed using credit bureau attributes and scores as predictors for future acceptance.

 

Consider what the attribute “Number of new tradelines established in the last 3 months” would look like if the archive was pulled after the observation date.

 

FIGURE3.png

FIGURE 3: Predictive attributes coming from credit bureau archive past observation date

 

When the Weight of Evidence pattern is examined for this attribute, it will show an artificial negative spike for the bin “0 new trades established,” and inflated positive values for the bins capturing 1 or more new trades. This is because, for some portion of the “accept” records, the newly established trade is being included in the recent trade counter. The interpretation of the WoE pattern will be to falsely conclude that the best candidates for accepting the offer are those who recently established new credit with someone else.

 

If this timing error is not caught, this seemingly miraculous predictor (with its unusually high Information Value) will certainly make it into the scorecard, where it will inappropriately dominate the overall predictive strength of the model. Even if the error is caught, it may not be straightforward to correct. It may be difficult to identify which records were adversely affected and which were not, since the fix would require knowing the timing of each consumer’s response as either before or after the archive date. In a case like this, disastrous is not an overstatement: performance leakage can lead to false predictions based on wrongly interpreted data.

 

Many other examples of performance leakage can be cited, but they all result from the same timing misalignment. In many cases, it is worth spending as much time on the design of the dataset as will be spent on the development of the model itself. That well-known adage of “garbage in, garbage out” exists for some very good reasons, especially with respect to data-informed prediction.

 

 

One of the great features of the Zeppelin notebook in our Analytics Workbench is the ability to write code in the language best suited for any task. Python, R, SQL and Scala each have some stand-out strengths, and if we're going to do some real data science here, you'll need to know how to access the AW dataset in each of these critical languages.

 

Here's a quick tutorial, which complements the "Data Access v1.2" sample notebook you can find in this Community.

 

We'll assume that somebody has already uploaded at least one dataset into the AW data inventory. In this case, we'll focus on the "HELOC_with_State" dataset:

 

 

Whether you are a Python pro or novice, the first step in this brief journey will often be with a simple Pyspark cell, to get a handle on that data:

 

%pyspark

# get a DataFrame reader from Analytics Workbench's existing Data

data_from_AW = aw.data.read('HELOC_with_State')

 

The magic here is the aw.data.read() function, which is only available in AW's pyspark cells, and takes the dataset's name as its (case-sensitive) argument.

 

If you want to plow forward in Python, great, you're all set! Interrogate that PySpark DataFrame object all you like.

 

But if you want to go into Scala or SQL or R, you'll take one more step before you're done with Python, and that is to register the dataset (temporarily) into the Spark's SQL context:

 

# register a temporary view of data in SQL Context as "heloc"

data_from_AW.createOrReplaceTempView("heloc")

 

And now, accessing that data from the other three languages is at most a simple one-liner:

 

Language | Interpreter | Example Access
SQL      | %sql        | select * from heloc
R        | %spark.r    | heloc.df <- sql("select * from heloc")
Python   | %pyspark    | heloc_df = sqlContext.table( "heloc" )
Scala    | %spark      | val helocDF = sqlContext.table( "heloc" )

 

That was easy. Now, rest assured you can do fancier things than just ask for the SQL table. You could also apply filters and subset the columns and so forth, with typical SQL syntax, but we'll save that for another day.

 

At any time, if you'd like to check the status of Spark's SQL context and see which datasets are already available there, you can just ask for a quick table dump ("show tables") in a SQL cell:
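For reference, that SQL cell is simply:

%sql

show tables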

 

"Houston, we have a heloc!"

 

And this is just where the fun begins. To go even further, check out the Sample Notebooks available in this Community, and come back often.