Last Updated 6.13.18
1. The dataset is split roughly 50/50 between good and bad outcomes. Is this representative of the actual population?
Answer: The dataset contains a stratified random sample of records, where each record represents an applicant for a loan. For loans booked during a two-year application window, over 242,000 accounts went on to pay as negotiated over the first 12-36 months of their loan. A random sample of 5,000 of those accounts was selected to represent all “Good” accounts when constructing this dataset. Only 5,459 accounts made a seriously late payment in that same time period, and all of those accounts were included in this dataset as “Bad” accounts. The roughly 50/50 split is therefore a product of the sampling design and is not representative of the population, where “Bad” accounts make up only about 2% of booked loans.
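The arithmetic behind that answer can be sketched as follows. This is an illustration only: the answer says "over 242,000", and the sketch assumes exactly 242,000 Good accounts, so every derived figure is approximate. The variable names are made up for the example.

```python
# Counts taken from the answer above. "Over 242,000" is treated as exactly
# 242,000 here (an assumption), so the derived rates are approximate.
good_population = 242_000
bad_population = 5_459
good_sampled = 5_000
bad_sampled = 5_459  # every Bad account was kept

# Share of Bad accounts in the population versus in the dataset.
population_bad_rate = bad_population / (good_population + bad_population)
sample_bad_rate = bad_sampled / (good_sampled + bad_sampled)

# Number of population accounts that each sampled record stands for.
good_weight = good_population / good_sampled
bad_weight = bad_population / bad_sampled

print(f"population bad rate: {population_bad_rate:.2%}")
print(f"sample bad rate:     {sample_bad_rate:.2%}")
print(f"weight per Good record: {good_weight:.1f}")
print(f"weight per Bad record:  {bad_weight:.1f}")
```

Weighting each record by the figure for its class is one way to recover population-level estimates from the stratified sample.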
2. Over 500 entries in the data seem inconsistent, in having identical features but different outcomes (i.e. the entries with all -9's). Is this just noise in the data?
Answer: When a loan applicant’s credit bureau report was either not investigated or not found, all features obtained from the credit bureau report receive a special value of -9. (There are also two other special values: -8 for no usable information, and -7 for no information of that type.) The confounding of no bureau report investigated (most likely a VIP applicant) and no bureau report found (a negative trait for extending credit) is most likely responsible for the two different outcomes.
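Because -9, -8 and -7 are codes and not measurements, it usually helps to separate them from real values before analysis. The sketch below shows one possible way to do that in plain Python. The helper names and the sample records are hypothetical, and the feature values are invented for illustration.

```python
# Special values described in the answer above.
SPECIAL_VALUES = {
    -9: "no bureau report investigated or found",
    -8: "no usable information",
    -7: "no information of that type",
}


def split_special(value):
    """Return (numeric_value, special_code); exactly one of the two is None."""
    if value in SPECIAL_VALUES:
        return None, value
    return value, None


def is_missing_bureau_record(features):
    """True when the record has features and every one of them is -9."""
    return bool(features) and all(v == -9 for v in features.values())


# Hypothetical records; the values are made up for illustration.
records = [
    {"NumSatisfactoryTrades": 20, "NumTrades60Ever2DerogPubRec": -8},
    {"NumSatisfactoryTrades": -9, "NumTrades60Ever2DerogPubRec": -9},
]

for record in records:
    if is_missing_bureau_record(record):
        print("no bureau report: handle this record separately")
        continue
    for name, raw in record.items():
        value, code = split_special(raw)
        if code is None:
            print(name, "=", value)
        else:
            print(name, "is special:", SPECIAL_VALUES[code])
```

Keeping the special code alongside the cleaned value preserves the distinction between the three cases, which a single "missing" marker would lose.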
3. What is a "trade"? An "inquiry"? A "delinquency"?
Answer: These are terms that describe information contained in a consumer credit bureau report. Every credit agreement between the consumer and a lending institution is represented by a separate “line” of information called a “trade line”, which is often shortened to “trade”. An “inquiry” is also a line of information, but it captures when a lending institution has pulled a consumer’s credit bureau report in order to make a credit decision. The term “delinquency” refers to a payment received some period of time past its due date. This is typically measured in 30-day intervals, such as 60 days delinquent or 90 days delinquent.
4. Variable names like NumTrades60Ever2DerogPubRec and NumTrades90Ever2DerogPubRec are cryptic and the data dictionary is no help.
Answer: Somehow, what should be a slash “/” in the variable names got replaced with a “2”. Please disregard this cryptic typo. “NumTrades60Ever/DerogPubRec” can be decoded as follows: the number of trade lines (see #3) on a credit bureau report that record a payment received 60 days past its due date. This feature also checks all Public Records available for the consumer, and adds to this count any items considered “Derogatory”.
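For reports or plots, the slash can be restored with an explicit lookup. The sketch below is an assumption about how one might do this and is not part of the dataset's tooling. It lists only the three affected variables this FAQ names, and `display_name` is a hypothetical helper. A blanket replacement of every "2" is avoided on purpose, since a "2" elsewhere in a variable name could be a legitimate digit.

```python
# Display names for the affected variables named in this FAQ, with the slash
# restored. The raw column names in the data are left unchanged.
DISPLAY_NAMES = {
    "NumTrades60Ever2DerogPubRec": "NumTrades60Ever/DerogPubRec",
    "NumTrades90Ever2DerogPubRec": "NumTrades90Ever/DerogPubRec",
    "NumBank2NatlTradesWHighUtilization": "NumBank/NatlTradesWHighUtilization",
}


def display_name(column):
    """Return the readable name for a column, or the column itself if unlisted."""
    return DISPLAY_NAMES.get(column, column)


for column in ["NumTrades60Ever2DerogPubRec", "NumSatisfactoryTrades"]:
    print(column, "->", display_name(column))
```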
5. Can you elaborate on the following 2 features: NumSatisfactoryTrades and NumBank2NatlTradesWHighUtilization?
Answer: NumSatisfactoryTrades counts the number of credit agreements on a consumer credit bureau report with on-time payments (satisfactory). In NumBank2NatlTradesWHighUtilization, the “2” should again be a slash “/”, so the intended name is NumBank/NatlTradesWHighUtilization. This feature counts the number of credit cards on a consumer credit bureau report carrying a balance that is at 75% of its limit or greater. The ratio of balance to limit is referred to as “utilization”.
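The utilization rule can be illustrated with a short sketch. The card balances and limits below are invented, the function names are hypothetical, and skipping cards with a zero or negative limit is an assumption, since the answer does not say how such cases are treated.

```python
HIGH_UTILIZATION_THRESHOLD = 0.75  # "75% of its limit or greater"


def utilization(balance, limit):
    """Ratio of balance to limit, or None when the limit is not positive."""
    if limit <= 0:
        return None  # assumption: such cards are left out of the count
    return balance / limit


def count_high_utilization(cards):
    """Count cards whose utilization is at or above the threshold."""
    count = 0
    for balance, limit in cards:
        ratio = utilization(balance, limit)
        if ratio is not None and ratio >= HIGH_UTILIZATION_THRESHOLD:
            count += 1
    return count


# Hypothetical (balance, limit) pairs for one consumer's credit cards.
cards = [(750.0, 1000.0), (200.0, 1000.0), (4900.0, 5000.0)]
print(count_high_utilization(cards))
```

Here the first and third cards are at 75% and 98% utilization, so both count toward the feature, while the second card at 20% does not.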