This notebook reads the output of the cleanup stage, massages the data, and writes two pickled files for analysis:
- analysis-predictors.bz2: each row is a loan and the columns are the features available to investors
- analysis-target.bz2: each row is a loan and the value is the annualized (APY) return of the loan
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# Read in the raw data provided by Lending Club
loan = pd.read_csv("LoanStats.csv")
# Overview of all the predictors provided by the Lending Club
# Unfortunately many of these are NaN, particularly those indicators that might help us
# with an analysis of any potential discrimination
# Descriptions related to these terms are provided in a separate Excel file
for column in loan.columns:
    # print("%-40s" % column, loan[column].dtype)
    print(column)
loan.isna().sum()
import re
def subgrade_score(s):
    # Map sub-grades 'A1'..'G5' to a numeric score, e.g. 'A1' -> 0.0, 'B3' -> 1.4
    return (ord(s[0]) - ord('A')) + (int(s[1]) - 1)/5
def extract_number(x):
    # Return the first run of digits in x as an int, or NaN if none is found
    try:
        return int(re.search(r"\d+", str(x)).group(0))
    except AttributeError:
        return np.nan
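As a quick sanity check, the two helpers behave as follows on a few hypothetical inputs (the values below are illustrative, not from the dataset):

```python
import re
import numpy as np

def subgrade_score(s):
    # 'A1' -> 0.0, 'B3' -> 1 + 2/5 = 1.4
    return (ord(s[0]) - ord('A')) + (int(s[1]) - 1)/5

def extract_number(x):
    # First run of digits as an int, NaN if none
    try:
        return int(re.search(r"\d+", str(x)).group(0))
    except AttributeError:
        return np.nan

assert subgrade_score("A1") == 0
assert abs(subgrade_score("B3") - 1.4) < 1e-9
assert extract_number("10+ years") == 10
assert np.isnan(extract_number("n/a"))
```

This is why `extract_number` works on columns like `emp_length` ("10+ years") and `term` (" 36 months") alike.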
# Clean the data related to the borrower's ZipCode
loan["zip_code_prefix"] = [extract_number(s) for s in loan.zip_code.astype(str)]
# Fix the "as of" date (the dataset snapshot) in order to calculate loan lengths
now = pd.to_datetime("Aug-2018")
loan["earliest_cr_line_date"] = pd.to_datetime(loan.earliest_cr_line,
                                               format="%b-%Y")
loan.earliest_cr_line_date.head()
# Overall credit history length is then today's date less the date of the
# first available information on the person's credit history
loan["credit_history"] = now - loan.earliest_cr_line_date
loan["credit_history_years"] = loan.credit_history.dt.days.fillna(0) / 365.25
# Check that there are no 'NAs' in the credit history data column
loan.credit_history_years.isna().sum()
# Clean the length of employment predictor/column and check for 'NAs'
loan["emp_length_years"] = loan.emp_length.apply(extract_number)
loan["emp_length_years"].isna().sum()
# Cleanse the data related to the loan inquiries made within the last 6 months
# This is a typical credit worthiness metric
# The more frequently someone has asked for a loan, the less creditworthy they are
loan.inq_last_6mths = loan.inq_last_6mths.fillna(0)
# Clean up the column/predictor for the months since the last delinquent payment
# This is part of the credit history of the borrower
loan.mths_since_last_delinq = loan.mths_since_last_delinq.fillna(0)
loan.mths_since_last_record = loan.mths_since_last_record.fillna(0)
# Clean the data related to the employment history of the borrower
m = np.mean(loan.emp_length_years)
loan.emp_length_years = loan.emp_length_years.fillna(m)
loan.emp_length_years.isna().sum()
# Impute the missing data for the annual income of the borrower using the average
loan.annual_inc = loan.annual_inc.fillna(np.mean(loan.annual_inc))
loan.dti = loan.dti.fillna(np.mean(loan.dti))
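Mean imputation, as used above for `annual_inc` and `dti`, can be illustrated on a toy Series (the numbers are hypothetical):

```python
import pandas as pd
import numpy as np

# Toy incomes with one missing value
inc = pd.Series([40000.0, np.nan, 60000.0])

# Series.mean() skips NaN, so the missing value is
# replaced by the mean of the observed values
filled = inc.fillna(inc.mean())

assert filled.isna().sum() == 0
assert filled[1] == 50000.0
```

A caveat worth keeping in mind: mean imputation preserves the column mean but shrinks its variance, which can matter for downstream models.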
for g in sorted(loan.sub_grade.unique()):
print(g, subgrade_score(g))
loan.int_rate = loan.int_rate.str[:-1].astype(float)
loan["credit_score"] = loan.sub_grade.apply(subgrade_score)
loan.issue_d.min()
# Convert the loan issue date to a datetime to enable date arithmetic
loan["issue_date"] = pd.to_datetime(loan.issue_d, format="%b-%Y")
loan.earliest_cr_line_date = pd.to_datetime(loan.earliest_cr_line, format="%b-%Y").fillna(loan.issue_date)
pd.to_datetime(['Apr-2008'], format="%b-%Y")
# Create a predictor/column with the year the loan was issued
loan["issue_date_year"] = loan.issue_date.dt.year
# Create a predictor/column with the date of the last payment
loan["last_pymnt_date"] = pd.to_datetime(loan.last_pymnt_d, format="%b-%Y")
# Create a predictor/column with the actual loan length
# This is not the same as the planned loan length
loan["loan_length"] = loan["last_pymnt_date"] - loan["issue_date"]
# Keep the loan length as a timedelta; remaining NaNs are handled below
loan["loan_length_term"] = loan.loan_length
# Clean the data for the predictor/column loan length in days
loan["loan_length_days"] = loan.loan_length_term.dt.days.fillna(1)
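The date arithmetic above (subtracting two datetime columns and extracting days) can be sketched on toy dates; the values here are hypothetical, not from the dataset:

```python
import pandas as pd

# A loan issued in Jan-2015 with its last payment in Jan-2016
issue = pd.to_datetime(pd.Series(["Jan-2015"]), format="%b-%Y")
last = pd.to_datetime(pd.Series(["Jan-2016"]), format="%b-%Y")

# Subtracting datetime Series yields a timedelta Series;
# .dt.days converts it to integer days
length = last - issue
days = length.dt.days

assert days.iloc[0] == 365
```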
# Check that the dates have been properly changed
loan[["last_pymnt_date", "issue_date", "loan_length","grade","term","loan_status"]].head(-100)
loan.issue_date.describe()
# New column/predictor percentage of the revolving credit utilized
loan["revol_util_perc"] = loan.revol_util.astype(str).apply(extract_number)
loan["revol_util_perc"] = loan["revol_util_perc"].fillna(np.mean(loan.revol_util_perc))
# Calculate the length of the loan
loan["term_months"] = [extract_number(s) for s in loan.term]
loan["term_days"] = pd.to_timedelta(loan.term_months*30, unit='D')
# Calculate the annualized return (APY) of each loan
loan["apy"] = (loan.total_pymnt / loan.loan_amnt) ** (365.25/(loan.loan_length_days)) - 1
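To see why this formula gives an annualized return, consider a hypothetical loan of $10,000 that returns $11,000 in total over exactly one average year (365.25 days); the annualized return should be 10%:

```python
# Hypothetical loan, not from the dataset
loan_amnt = 10_000.0
total_pymnt = 11_000.0
loan_length_days = 365.25

# (gross return) ** (years ** -1) - 1, i.e. compound the
# per-loan return up (or down) to a one-year horizon
apy = (total_pymnt / loan_amnt) ** (365.25 / loan_length_days) - 1

assert abs(apy - 0.10) < 1e-12
```

The same loan repaid over two years would give an exponent of 0.5 and an APY of sqrt(1.1) - 1, roughly 4.9%, which is why very short loan lengths blow the exponent up and make APY unreliable.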
# APY is unreliable for very short loans, so before filtering we inspect the
# remaining extreme values (annualized return above 100%) among longer loans
loan.query("loan_length_days > 60 and apy > 1")[["funded_amnt", "total_pymnt", "loan_length_days", "apy", "loan_status"]]
loan.credit_history.head()
# Completed loans are those which are either charged off or fully paid
complete = loan.query("(loan_status=='Charged Off' or loan_status=='Fully Paid') and loan_length_days > 60 and apy < 1").copy()
complete["good_loan"] = complete.loan_status == 'Fully Paid'
# Overview of the loans that have been completed: charged off or fully repaid
complete.describe()
loan.head(30)
loan.isna().sum()
We are only concerned with loans that are complete: either fully paid or charged off. Loans that are still ongoing are also interesting to look at, but they require their own separate analysis. Because our time is limited, we can only do one set, and completed loans seem the most promising.
In addition, a number of loans have a very short length, meaning they were taken out and repaid quickly. I'm not sure why that is, but such loans make little difference: if an investor funds a loan and it is repaid within a few days, they neither make nor lose money and must simply make a new loan. So for now, we ignore them.
public_columns = [
'loan_amnt','term_months','int_rate','credit_score',
# 'emp_title',
'emp_length_years',
'home_ownership',
'annual_inc',
'issue_date',
'verification_status',
# 'loan_status',
# 'pymnt_plan',
# 'desc',
# 'purpose',
# 'title',
# 'zip_code',
# 'addr_state',
'dti',
'delinq_2yrs',
# 'earliest_cr_line',
'credit_history_years',
'inq_last_6mths',
'mths_since_last_delinq',
'mths_since_last_record',
'open_acc',
'pub_rec',
'revol_bal',
'revol_util_perc',
'total_acc']
public_data = complete[public_columns]
public_data.isna().sum()
public_data.head()
public_data_y = complete.apy
public_data_y.describe()
Now, deal with categorical data using one-hot encoding.
categorical = ['verification_status','home_ownership']
public_data_dummy = pd.get_dummies(public_data[categorical], dummy_na=False)
public_data_cat = pd.concat([public_data, public_data_dummy], axis=1)
public_data_cat = public_data_cat.drop(categorical, axis=1)
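On a toy column, the one-hot encoding step looks like this (the category values are hypothetical stand-ins for `home_ownership`):

```python
import pandas as pd

# Toy categorical column
df = pd.DataFrame({"home_ownership": ["RENT", "OWN", "RENT"]})

# One indicator column per category; dummy_na=False means
# NaN rows would get all-zero indicators rather than their own column
dummies = pd.get_dummies(df["home_ownership"], dummy_na=False)

assert list(dummies.columns) == ["OWN", "RENT"]
assert int(dummies["RENT"].sum()) == 2
```

The same concat-then-drop pattern used above then swaps the original string column for its indicator columns.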
public_data_cat.to_pickle("analysis-predictors.bz2")
public_data_y.to_pickle("analysis-target.bz2")
complete.to_pickle("complete.bz2")