We implement supervised machine learning techniques to better understand the process of granting or denying credit to customers. This process, known as credit scoring, is a method widely used by banks: each potential customer is assigned a score between 300 and 850, with 850 being the highest score a customer can receive. The credit score is used to assess the potential risk that lending to a customer poses to the lender. It is based on the individual's credit report, which considers both numerical and categorical variables, such as the status of existing credit accounts, credit limits, and the number of existing lines of credit at the bank. Lenders ultimately use the score to decide which customers will be granted credit, and at what interest rate and credit limit.
By use case, scorecards fall into four types: anti-fraud scorecards, application scorecards, behavior scorecards, and collection scorecards.
A credit score reflects a consumer's ability to repay debt and is usually a three-digit number. There are many scoring methods, and different models may produce scores that differ by a few points, depending on the use case.
The most widely recognized credit score is the FICO score, provided by the Fair Isaac Corporation. It offers lenders more than 50 different customer credit scores for use in different scenarios. Other credit scores include the VantageScore, the Community Empower score, and scores from Experian, Equifax, and TransUnion.
Credit scores generally range from 300 to 850.
The credit scoring model is a traditional credit risk quantification model: it uses observable borrower characteristics to compute a numerical score that represents the debtor's credit risk, and assigns the borrower to a risk grade. For individual customers, the observable characteristics mainly include income, assets, age, occupation, and place of residence; for corporate customers, they include cash flow, financial ratios, and so on. Lenders use the credit score to assess lending risk and to set the specific terms and interest rate of a loan: the higher the score, the more favorable the terms. Different models weight the factors differently; the key to a credit scoring model lies in the choice of characteristic variables and the weights assigned to each (a minimal logit sketch follows the list below).
The most widely used credit scoring models today are:
- Linear Probability Model
- Logit Model
- Probit Model
- Linear Discriminant Model
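To make this concrete, here is a minimal logit-model sketch. The feature values, coefficients, and intercept are invented for illustration only, not fitted to any data:
import numpy as np
# Hypothetical borrower features: income (in 10k units), age, debt ratio
x = np.array([6.5, 35.0, 0.4])
# Hypothetical coefficients and intercept (assumptions, not fitted values)
w = np.array([-0.3, -0.02, 2.5])
b = -1.0
# Logit model: probability of default is the sigmoid of the weighted sum
z = w @ x + b
p_default = 1 / (1 + np.exp(-z))
print(round(p_default, 3))  # about 0.066 for this made-up borrower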
The FICO model is considered the most stable: it has been revised many times since 1989 to produce ever more accurate scores.
The classic FICO model produces scores between 300 and 850; scores below 600 are considered poor, while scores above 740 are considered excellent.
It mainly considers five factors (a sketch of how such weighted components might combine follows the list):
- Payment history (35%)
- Credit utilization (30%)
- Credit history (15%)
- Types of credit (10%)
- New credit (10%)
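The actual FICO formula is proprietary; the sketch below only illustrates how component subscores could combine under the published weights. The subscores (on a 0-100 scale) and the linear rescaling to the 300-850 range are assumptions for illustration:
# Hypothetical component subscores on a 0-100 scale (illustrative only)
components = {'payment_history': 90, 'credit_utilization': 70,
              'credit_history': 80, 'types_of_credit': 60, 'new_credit': 75}
weights = {'payment_history': 0.35, 'credit_utilization': 0.30,
           'credit_history': 0.15, 'types_of_credit': 0.10, 'new_credit': 0.10}
# Weighted composite, rescaled to the 300-850 range (an assumed mapping)
composite = sum(components[k] * weights[k] for k in components)
score = 300 + composite / 100 * (850 - 300)
print(round(score))  # 729 for these made-up subscores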
The VantageScore was launched in 2006 by the three major credit bureaus (Experian, Equifax, and TransUnion) acting jointly, to compete with the FICO score in the credit scoring business.
The model is composed of:
- Extremely influential: Payment History (40%). A record of late payments stays for 7 years.
- Highly influential: Age and Type of Credit (21%)
- Highly influential: Credit Utilization (20%)
- Moderately influential: Total Balances (11%)
- Less influential: Recent Behavior (5%)
- Least influential: Available Credit (3%)
Other credit scores include:
- TransRisk
- Experian's National Equivalency Score
- CreditXpert Credit Score
- CE Credit Score
- Insurance Score
- Industry-Specific Credit Scores
Although there are many credit scoring models, they essentially fall into just two classes: models based on statistical scoring analysis and models based on judgmental analysis.
A statistical scoring model takes many factors obtained from multiple institutions, correlates them, and assigns each a weight. The model includes no subjective evaluation of the individual by any credit institution.
A judgmental scoring model considers the financial condition and payment history of the individual or organization, bank references, and even the institution's own experience with that individual or organization. This kind of model relies more on subjective judgment when producing a credit score.
Although credit scoring models are one of the main methods commercial banks use to analyze borrower credit risk, several notable problems arise in practice:
- A credit scoring model is fitted to historical data rather than current market data, so it is a backward-looking model. Because historical data are updated slowly, the weights of the characteristic variables in the regression equation stay fixed for some time, so the model cannot promptly reflect changes in a company's credit condition.
- Credit scoring models place heavy demands on borrowers' historical data, and a commercial bank needs a long time to build a database covering the history of most companies. Moreover, young companies have even less history, which limits the applicability and effectiveness of the model.
- A credit scoring model can score a customer's credit risk level, but it cannot provide an accurate value for the customer's probability of default, which is often what credit risk management cares about most.
Another example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
datafile = './data/'
data_train = pd.read_csv(datafile +'cs-train.csv')
"""
数据说明
SeriousDlqin2yrs:违约客户及超过90天逾期客户,bool型;
RevolvingUtilizationOfUnsecuredLines:贷款以及信用卡可用额度与总额度比例,百分比;
age:用户年龄,整型
NumberOfTime30-59DaysPastDueNotWorse:35-59天逾期但不糟糕次数,整型;
DebtRatio:负债率,百分比;
MonthlyIncome:月收入,整型;
NumberOfOpenCreditLinesAndLoans:开放式信贷和贷款数量,开放式贷款(分期付款如汽车贷款或抵押贷款)和信贷(如信用卡)的数量,整型;
NumberOfTimes90DaysLate:90天逾期次数:借款者有90天或更高逾期的次数,整型;
NumberRealEstateLoansOrLines:不动产贷款或额度数量:抵押贷款和不动产放款包括房屋净值信贷额度,整型;
NumberOfTime60-89DaysPastDueNotWorse:60-89天逾期但不糟糕次数:借款人在在过去两年内有60-89天逾期还款但不糟糕的次数,整型;
NumberOfDependents:家属数量:不包括本人在内的家属数量,整型;
"""
# Rename the variables to shorter names
columns = {'SeriousDlqin2yrs': 'target',
           'RevolvingUtilizationOfUnsecuredLines': 'percentage',
           'NumberOfOpenCreditLinesAndLoans': 'open_loan',
           'NumberOfTimes90DaysLate': '90-',
           'NumberRealEstateLoansOrLines': 'estate_loan',
           'NumberOfTime60-89DaysPastDueNotWorse': '60-89',
           'NumberOfDependents': 'Dependents',
           'NumberOfTime30-59DaysPastDueNotWorse': '30-59'}
data_train.rename(columns=columns, inplace=True)
# Check the data for missing values and fill them
missing_df = data_train.isnull().sum(axis =0).reset_index()
missing_df
# We see that 'MonthlyIncome' and 'Dependents' have missing values
data_train.MonthlyIncome.isnull().sum()/data_train.shape[0]
data_train.Dependents.isnull().sum()/data_train.shape[0]
"""
对于缺失值处理,有很多种方法:
缺失值极多:若缺失值样本占总数比例极高,直接舍弃,因为作为特征加入反而会引入噪声值(可以使用删除近零常量的方法删除)。
非连续特征缺失值适中:如果缺值的样本适中,而该属性非连续值特征属性,就把NaN作为一个新类别,加入到类别特征中。
连续特征缺失值适中:如果缺值的样本适中,考虑给定一个step,然后离散化,将NaN作为一个type加入到属性类目中。
缺失值较少:考虑利用填充的办法进行处理。其中有均值、众数、中位数填充。
用sklearn里的RandomForest/KNN模型去拟合数据样本训练模型,然后去填充缺失值。
拉格朗日插值法。
"""
# MonthlyIncome has 29731 missing values, too large a share to simply drop, so we use a
# random forest instead: split the columns into known features and the unknown feature
# (the one with missing values), train on the known features against the observed labels,
# and predict the unknown feature.
# Distribution of MonthlyIncome before imputation
print(data_train['MonthlyIncome'].max())
print(data_train['MonthlyIncome'].min())
print(data_train['MonthlyIncome'].mean())
print(data_train['MonthlyIncome'].mode())
print(data_train['MonthlyIncome'].median())
print(data_train['MonthlyIncome'].skew())
# Predict the missing values with a random forest
from sklearn.ensemble import RandomForestRegressor
# Prediction-based imputation function
def rf_filling(df):
    # Reorder the columns so that MonthlyIncome comes first
    process_miss = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    # Split into rows where MonthlyIncome is known and rows where it is missing
    known = process_miss[process_miss.MonthlyIncome.notnull()].values
    unknown = process_miss[process_miss.MonthlyIncome.isnull()].values
    # X: the features to train on
    X = known[:, 1:]
    # y: the labels
    y = known[:, 0]
    # Train the model
    rf = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rf.fit(X, y)
    # Predict the missing values
    pred = rf.predict(unknown[:, 1:]).round(0)
    # Fill them in
    df.loc[df['MonthlyIncome'].isnull(), 'MonthlyIncome'] = pred
    return df
data_train = rf_filling(data_train)
# Distribution of MonthlyIncome after imputation, for comparison
print(data_train['MonthlyIncome'].max())
print(data_train['MonthlyIncome'].min())
print(data_train['MonthlyIncome'].mean())
print(data_train['MonthlyIncome'].mode())
print(data_train['MonthlyIncome'].median())
print(data_train['MonthlyIncome'].skew())
# Dependents has few missing values; dropping those rows will not noticeably affect the model.
# After the missing values are handled, drop duplicate rows.
data_train = data_train.dropna()
data_train = data_train.drop_duplicates()
data_train.info()
"""
异常值处理
缺失值处理完毕后,我们还需要进行异常值处理。异常值是指明显偏离大多数抽样数据的数值,比如个人客户的年龄大于100或小于0时,通常认为该值为异常值。找出样本总体中的异常值,通常采用离群值检测的方法。 离群值检测的方法有单变量离群值检测、局部离群值因子检测、基于聚类方法的离群值检测等方法。
在本数据集中,采用单变量离群值检测来判断异常值,采用箱线图。
"""
# age
sns.boxplot(data_train.age, palette='Set3', orient='v')
plt.show()
# Drop ages below 0 and above 100 as outliers
data_train = data_train[(data_train.age > 0) & (data_train.age < 100)]
# RevolvingUtilizationOfUnsecuredLines (percentage) and DebtRatio
fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot([data_train.percentage, data_train.DebtRatio])
ax.set_xticklabels(['percentage', 'DebtRatio'])
plt.show()
# Ratios above 1 are outliers; they could either be dropped or treated as missing and
# filled with the mean. Here we drop them.
data_train = data_train[(data_train.percentage < 1)]
data_train = data_train[(data_train.DebtRatio < 1)]
# Handle outliers in the three delinquency variables: 30-59, 60-89, 90-
fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot([data_train['30-59'],data_train['60-89'],data_train['90-']])
ax.set_xticklabels(['30-59','60-89','90-'])
plt.show()
# All three variables have outliers; count them for each feature
data_train[data_train['30-59']>60].shape
data_train[data_train['60-89']>90].shape
data_train[data_train['90-']>90].shape
# The outliers are few, so drop them all
data_train = data_train[data_train['30-59'] < 60]
data_train = data_train[data_train['60-89'] < 90]
data_train = data_train[data_train['90-'] < 90]
data_train = data_train.reset_index(drop=True)  # reset the index
# MonthlyIncome
sns.boxplot(data_train.MonthlyIncome, palette='Set3', orient='v')
plt.show()
data_train[data_train['MonthlyIncome'] > 50000].shape
data_train = data_train[data_train['MonthlyIncome'] < 50000]
data_train = data_train.reset_index(drop=True)  # reset the index
# EDA: first look at the proportion of good vs. bad customers
data_train.target.value_counts().plot(kind='bar')
plt.show()
t = data_train.target.value_counts()[1] / len(data_train)  # bad samples are scarce; consider SMOTE oversampling
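# A sketch of SMOTE oversampling with the imbalanced-learn package (an assumed extra
# dependency), shown commented out since the pipeline below does not rebalance:
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=0).fit_resample(
#     data_train.drop('target', axis=1), data_train['target'])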
sns.distplot(data_train.age)  # roughly normally distributed
plt.show()
sns.distplot(data_train['MonthlyIncome'])
plt.show()
sns.distplot(data_train['DebtRatio'])
plt.show()
sns.distplot(data_train['30-59'])
plt.show()
# A Box-Cox transform could be used to reduce the skewness of these distributions
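# A sketch of the Box-Cox idea: the transform needs strictly positive input, so shift
# by 1 first; the result is only inspected here, not fed into the pipeline below
from scipy import stats
bc_income, bc_lambda = stats.boxcox(data_train['MonthlyIncome'] + 1)
sns.distplot(bc_income)
plt.show()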
# Multivariate analysis
corr = data_train.corr()
corr
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(1, 1, 1)
sns.heatmap(corr, annot=True, cmap='YlGnBu', ax=ax1, annot_kws={'size': 9, 'color': 'm'})  # correlation heatmap
plt.show()
# The pairwise correlations are all very small, so we can tentatively rule out multicollinearity
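# A sketch of a more formal multicollinearity check via variance inflation factors
# (statsmodels is an assumed extra dependency; VIF > 10 is a common rule of thumb):
# from statsmodels.stats.outliers_influence import variance_inflation_factor
# feats = data_train.drop('target', axis=1)
# vifs = [variance_inflation_factor(feats.values, i) for i in range(feats.shape[1])]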
# Feature selection: WOE binning
from scipy import stats
# Monotonic binning: reduce the number of quantile bins until the bin means of X
# and Y are perfectly monotonic (|Spearman r| == 1)
def monoto_bin(Y, X, n=20):
    r = 0
    total_bad = Y.sum()
    total_good = Y.count() - total_bad
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n, duplicates='raise')})
        d2 = d1.groupby('Bucket', as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame(d2.min().X, columns=['min_' + X.name])
    d3['min_' + X.name] = d2.min().X
    d3['max_' + X.name] = d2.max().X
    d3[Y.name] = d2.sum().Y
    d3['total'] = d2.count().Y
    # d3[Y.name + '_rate'] = d2.mean().Y
    d3['badattr'] = d3[Y.name] / total_bad
    d3['goodattr'] = (d3['total'] - d3[Y.name]) / total_good
    # WOE: log ratio of the good vs. bad distribution in each bin
    d3['woe'] = np.log(d3['goodattr'] / d3['badattr'])
    # IV: overall information value of the feature
    iv = ((d3['goodattr'] - d3['badattr']) * d3['woe']).sum()
    d4 = (d3.sort_values(by='min_' + X.name)).reset_index(drop=True)
    print("=" * 80)
    # Rebuild the cut points from the quantiles of X
    cut = []
    cut.append(float('-inf'))
    for i in range(1, n + 1):
        qua = X.quantile(i / (n + 1))
        cut.append(round(qua, 4))
    cut.append(float('inf'))
    woe = list(d4['woe'].round(3))
    return d4, iv, cut, woe
dfx1, ivx1, cutx1, woex1 = monoto_bin(data_train['target'], data_train['percentage'], n=10)
dfx2, ivx2, cutx2, woex2 = monoto_bin(data_train['target'], data_train['age'], n=10)
# dfx4, ivx4, cutx4, woex4 = monoto_bin(data_train['target'], data_train['DebtRatio'], n=10)
dfx5, ivx5, cutx5, woex5 = monoto_bin(data_train['target'], data_train['MonthlyIncome'], n=10)
plt.bar(range(len(woex1)), woex1)
plt.show()
plt.bar(range(len(woex2)), woex2)  # fully monotonic; the binning works well
plt.show()
plt.bar(range(len(woex5)), woex5)
plt.show()
dfx1  # inspect the binning table for percentage
# Manual binning with user-supplied cut points
def self_bin(Y, X, bin):
    r = 0
    total_bad = Y.sum()
    total_good = Y.count() - total_bad
    d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.cut(X, bin)})
    d2 = d1.groupby('Bucket', as_index=True)
    r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
    d3 = pd.DataFrame(d2.min().X, columns=['min_' + X.name])
    d3['min_' + X.name] = d2.min().X
    d3['max_' + X.name] = d2.max().X
    d3[Y.name] = d2.sum().Y
    d3['total'] = d2.count().Y
    # d3[Y.name + '_rate'] = d2.mean().Y
    # Good/bad share per bin, used for the WOE (weight of evidence): whether and how
    # the feature carries evidence about the target
    d3['badattr'] = d3[Y.name] / total_bad
    d3['goodattr'] = (d3['total'] - d3[Y.name]) / total_good
    d3['woe'] = np.log(d3['goodattr'] / d3['badattr'])
    # IV (information value): how strongly the feature influences the target
    iv = ((d3['goodattr'] - d3['badattr']) * d3['woe']).sum()
    d4 = (d3.sort_values(by='min_' + X.name)).reset_index(drop=True)
    print("=" * 80)
    # print(d4)
    woe = list(d4['woe'].round(3))
    return d4, iv, woe
pinf = float('inf')   # positive infinity
ninf = float('-inf')  # negative infinity
cutx3 = [ninf, 0, 1, 3, 5, pinf]
cutx4 = [ninf, 0, 0.1, 0.35, pinf]
cutx6 = [ninf, 1, 2, 3, 5, pinf]
cutx7 = [ninf, 0, 1, 3, 5, pinf]
cutx8 = [ninf, 0, 1, 2, 3, pinf]
cutx9 = [ninf, 0, 1, 3, pinf]
cutx10 = [ninf, 0, 1, 2, 3, 5, pinf]
len(cutx5)  # sanity-check the cut points returned by the monotonic binning
dfx3, ivx3,woex3 = self_bin(data_train['target'],data_train['30-59'],cutx3)
dfx4, ivx4,woex4 = self_bin(data_train['target'],data_train['DebtRatio'],cutx4)
dfx6, ivx6,woex6 = self_bin(data_train['target'],data_train['open_loan'],cutx6)
dfx7, ivx7,woex7 = self_bin(data_train['target'],data_train['90-'],cutx7)
dfx8, ivx8,woex8 = self_bin(data_train['target'],data_train['estate_loan'],cutx8)
dfx9, ivx9,woex9 = self_bin(data_train['target'],data_train['60-89'],cutx9)
dfx10, ivx10,woex10 = self_bin(data_train['target'],data_train['Dependents'],cutx10)
y = [ivx1, ivx2, ivx3, ivx4, ivx5, ivx6, ivx7, ivx8, ivx9, ivx10]
index = data_train.columns.drop('target')
fig = plt.figure(figsize=(16, 8))
ax1 = fig.add_subplot(1, 1, 1)
ax1.bar(range(1, 11), y, width=0.4, color='r', alpha=0.6)  # bar chart of each feature's IV
ax1.set_xticks(range(1, 11))
ax1.set_xticklabels(index, rotation=0, fontsize=12)
ax1.set_ylabel('IV', fontsize=14)
# Add numeric labels above the bars
for i, v in enumerate(y):
    plt.text(i + 1, v + 0.01, '%.4f' % v, ha='center', va='bottom', fontsize=12)
plt.show()
"""
根据IV值判断变量预测能力的标准:
< 0.02: useless for predition
0.02-0.1: weak predictor
0.1-0.3: medium predictor
0.3-0.5: strong predictor
大于0.5: suspicious or too good to be true
"""
# Drop the low-IV variables: DebtRatio, MonthlyIncome, open_loan, estate_loan, Dependents
# Map each raw value to the WOE of its bin, scanning the cut points from the top
def change_woe(d, cut, woe):
    woe_list = []
    i = 0
    while i < len(d):
        value = d[i]
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        woe_list.append(woe[m])
        i += 1
    return woe_list
# Transform the training set to WOE values
data_train['percentage'] = pd.Series(change_woe(data_train['percentage'], cutx1, woex1))
data_train['age'] = pd.Series(change_woe(data_train['age'], cutx2, woex2))
data_train['30-59'] = pd.Series(change_woe(data_train['30-59'], cutx3, woex3))
data_train['DebtRatio'] = pd.Series(change_woe(data_train['DebtRatio'], cutx4, woex4))
data_train['MonthlyIncome'] = pd.Series(change_woe(data_train['MonthlyIncome'], cutx5, woex5))
data_train['open_loan'] = pd.Series(change_woe(data_train['open_loan'], cutx6, woex6))
data_train['90-'] = pd.Series(change_woe(data_train['90-'], cutx7, woex7))
data_train['estate_loan'] = pd.Series(change_woe(data_train['estate_loan'], cutx8, woex8))
data_train['60-89'] = pd.Series(change_woe(data_train['60-89'], cutx9, woex9))
data_train['Dependents'] = pd.Series(change_woe(data_train['Dependents'], cutx10, woex10))
# Drop the variables with little influence on the target
train_X =data_train.drop(['DebtRatio','MonthlyIncome','open_loan','estate_loan','Dependents'],axis=1)
# test_X =data_test.drop(['DebtRatio','MonthlyIncome','open_loan','estate_loan','Dependents'],axis=1)
# Model building
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
x = train_X.drop('target',axis = 1)
y = train_X['target']
train_x,test_x,train_y,test_y = train_test_split(x,y,test_size = 0.3,random_state = 0)
train = pd.concat([train_y,train_x], axis =1)
test = pd.concat([test_y,test_x], axis =1)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
lr = LogisticRegression(penalty='l1', solver='liblinear')  # liblinear supports L1 regularization
lr.fit(train_x,train_y)
# Plot the ROC curve
from sklearn.metrics import roc_curve, auc
# y_pred= lr.predict(train_x)
train_predprob = lr.predict_proba(train_x)[:,1]
test_predprob = lr.predict_proba(test_x)[:,1]
FPR,TPR,threshold =roc_curve(test_y,test_predprob)
ROC_AUC= auc(FPR,TPR)
plt.plot(FPR, TPR, 'b', label='AUC = %0.2f' % ROC_AUC)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('TPR')
plt.xlabel('FPR')
plt.show()
# Total score = base score + the component scores
# Scorecard scaling (the usual convention, assumed here): B = PDO / ln(2) with
# PDO = 20 points to double the odds, anchored at a score of 600, so A = 600 - B*ln(20)
import math
B = 20 / math.log(2)
A = 600 - B * math.log(20)
# Base score
base = round(A + B * lr.intercept_[0], 0)
base
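# Sketch: sanity-check the scaling. Under score = A - B*ln(odds), a borrower at the
# anchor odds of 1:20 scores 600, and doubling the odds costs exactly PDO = 20 points
def odds_to_score(odds):
    return A - B * math.log(odds)
print(odds_to_score(1 / 20))                           # approximately 600
print(odds_to_score(2 / 20) - odds_to_score(1 / 20))   # approximately -20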
# Convert one feature's per-bin WOE values into scorecard points
def compute_score(coe, woe, factor):
    scores = []
    for w in woe:
        score = round(coe * w * factor, 0)
        scores.append(score)
    return scores
x1_percentage = compute_score(lr.coef_[0][0], woex1, B)
x2_age = compute_score(lr.coef_[0][1], woex2, B)
x3_59 = compute_score(lr.coef_[0][2], woex3, B)  # the third coefficient belongs to '30-59', so use its WOE and cuts
x7_90 = compute_score(lr.coef_[0][3], woex7, B)
x9_60 = compute_score(lr.coef_[0][4], woex9, B)
# Map each raw value to the score of its bin, using the same cut-point scan as change_woe
def change_score(series, cut, score):
    score_list = []
    i = 0
    while i < len(series):
        value = series[i]
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        score_list.append(score[m])
        i += 1
    return score_list
# Load the test data and apply the same column renaming as the training set
test1 = pd.read_csv(datafile + 'cs-test.csv')
test1.rename(columns=columns, inplace=True)
test2 = pd.DataFrame()
test2['x1_percentage'] = pd.Series(change_score(test1['percentage'], cutx1, x1_percentage))
test2['x2_age'] = pd.Series(change_score(test1['age'], cutx2, x2_age))
test2['x3_59'] = pd.Series(change_score(test1['30-59'], cutx3, x3_59))
test2['x7_90'] = pd.Series(change_score(test1['90-'], cutx7, x7_90))
test2['x9_60'] = pd.Series(change_score(test1['60-89'], cutx9, x9_60))
test2['Score'] = test2['x1_percentage'] + test2['x2_age'] + test2['x3_59'] + test2['x7_90'] + test2['x9_60'] + base
plt.figure(figsize=(14, 7))
sns.distplot(test2['Score'], bins=30, color='r')  # distribution of the final scores
plt.show()