[阿旭 Machine Learning in Action] [32] Predicting Whether Bank Clients Will Open a Term Deposit Account -- Logistic Regression

Summary: [阿旭 Machine Learning in Action] [32] Predicting whether bank clients will open a term deposit account, using logistic regression.

1. Load and inspect the data

Follow the WeChat official account 阿旭算法与机器学习 and reply "ML32" to get the dataset, source code, and project documentation for this article.

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
data = pd.read_csv('banking.csv', header=0)
data = data.dropna()
print(data.shape)
print(list(data.columns))
(41188, 21)
['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'y']
data.head()
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed y
0 44 blue-collar married basic.4y unknown yes no cellular aug thu ... 1 999 0 nonexistent 1.4 93.444 -36.1 4.963 5228.1 0
1 53 technician married unknown no no no cellular nov fri ... 1 999 0 nonexistent -0.1 93.200 -42.0 4.021 5195.8 0
2 28 management single university.degree no yes no cellular jun thu ... 3 6 2 success -1.7 94.055 -39.8 0.729 4991.6 1
3 39 services married high.school no no no cellular apr fri ... 2 999 0 nonexistent -1.8 93.075 -47.1 1.405 5099.1 0
4 55 retired married basic.4y no yes no cellular aug fri ... 1 3 1 success -2.9 92.201 -31.4 0.869 5076.2 1

5 rows × 21 columns

Meaning of each column

bank client data:

  • 1 - age (numeric)
  • 2 - job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
  • 3 - marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
  • 4 - education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
  • 5 - default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
  • 6 - housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
  • 7 - loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)

related with the last contact of the current campaign:

  • 8 - contact: contact communication type (categorical: ‘cellular’,‘telephone’)
  • 9 - month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  • 10 - day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
  • 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model (see the sketch after this column list).

other attributes:

  • 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; * 999 means client was not previously contacted)
  • 14 - previous: number of contacts performed before this campaign and for this client (numeric)
  • 15 - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

social and economic context attributes

  • 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
  • 17 - cons.price.idx: consumer price index - monthly indicator (numeric)
  • 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  • 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
  • 20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

  • 21 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
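As noted for column 11 above, duration effectively leaks the outcome and should be removed if the goal is a realistic predictive model. This walkthrough keeps it for benchmarking, but dropping it is a one-liner; a minimal sketch against the data DataFrame loaded above (data_realistic is a hypothetical name):

# Optional: drop 'duration' to avoid target leakage in a realistic deployment setting
data_realistic = data.drop('duration', axis=1)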

2. Data analysis and preprocessing

data['education'].unique()
array(['basic.4y', 'unknown', 'university.degree', 'high.school',
       'basic.9y', 'professional.course', 'basic.6y', 'illiterate'],
      dtype=object)
# Merge the education levels basic.9y, basic.6y and basic.4y into a single 'Basic' category
data['education']=np.where(data['education'] =='basic.9y', 'Basic', data['education'])
data['education']=np.where(data['education'] =='basic.6y', 'Basic', data['education'])
data['education']=np.where(data['education'] =='basic.4y', 'Basic', data['education'])
data['education'].unique()
array(['Basic', 'unknown', 'university.degree', 'high.school',
       'professional.course', 'illiterate'], dtype=object)
data['y'].value_counts()
0    36548
1     4640
Name: y, dtype: int64
# Plot the label distribution
sns.countplot(x='y', data = data, palette='hls')
plt.savefig('count_plot')  # save before show, otherwise the saved figure is blank
plt.show()

count_no_sub = len(data[data['y']==0])
count_sub = len(data[data['y']==1])
pct_of_no_sub = count_no_sub/(count_no_sub+count_sub)
print('Percentage of clients who did not subscribe:  %.2f%%' % (pct_of_no_sub*100))
pct_of_sub = count_sub/(count_no_sub+count_sub)
print('Percentage of clients who subscribed:  %.2f%%' % (pct_of_sub*100))
Percentage of clients who did not subscribe:  88.73%
Percentage of clients who subscribed:  11.27%

As the output shows, the classes are highly imbalanced (roughly 89% vs. 11%), which would bias model training. We deal with this later by oversampling.
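The same proportions can also be read directly from value_counts; a minimal equivalent against the same data DataFrame:

# Equivalent one-liner: class proportions of the target column
print(data['y'].value_counts(normalize=True))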

data.groupby('y').mean()  # on pandas >= 2.0, use .mean(numeric_only=True) because non-numeric columns are still present
age duration campaign pdays previous emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed
y
0 39.911185 220.844807 2.633085 984.113878 0.132374 0.248875 93.603757 -40.593097 3.811491 5176.166600
1 40.913147 553.191164 2.051724 792.035560 0.492672 -1.233448 93.354386 -39.789784 2.123135 5095.115991

2.1 Exploratory analysis

The average age of clients who purchased a term deposit is higher than that of clients who did not.

Clients who purchased a term deposit have a lower average pdays (days since the client was last contacted in a previous campaign); the more recent the last contact, the fresher the memory of the call and the better the chance of a sale.

Surprisingly, clients who purchased a term deposit received fewer calls during this campaign.

We can also look at how other features, such as job, marital status and education, are distributed with respect to the outcome to understand the data in more detail.

data.groupby('job').mean()
age duration campaign pdays previous emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed y
job
admin. 38.187296 254.312128 2.623489 954.319229 0.189023 0.015563 93.534054 -40.245433 3.550274 5164.125350 0.129726
blue-collar 39.555760 264.542360 2.558461 985.160363 0.122542 0.248995 93.656656 -41.375816 3.771996 5175.615150 0.068943
entrepreneur 41.723214 263.267857 2.535714 981.267170 0.138736 0.158723 93.605372 -41.283654 3.791120 5176.313530 0.085165
housemaid 45.500000 250.454717 2.639623 960.579245 0.137736 0.433396 93.676576 -39.495283 4.009645 5179.529623 0.100000
management 42.362859 257.058140 2.476060 962.647059 0.185021 -0.012688 93.522755 -40.489466 3.611316 5166.650513 0.112175
retired 62.027326 273.712209 2.476744 897.936047 0.327326 -0.698314 93.430786 -38.573081 2.770066 5122.262151 0.252326
self-employed 39.949331 264.142153 2.660802 976.621393 0.143561 0.094159 93.559982 -40.488107 3.689376 5170.674384 0.104856
services 37.926430 258.398085 2.587805 979.974049 0.154951 0.175359 93.634659 -41.290048 3.699187 5171.600126 0.081381
student 25.894857 283.683429 2.104000 840.217143 0.524571 -1.408000 93.331613 -40.187543 1.884224 5085.939086 0.314286
technician 38.507638 250.232241 2.577339 964.408127 0.153789 0.274566 93.561471 -39.927569 3.820401 5175.648391 0.108260
unemployed 39.733728 249.451677 2.564103 935.316568 0.199211 -0.111736 93.563781 -40.007594 3.466583 5157.156509 0.142012
unknown 45.563636 239.675758 2.648485 938.727273 0.154545 0.357879 93.718942 -38.797879 3.949033 5172.931818 0.112121
data.groupby('marital').mean()
age duration campaign pdays previous emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed y
marital
divorced 44.899393 253.790330 2.61340 968.639853 0.168690 0.163985 93.606563 -40.707069 3.715603 5170.878643 0.103209
married 42.307165 257.438623 2.57281 967.247673 0.155608 0.183625 93.597367 -40.270659 3.745832 5171.848772 0.101573
single 33.158714 261.524378 2.53380 949.909578 0.211359 -0.167989 93.517300 -40.918698 3.317447 5155.199265 0.140041
unknown 40.275000 312.725000 3.18750 937.100000 0.275000 -0.221250 93.471250 -40.820000 3.313038 5157.393750 0.150000
data.groupby('education').mean()
age duration campaign pdays previous emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed y
education
Basic 42.163910 263.043874 2.559498 974.877967 0.141053 0.191329 93.639933 -40.927595 3.729654 5172.014113 0.087029
high.school 37.998213 260.886810 2.568576 964.358382 0.185917 0.032937 93.584857 -40.940641 3.556157 5164.994735 0.108355
illiterate 48.500000 276.777778 2.277778 943.833333 0.111111 -0.133333 93.317333 -39.950000 3.516556 5171.777778 0.222222
professional.course 40.080107 252.533855 2.586115 960.765974 0.163075 0.173012 93.569864 -40.124108 3.710457 5170.155979 0.113485
university.degree 38.879191 253.223373 2.563527 951.807692 0.192390 -0.028090 93.493466 -39.975805 3.529663 5163.226298 0.137245
unknown 43.481225 262.390526 2.596187 942.830734 0.226459 0.059099 93.658615 -39.877816 3.571098 5159.549509 0.145003
%matplotlib inline
table=pd.crosstab(data.job,data.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Job title vs Purchase')
plt.xlabel('Job')
plt.ylabel('Proportion of Purchase')
plt.savefig('purchase_vs_job')

Purchase rates differ noticeably across job titles, so job title can be a good predictor of the outcome.

table=pd.crosstab(data.marital,data.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Marital Status vs Purchase')
plt.xlabel('Marital Status')
plt.ylabel('Proportion of Customers')
plt.savefig('mariral_vs_pur_stack')

Marital status does not appear to be a good predictor of the outcome.

table=pd.crosstab(data.education,data.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Education vs Purchase')
plt.xlabel('Education')
plt.ylabel('Proportion of Customers')
plt.savefig('edu_vs_pur_stack')

Education appears to be a good predictor of the outcome variable.

table=pd.crosstab(data.day_of_week,data.y)#.plot(kind='bar')
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Day of Week vs Purchase')
plt.xlabel('Day of Week')
plt.ylabel('Proportion of Purchase')
plt.savefig('dow_vs_purchase')

Day of the week does not appear to be a good predictor of the outcome.

2.2 Select features and one-hot encode the categorical variables

cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']
# One-hot encode each of the categorical features listed above
for var in cat_vars:
    cat_list = pd.get_dummies(data[var], prefix=var)
    data=data.join(cat_list)
# Drop the original categorical columns listed in cat_vars
data_final=data.drop(cat_vars, axis=1)
data_final.columns.values
array(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp_var_rate',
       'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'y',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur',
       'job_housemaid', 'job_management', 'job_retired',
       'job_self-employed', 'job_services', 'job_student',
       'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_Basic', 'education_high.school',
       'education_illiterate', 'education_professional.course',
       'education_university.degree', 'education_unknown', 'default_no',
       'default_unknown', 'default_yes', 'housing_no', 'housing_unknown',
       'housing_yes', 'loan_no', 'loan_unknown', 'loan_yes',
       'contact_cellular', 'contact_telephone', 'month_apr', 'month_aug',
       'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may',
       'month_nov', 'month_oct', 'month_sep', 'day_of_week_fri',
       'day_of_week_mon', 'day_of_week_thu', 'day_of_week_tue',
       'day_of_week_wed', 'poutcome_failure', 'poutcome_nonexistent',
       'poutcome_success'], dtype=object)
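For reference, pandas can usually produce the same encoding in a single call instead of the loop above; a minimal sketch, assuming the same cat_vars list:

# Equivalent one-liner: encode the listed columns and drop the originals in one step
data_final = pd.get_dummies(data, columns=cat_vars)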

2.3 Handling class imbalance with SMOTE oversampling

After splitting off the training data, I upsample the subscribed clients with the SMOTE algorithm (Synthetic Minority Over-sampling Technique). At a high level, SMOTE:

  • creates synthetic samples of the minority class (clients who subscribed) rather than simply duplicating existing ones;
  • for a minority sample, randomly picks one of its k nearest neighbours and uses it to generate a new, similar but randomly perturbed observation.
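To make the interpolation step concrete, here is a minimal NumPy sketch of the core idea; it is illustrative only, not the imbalanced-learn implementation used below, and smote_like_sample is a hypothetical helper name:

import numpy as np

def smote_like_sample(X_minority, k=5, rng=np.random.default_rng(0)):
    # pick a random minority-class sample
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    # find its k nearest minority-class neighbours (excluding itself)
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbours = np.argsort(dists)[1:k+1]
    neighbour = X_minority[rng.choice(neighbours)]
    # interpolate at a random point on the segment between the two samples
    gap = rng.random()
    return x + gap * (neighbour - x)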

imbalanced-learn can be installed with:

conda install -c conda-forge imbalanced-learn

X = data_final.loc[:, data_final.columns != 'y']
y = data_final.loc[:, data_final.columns == 'y'].values.ravel()
from imblearn.over_sampling import SMOTE
os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns
os_data_X,os_data_y=os.fit_resample(X_train, y_train)
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['y'])
# Check the size and class balance of the resampled training data
print("Samples after oversampling: ", len(os_data_X))
print("Clients who did not subscribe: ", len(os_data_y[os_data_y['y']==0]))
print("Clients who subscribed: ", len(os_data_y[os_data_y['y']==1]))
print("Proportion who did not subscribe: ", len(os_data_y[os_data_y['y']==0])/len(os_data_X))
print("Proportion who subscribed: ", len(os_data_y[os_data_y['y']==1])/len(os_data_X))
Samples after oversampling:  51134
Clients who did not subscribe:  25567
Clients who subscribed:  25567
Proportion who did not subscribe:  0.5
Proportion who subscribed:  0.5

We now have a perfectly balanced 1:1 training set. Note that SMOTE is applied only to the training data, so the test set keeps the real class distribution and gives an honest estimate of performance.
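If you later move to cross-validation, one way to keep the oversampling strictly inside each training fold is imbalanced-learn's pipeline; a minimal sketch, assuming the same X and y defined above:

from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# SMOTE is refit on the training portion of every fold, never on the validation portion
pipe = Pipeline([('smote', SMOTE(random_state=0)),
                 ('clf', LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(scores.mean())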

3. Train the model

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(os_data_X, os_data_y.values.reshape(-1))
y_pred = logreg.predict(X_test)
print('Accuracy on the test set: {:.2f}'.format(logreg.score(X_test, y_test)))
Accuracy on the test set: 0.90
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.95      0.93      0.94     10981
           1       0.52      0.63      0.57      1376

    accuracy                           0.90     12357
   macro avg       0.74      0.78      0.76     12357
weighted avg       0.90      0.90      0.90     12357
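The per-class report shows decent recall but low precision on the minority class; the raw confusion matrix gives the same picture in counts. A minimal sketch using the y_pred computed above:

from sklearn.metrics import confusion_matrix
# rows are true classes (0, 1), columns are predicted classes
print(confusion_matrix(y_test, y_pred))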
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])  # AUC should use predicted probabilities, not hard labels
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

