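Problem description
Using a real, anonymized dataset from abroad, we build a model that predicts whether a given customer has already churned.
1. Load the data and separate features from labels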
import pandas as pd
import numpy as np

# Load the training and test sets
train_data = pd.read_csv('./Churn-Modelling.csv')
test_data = pd.read_csv('./Churn-Modelling-Test-Data.csv')

# The last column (Exited) is the label; every column before it is a feature
x_train = train_data.iloc[:, :-1]
y_train = train_data.iloc[:, -1].astype(int)
x_test = test_data.iloc[:, :-1]
y_test = test_data.iloc[:, -1].astype(int)
x_train.head()
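Data description:
RowNumber: row number
CustomerID: customer ID
Surname: customer surname
CreditScore: credit score
Geography: the customer's country/region
Gender: the customer's gender
Age: age
Tenure: number of years the customer has been with the bank
Balance: deposit/loan balance
NumOfProducts: number of bank products the customer uses
HasCrCard: whether the customer holds the bank's credit card
IsActiveMember: whether the customer is an active member
EstimatedSalary: estimated salary
Exited: whether the customer has churned; this column serves as our label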
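2. Feature engineering
2.1 Dropping uninformative features
# Drop the first three columns (RowNumber, CustomerID, Surname); they identify a customer but carry no predictive signal
x_train = x_train.drop(labels=x_train.columns[[0, 1, 2]], axis=1)
x_test = x_test.drop(labels=x_test.columns[[0, 1, 2]], axis=1)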
x_train.head()
y_train[:5]
0    1
1    0
2    1
3    0
4    0
Name: Exited, dtype: int32
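The positional drop above depends on the column order of the CSV. As a minimal sketch, the same step can be written by column name, assuming the headers match the names in the data description (the actual file may use different capitalization). The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the data description
toy = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerID": [1001, 1002],
    "Surname": ["A", "B"],
    "CreditScore": [619, 608],
    "Age": [42, 41],
})

# Assumed identifier columns; adjust if the real CSV header differs
id_columns = ["RowNumber", "CustomerID", "Surname"]

# drop(columns=...) raises a KeyError if a name is missing, so a header mismatch is caught early
features = toy.drop(columns=id_columns)
print(list(features.columns))
```

Dropping by name fails with an error if the file layout changes, whereas dropping by position would silently remove the wrong columns.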
2.2 Encoding the string features
# Geography and Gender are non-numeric; LabelEncoder maps each category to an integer code
from sklearn.preprocessing import LabelEncoder

# Fit each encoder on the training set, then reuse the same mapping on the test set
Lb1 = LabelEncoder()
x_train.iloc[:, 1] = Lb1.fit_transform(x_train.iloc[:, 1])
x_test.iloc[:, 1] = Lb1.transform(x_test.iloc[:, 1])

Lb2 = LabelEncoder()
x_train.iloc[:, 2] = Lb2.fit_transform(x_train.iloc[:, 2])
x_test.iloc[:, 2] = Lb2.transform(x_test.iloc[:, 2])
x_train[:5]
x_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null int64
Gender             10000 non-null int64
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
dtypes: float64(2), int64(8)
memory usage: 781.3 KB
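To make the encoding step concrete, here is a small sketch of how LabelEncoder assigns codes. The country names are illustrative values, not taken from the dataset. The encoder sorts the distinct labels and numbers them from 0, and the fitted mapping is then reused on the test data.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative category values (assumed for this example)
train_geo = ["France", "Spain", "Germany", "France"]
test_geo = ["Germany", "Spain"]

enc = LabelEncoder()
train_codes = enc.fit_transform(train_geo)  # learns the sorted label set, then encodes
test_codes = enc.transform(test_geo)        # reuses the mapping learned on the training data

# classes_ holds the labels in code order: index 0 -> code 0, and so on
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)
print(train_codes, test_codes)
```

Note that integer codes impose an ordering (here France < Germany < Spain) that a linear model will treat as meaningful. One-hot encoding is a common alternative for nominal features with more than two categories, though this article does not use it.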
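2.3 Standardizing the features
from sklearn.preprocessing import StandardScaler

# StandardScaler rescales each feature to zero mean and unit variance.
# Fit on the training set only, then apply the same statistics to the test set.
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)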
x_train[:5]
array([[-0.32622142, -0.90188624, -1.09598752,  0.29351742, -1.04175968,
        -1.22584767, -0.91158349,  0.64609167,  0.97024255,  0.02188649],
       [-0.44003595,  1.51506738, -1.09598752,  0.19816383, -1.38753759,
         0.11735002, -0.91158349, -1.54776799,  0.97024255,  0.21653375],
       [-1.53679418, -0.90188624, -1.09598752,  0.29351742,  1.03290776,
         1.33305335,  2.52705662,  0.64609167, -1.03067011,  0.2406869 ],
       [ 0.50152063, -0.90188624, -1.09598752,  0.00745665, -1.38753759,
        -1.22584767,  0.80773656, -1.54776799, -1.03067011, -0.10891792],
       [ 2.06388377,  1.51506738, -1.09598752,  0.38887101, -1.04175968,
         0.7857279 , -0.91158349,  0.64609167,  0.97024255, -0.36527578]])
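As a quick check of what the scaler does, the sketch below reproduces StandardScaler by hand on a made-up 3x2 array: each column has the training mean subtracted and is divided by the training standard deviation (the population form, ddof=0). The numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
train = np.array([[600.0, 30.0], [700.0, 40.0], [800.0, 50.0]])
test = np.array([[700.0, 60.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)
test_scaled = sc.transform(test)

# Manual version: statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, matching StandardScaler
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))
```

Because the test set is transformed with training-set statistics, no information from the test data leaks into preprocessing.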
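3. Modeling, prediction, and evaluation
# Build the model with logistic regression
from sklearn.linear_model import LogisticRegression

# Fit on the standardized training features, then predict labels for the test set
lr = LogisticRegression()
lr.fit(x_train, y_train)
lr_y_predict = lr.predict(x_test)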
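# Use the logistic regression model's built-in score method to get accuracy on the test and training sets
print('LogisticRegression test accuracy:', lr.score(x_test, y_test))
print('LogisticRegression training accuracy:', lr.score(x_train, y_train))
LogisticRegression test accuracy: 0.761
LogisticRegression training accuracy: 0.809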
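For a scikit-learn classifier, score returns mean accuracy, that is, the fraction of predictions that match the true labels. The sketch below computes that by hand on made-up label arrays. It also computes the accuracy of always predicting the most frequent class, a useful baseline when the classes are imbalanced.

```python
import numpy as np

# Made-up labels and predictions for illustration
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0])

# Accuracy: share of positions where prediction equals truth
accuracy = float(np.mean(y_true == y_pred))

# Baseline: always predict the most frequent class in y_true
majority_label = int(np.bincount(y_true).argmax())
baseline = float(np.mean(y_true == majority_label))

print(accuracy, baseline)
```

In this toy case the model is no more accurate than the constant baseline, which shows why accuracy alone can be misleading on imbalanced labels.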
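from sklearn.metrics import classification_report

# Use classification_report to get the other three metrics: precision, recall, and F1-score.
# Labels are sorted as [0, 1], so the first name must describe class 0 (not churned) and the second class 1 (churned).
print(classification_report(y_test, lr_y_predict, target_names=['UnExited', 'Exited']))
             precision    recall  f1-score   support

   UnExited       0.77      0.97      0.86       740
     Exited       0.68      0.15      0.25       260

avg / total       0.74      0.76      0.70      1000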
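The three columns of the report can be derived from simple counts. The following sketch computes precision, recall, and F1 for one class on made-up labels. It also confirms that the F1 value in the 260-sample row of the report follows from that row's precision (0.68) and recall (0.15). The helper name precision_recall_f1 is hypothetical, not a scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Hypothetical helper: per-class metrics from true/false positive counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels: class 1 is the rare class
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(precision, recall, f1)

# F1 is the harmonic mean of precision and recall; check it against the report's rounded values
f1_from_report = 2 * 0.68 * 0.15 / (0.68 + 0.15)
print(round(f1_from_report, 2))
```

Low recall means most true members of that class are missed, even when the predictions that are made for it are fairly precise.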
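The results show that test accuracy is only about 76%, barely above the 74% that would be obtained by always predicting the majority class (740 of the 1,000 test samples). Recall on the 260-sample minority class is just 0.15, so the model still has considerable room for improvement.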
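One common direction to explore, offered here only as a sketch and not as something the original analysis did, is to reweight the classes with LogisticRegression's class_weight='balanced' option and compare minority-class recall. The example runs on synthetic data from make_classification, not on the churn dataset, so whether it helps on the real data is untested.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: roughly 80% class 0, 20% class 1)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Same preprocessing pattern as above: fit on train, transform both
sc = StandardScaler()
X_tr = sc.fit_transform(X_tr)
X_te = sc.transform(X_te)

# Compare recall on class 1 with and without class reweighting
recalls = {}
for name, class_weight in [("default", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    model.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, model.predict(X_te))

print(recalls)
```

Reweighting typically trades some precision and overall accuracy for higher recall on the rare class, so the right setting depends on how costly a missed churner is compared with a false alarm.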