Predicting Potential Credit Card Customers with Machine Learning Models (XGBoost, LightGBM, and Random Forest)
As data science and machine learning mature, more and more companies are using these techniques to improve operational efficiency. In this post, I share how to use machine learning models to predict potential credit card customers. The project is based on code and files I have organized, and it covers the full workflow: data preprocessing, data visualization, model training, prediction, and saving the results.
Project Overview
This project aims to use machine learning models to predict which customers are most likely to become potential credit card customers. We will use three main models: XGBoost, LightGBM, and Random Forest. The main steps are:
1. Data preprocessing
2. Data visualization
3. Model training
4. Model prediction
5. Model saving
1. Data Preprocessing
Data preprocessing is a crucial step in any machine learning project. By cleaning and preparing the data, we can improve the model's performance and accuracy.
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Loading the dataset
df_train = pd.read_csv("dataset/train_s3TEQDk.csv")
df_train["source"] = "train"
df_test = pd.read_csv("dataset/test_mSzZ8RL.csv")
df_test["source"] = "test"
df = pd.concat([df_train, df_test], ignore_index=True)
df.head()
Checking and Cleaning the Dataset:
# Checking the columns of the dataset
df.columns
Index(['ID', 'Gender', 'Age', 'Region_Code', 'Occupation', 'Channel_Code', 'Vintage', 'Credit_Product', 'Avg_Account_Balance', 'Is_Active', 'Is_Lead', 'source'], dtype='object')
# Checking the shape
df.shape
(351037, 12)
# Checking unique values
df.nunique()
ID                     351037
Gender                      2
Age                        63
Region_Code                35
Occupation                  4
Channel_Code                4
Vintage                    66
Credit_Product              2
Avg_Account_Balance    162137
Is_Active                   2
Is_Lead                     2
source                      2
dtype: int64
# Check for null values
df.isnull().sum()
ID                          0
Gender                      0
Age                         0
Region_Code                 0
Occupation                  0
Channel_Code                0
Vintage                     0
Credit_Product          41847
Avg_Account_Balance         0
Is_Active                   0
Is_Lead                105312
source                      0
dtype: int64
Observation:
Null values are present in the Credit_Product column. The 105,312 nulls in Is_Lead are expected: they correspond to the test rows, which carry no target label.
# Fill null values in the Credit_Product feature with a separate "NA" category
df['Credit_Product'] = df['Credit_Product'].fillna("NA")
# Check for null values again
df.isnull().sum()
ID                          0
Gender                      0
Age                         0
Region_Code                 0
Occupation                  0
Channel_Code                0
Vintage                     0
Credit_Product              0
Avg_Account_Balance         0
Is_Active                   0
Is_Lead                105312
source                      0
dtype: int64
# Checking datatypes and general info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351037 entries, 0 to 351036
Data columns (total 12 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   ID                   351037 non-null  object
 1   Gender               351037 non-null  object
 2   Age                  351037 non-null  int64
 3   Region_Code          351037 non-null  object
 4   Occupation           351037 non-null  object
 5   Channel_Code         351037 non-null  object
 6   Vintage              351037 non-null  int64
 7   Credit_Product       351037 non-null  object
 8   Avg_Account_Balance  351037 non-null  int64
 9   Is_Active            351037 non-null  object
 10  Is_Lead              245725 non-null  float64
 11  source               351037 non-null  object
dtypes: float64(1), int64(3), object(8)
memory usage: 32.1+ MB
# Changing Yes to 1 and No to 0 in the Is_Active column to convert it to numeric
df["Is_Active"] = df["Is_Active"].replace(["Yes", "No"], [1, 0]).astype(float)
df.head()
# Now converting all categorical columns to numerical form using label encoding
cat_col = ['Gender', 'Region_Code', 'Occupation', 'Channel_Code', 'Credit_Product']
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in cat_col:
    df[col] = le.fit_transform(df[col])
df_2 = df
df_2.head()
# Separating the train and test rows again (copy to avoid SettingWithCopyWarning)
df_train = df_2.loc[df_2["source"] == "train"].copy()
df_test = df_2.loc[df_2["source"] == "test"].copy()
df_1 = df_train
# We can drop these columns: they are identifiers with no predictive value
df_1.drop(columns=['ID', "source"], inplace=True)
df_1.head()
2. Data Visualization
Data visualization helps us better understand the distribution and characteristics of the data. Below are some common visualizations:
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 16
sns.set_style("whitegrid")
# Distribution of Age (distplot is deprecated in recent seaborn; histplot is the replacement)
sns.histplot(df['Age'], kde=True)
plt.show()

# Distribution of Avg_Account_Balance
sns.histplot(df['Avg_Account_Balance'], kde=True)
plt.show()
# Countplot for the Gender feature
sns.countplot(x='Gender', data=df, palette='Accent')
plt.show()
# Countplot for the target variable, 'Is_Lead'
target = 'Is_Lead'
sns.countplot(x=target, data=df, palette='hls')
print(df[target].value_counts())
0.0    187437
1.0     58288
Name: Is_Lead, dtype: int64
plt.rcParams['figure.figsize'] = (12,6)
# Lead status by customer occupation
sns.countplot(x='Occupation', hue='Is_Lead', data=df, palette='magma')
plt.show()
# Mean age by occupation, split by whether the customer was active in the last 3 months
sns.catplot(y='Age', x='Occupation', hue='Is_Active', data=df, kind='bar', palette='Oranges')
plt.show()
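One additional plot that often helps at this stage (not part of the original notebook) is a correlation heatmap of the label-encoded features; a minimal sketch, assuming df_1 from the preprocessing step above:

# Correlation heatmap of the encoded training features (illustrative addition)
plt.figure(figsize=(10, 8))
sns.heatmap(df_1.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlations")
plt.show()

Since every column of df_1 is numeric after label encoding, .corr() works directly and also shows each feature's linear relationship with Is_Lead.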
3. Model Training
We will train three models: XGBoost, LightGBM, and Random Forest. This part compares several baseline algorithms and tunes the Random Forest; XGBoost and LightGBM follow in part two (see the sketch at the end of this section). The training code is below:
# To balance the dataset, we will apply undersampling
from sklearn.utils import resample

# Separate the minority and majority classes
df_majority = df_1[df_1['Is_Lead'] == 0]
df_minority = df_1[df_1['Is_Lead'] == 1]
print("The majority class values are", len(df_majority))
print("The minority class values are", len(df_minority))
print("The ratio of both classes is", len(df_majority) / len(df_minority))
The majority class values are 187437
The minority class values are 58288
The ratio of both classes is 3.215704776283283
# Undersample the majority class (replace=False: we sample down without replacement)
df_majority_undersampled = resample(df_majority,
                                    replace=False,
                                    n_samples=len(df_minority),
                                    random_state=0)
# Combine the minority class with the undersampled majority class
df_undersampled = pd.concat([df_minority, df_majority_undersampled])
df_undersampled['Is_Lead'].value_counts()
df_1 = df_undersampled
# Display the new class balance
print("The undersampled class values count is:", len(df_undersampled))
print("The ratio of both classes is",
      len(df_undersampled[df_undersampled["Is_Lead"] == 0]) / len(df_undersampled[df_undersampled["Is_Lead"] == 1]))
The undersampled class values count is: 116576
The ratio of both classes is 1.0
# Dropping the target variable
# Assign X and y for the training and testing phase
xc = df_1.drop(columns=['Is_Lead'])
yc = df_1[["Is_Lead"]]
df_1.head()
# Importing the necessary libraries
from sklearn import metrics
from scipy.stats import zscore
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.decomposition import PCA
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             precision_score, recall_score, f1_score,
                             roc_auc_score, roc_curve, auc)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Standardizing X with StandardScaler so all features are on a comparable scale
sc = StandardScaler()
df_xc = pd.DataFrame(sc.fit_transform(xc), columns=xc.columns)
df_xc.head()
# Helper that fits a model on a stratified hold-out split and reports its scores
def max_accuracy_scr(names, model_c, df_xc, yc):
    roc_scr_max = 0
    train_xc, test_xc, train_yc, test_yc = train_test_split(
        df_xc, yc, random_state=42, test_size=0.2, stratify=yc)
    model_c.fit(train_xc, train_yc)
    pred = model_c.predict_proba(test_xc)[:, 1]
    roc_score = roc_auc_score(test_yc, pred)
    accuracy_scr = accuracy_score(test_yc, model_c.predict(test_xc))
    if roc_score > roc_scr_max:
        roc_scr_max = roc_score
        final_model = model_c
        # 5-fold cross-validated accuracy, computed once and reused
        cross_val = cross_val_score(final_model, df_xc, yc, cv=5, scoring="accuracy")
        mean_acc = cross_val.mean()
        std_dev = cross_val.std()
    print("*" * 50)
    print("Results for model : ", names, '\n',
          "Max ROC-AUC score on the hold-out split : ", roc_scr_max, '\n',
          "Mean accuracy score is : ", mean_acc, '\n',
          "Std deviation score is : ", std_dev, '\n',
          "Cross validation scores are : ", cross_val)
    print(f"roc_auc_score: {roc_score}")
    print("*" * 50)
# Comparing multiple algorithms to find the one that performs best on our dataset
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('Decision Tree Classifier', DecisionTreeClassifier()))
models.append(("GaussianNB", GaussianNB()))

for names, model_c in models:
    max_accuracy_scr(names, model_c, df_xc, yc)
**************************************************
Results for model :  Logistic Regression
 Max ROC-AUC score on the hold-out split :  0.727315712597147
 Mean accuracy score is :  0.6696918411779096
 Std deviation score is :  0.0030322593046897828
 Cross validation scores are :  [0.67361469 0.66566588 0.66703839 0.67239974 0.66974051]
roc_auc_score: 0.727315712597147
**************************************************
**************************************************
Results for model :  Random Forest
 Max ROC-AUC score on the hold-out split :  0.8792762631904103
 Mean accuracy score is :  0.8117279862602139
 Std deviation score is :  0.002031698139189051
 Cross validation scores are :  [0.81043061 0.81162342 0.81158053 0.81115162 0.81616985]
roc_auc_score: 0.8792762631904103
**************************************************
**************************************************
Results for model :  Decision Tree Classifier
 Max ROC-AUC score on the hold-out split :  0.7397495282209642
 Mean accuracy score is :  0.7426399792028343
 Std deviation score is :  0.0025271129138200485
 Cross validation scores are :  [0.74288043 0.74162556 0.74149689 0.73870899 0.74462792]
roc_auc_score: 0.7397495282209642
**************************************************
**************************************************
Results for model :  GaussianNB
 Max ROC-AUC score on the hold-out split :  0.7956111563031266
 Mean accuracy score is :  0.7158677336619202
 Std deviation score is :  0.0015884106712636206
 Cross validation scores are :  [0.71894836 0.71550504 0.71546215 0.71443277 0.71499035]
roc_auc_score: 0.7956111563031266
**************************************************
First Attempt: Random Forest Classifier
# Estimating the best n_estimators via grid search for the Random Forest classifier
parameters = {"n_estimators": [1, 10, 100]}
rf_clf = RandomForestClassifier()
clf = GridSearchCV(rf_clf, parameters, cv=5, scoring="roc_auc")
clf.fit(df_xc, yc)
print("Best parameter : ", clf.best_params_,
      "\nBest Estimator : ", clf.best_estimator_,
      "\nBest Score : ", clf.best_score_)
Best parameter :  {'n_estimators': 100}
Best Estimator :  RandomForestClassifier()
Best Score :  0.8810508979668068
# Re-running the Random Forest classifier with n_estimators=100
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
max_accuracy_scr("RandomForest Classifier", rf_clf, df_xc, yc)
**************************************************
Results for model :  RandomForest Classifier
 Max ROC-AUC score on the hold-out split :  0.879415808805665
 Mean accuracy score is :  0.8115392510996895
 Std deviation score is :  0.0008997445291505284
 Cross validation scores are :  [0.81180305 0.81136607 0.81106584 0.81037958 0.81308171]
roc_auc_score: 0.879415808805665
**************************************************
xc_train, xc_test, yc_train, yc_test = train_test_split(
    df_xc, yc, random_state=80, test_size=0.20, stratify=yc)
rf_clf.fit(xc_train, yc_train)
yc_pred = rf_clf.predict(xc_test)
plt.rcParams['figure.figsize'] = (12,8)
# Random Forest classifier results
pred_pb = rf_clf.predict_proba(xc_test)[:, 1]
Fpr, Tpr, thresholds = roc_curve(yc_test, pred_pb, pos_label=1)
roc_auc = roc_auc_score(yc_test, pred_pb)  # renamed to avoid shadowing sklearn's auc()
print("ROC_AUC score is ", roc_auc)
print("accuracy score is : ", accuracy_score(yc_test, yc_pred))
print("Precision is : ", precision_score(yc_test, yc_pred))
print("Recall is: ", recall_score(yc_test, yc_pred))
print("F1 Score is : ", f1_score(yc_test, yc_pred))
print("classification report \n", classification_report(yc_test, yc_pred))

# Plotting the confusion matrix
cnf = confusion_matrix(yc_test, yc_pred)
sns.heatmap(cnf, annot=True, cmap="magma")
ROC_AUC score is  0.8804566893762799
accuracy score is :  0.8127466117687425
Precision is :  0.8397949673811743
Recall is:  0.7729456167438669
F1 Score is :  0.8049848132928354
classification report
               precision    recall  f1-score   support

         0.0       0.79      0.85      0.82     11658
         1.0       0.84      0.77      0.80     11658

    accuracy                           0.81     23316
   macro avg       0.81      0.81      0.81     23316
weighted avg       0.81      0.81      0.81     23316
plt.rcParams['figure.figsize'] = (12,6)
# Plotting the ROC curve for the Random Forest classifier
plt.plot([0, 1], [0, 1], 'g--')  # chance line
plt.plot(Fpr, Tpr)
plt.xlabel('False_Positive_Rate')
plt.ylabel('True_Positive_Rate')
plt.title("Random Forest Classifier")
plt.show()
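The XGBoost and LightGBM runs, along with prediction and model saving, are covered in part two (linked below). As a preview, here is a minimal sketch of how both models would plug into the same max_accuracy_scr harness, assuming the xgboost and lightgbm packages are installed; the parameters shown are illustrative defaults, not the tuned values from part two:

# Minimal sketch: scoring XGBoost and LightGBM with the same helper
# (assumes `pip install xgboost lightgbm`; parameters are illustrative)
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

boosted_models = [
    ("XGBoost", XGBClassifier(n_estimators=100, eval_metric="logloss")),
    ("LightGBM", LGBMClassifier(n_estimators=100)),
]
for names, model_c in boosted_models:
    max_accuracy_scr(names, model_c, df_xc, yc)

Both classifiers expose the scikit-learn fit/predict_proba interface, so the helper defined above works unchanged.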
Predicting Potential Credit Card Customers with Machine Learning Models (XGBoost, LightGBM, and Random Forest), Part 2: https://developer.aliyun.com/article/1535286