基于机器学习模型预测信用卡潜在用户(XGBoost、LightGBM和Random Forest)(一)

简介: 基于机器学习模型预测信用卡潜在用户(XGBoost、LightGBM和Random Forest)(一)

基于机器学习模型预测信用卡潜在用户(XGBoost、LightGBM和Random Forest)

随着数据科学和机器学习的发展,越来越多的企业开始利用这些技术来提高运营效率。在这篇博客中,我将分享如何利用机器学习模型来预测信用卡的潜在客户。此项目基于我整理的代码和文件,涉及数据预处理、数据可视化、模型训练、预测及结果保存的完整流程。

项目概述

本项目旨在使用机器学习模型预测哪些客户最有可能成为信用卡的潜在客户。我们将使用三个主要的机器学习模型:XGBoost、LightGBM和随机森林(Random Forest)。以下是项目的主要步骤:

1、数据预处理
2、数据可视化

3、模型训练
4、模型预测
5、模型保存

1. 数据预处理

数据预处理是机器学习项目中至关重要的一步。通过清洗和准备数据,我们可以提高模型的性能和准确性。

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns
#Loading the dataset
df_train=pd.read_csv("dataset/train_s3TEQDk.csv")
df_train["source"]="train"
df_test=pd.read_csv("dataset/test_mSzZ8RL.csv")
df_test["source"]="test"
df=pd.concat([df_train,df_test],ignore_index=True)
df.head()

image.png

image.png

1. Checking and Cleaning Dataset :

#Checking columns of dataset
df.columns
Index(['ID', 'Gender', 'Age', 'Region_Code', 'Occupation', 'Channel_Code',
       'Vintage', 'Credit_Product', 'Avg_Account_Balance', 'Is_Active',
       'Is_Lead', 'source'],
      dtype='object')
#Checking shape 
df.shape
(351037, 12)
#Checking unique values 
df.nunique()
ID                     351037
Gender                      2
Age                        63
Region_Code                35
Occupation                  4
Channel_Code                4
Vintage                    66
Credit_Product              2
Avg_Account_Balance    162137
Is_Active                   2
Is_Lead                     2
source                      2
dtype: int64
#Check for Null Values
df.isnull().sum()
ID                          0
Gender                      0
Age                         0
Region_Code                 0
Occupation                  0
Channel_Code                0
Vintage                     0
Credit_Product          41847
Avg_Account_Balance         0
Is_Active                   0
Is_Lead                105312
source                      0
dtype: int64

Observation:

Null values are present in Credit _Product column.

#Fill null values in Credit_Product feature
df['Credit_Product']= df['Credit_Product'].fillna("NA")
#Again check for null values
df.isnull().sum()
ID                          0
Gender                      0
Age                         0
Region_Code                 0
Occupation                  0
Channel_Code                0
Vintage                     0
Credit_Product              0
Avg_Account_Balance         0
Is_Active                   0
Is_Lead                105312
source                      0
dtype: int64
#Checking Datatypes and info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351037 entries, 0 to 351036
Data columns (total 12 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   ID                   351037 non-null  object 
 1   Gender               351037 non-null  object 
 2   Age                  351037 non-null  int64  
 3   Region_Code          351037 non-null  object 
 4   Occupation           351037 non-null  object 
 5   Channel_Code         351037 non-null  object 
 6   Vintage              351037 non-null  int64  
 7   Credit_Product       351037 non-null  object 
 8   Avg_Account_Balance  351037 non-null  int64  
 9   Is_Active            351037 non-null  object 
 10  Is_Lead              245725 non-null  float64
 11  source               351037 non-null  object 
dtypes: float64(1), int64(3), object(8)
memory usage: 32.1+ MB
#Changing Yes to 1 and No to 0 in Is_Active column to covert  data into float

df["Is_Active"].replace(["Yes","No"],[1,0],inplace=True)

df['Is_Active'] = df['Is_Active'].astype(float)
df.head()

image.png

image.png

#Now changing all categorical column into numerical form using label endcoding
cat_col=[ 'Gender', 'Region_Code', 'Occupation','Channel_Code', 'Credit_Product']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in cat_col:
    df[col]= le.fit_transform(df[col])


df_2= df
df_2.head()

image.png

image.png


#Separating the train and test
df_train=df_2.loc[df_2["source"]=="train"]
df_test=df_2.loc[df_2["source"]=="test"]
df_1 = df_train
#we can drop column as they are irrelevant and have no effect on our data
df_1.drop(columns=['ID',"source"],inplace=True)
df_1.head()

image.png

image.png

2. 数据可视化

数据可视化有助于我们更好地理解数据的分布和特征。以下是一些常用的数据可视化方法:

import warnings
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize']  = (10,6)
plt.rcParams['font.size']  = 16
sns.set_style("whitegrid")

sns.distplot(df['Age']);

sns.distplot(df['Avg_Account_Balance'])
plt.show()

#Countplot for Gender feature 
# plt.figure(figsize=(8,4))
sns.countplot(df['Gender'],palette='Accent')
plt.show()

#Countplot for Target variable i.e 'Is_Lead'
target = 'Is_Lead'
# plt.figure(figsize=(8,4))
sns.countplot(df[target],palette='hls')
print(df[target].value_counts())
0.0    187437
1.0     58288
Name: Is_Lead, dtype: int64
plt.rcParams['figure.figsize']  = (12,6)
#Checking occupation with customers
# plt.figure(figsize=(8,4))
sns.countplot(x='Occupation',hue='Is_Lead',data=df,palette= 'magma')
plt.show()

#Plot showing Activness of customer in last 3 months with respect to Occupation of customer
# plt.figure(figsize=(8,4))
sns.catplot(y='Age',x='Occupation',hue='Is_Active',data=df,kind='bar',palette='Oranges')
plt.show()

3. 模型训练

我们将使用三个模型进行训练:XGBoost、LightGBM和随机森林。以下是模型的训练代码:

# To balance the dataset , we will apply undersampling method
from sklearn.utils import resample
# separate the minority and majority classes
df_majority = df_1[df_1['Is_Lead']==0]
df_minority = df_1[df_1['Is_Lead']==1]

print(" The majority class values are", len(df_majority))
print(" The minority class values are", len(df_minority))
print(" The ratio of both classes are", len(df_majority)/len(df_minority))
 The majority class values are 187437
 The minority class values are 58288
 The ratio of both classes are 3.215704776283283
# undersample majority class
df_majority_undersampled = resample(df_majority, replace=True, n_samples=len(df_minority), random_state=0)
# combine minority class with oversampled majority class
df_undersampled = pd.concat([df_minority, df_majority_undersampled])

df_undersampled['Is_Lead'].value_counts()
df_1=df_undersampled

# display new class value counts
print(" The undersamples class values count is:", len(df_undersampled))
print(" The ratio of both classes are", len(df_undersampled[df_undersampled["Is_Lead"]==0])/len(df_undersampled[df_undersampled["Is_Lead"]==1]))

 The undersamples class values count is: 116576
 The ratio of both classes are 1.0
# dropping target variable 
#assign the value of y for training and testing phase
xc = df_1.drop(columns=['Is_Lead'])
yc = df_1[["Is_Lead"]]
df_1.head()

image.png

image.png


#Importing necessary libraries
from sklearn import metrics
from scipy.stats import zscore
from sklearn.preprocessing import LabelEncoder,StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.decomposition import PCA
from sklearn.metrics import precision_score, recall_score, confusion_matrix, f1_score, roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix,roc_auc_score,roc_curve
from sklearn.metrics import auc
from sklearn.metrics import plot_roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

#Import warnings
import warnings
warnings.filterwarnings('ignore')
#Standardizing value of x by using standardscaler to make the data normally distributed
sc = StandardScaler()
df_xc = pd.DataFrame(sc.fit_transform(xc),columns=xc.columns)
df_xc.head()


image.png

image.png

#defining a function to find fit of the model

def max_accuracy_scr(names,model_c,df_xc,yc):
    accuracy_scr_max = 0
    roc_scr_max=0
    train_xc,test_xc,train_yc,test_yc = train_test_split(df_xc,yc,random_state = 42,test_size = 0.2,stratify = yc)
    model_c.fit(train_xc,train_yc)
    pred = model_c.predict_proba(test_xc)[:, 1]
    roc_score = roc_auc_score(test_yc, pred)
    accuracy_scr = accuracy_score(test_yc,model_c.predict(test_xc))
    if roc_score> roc_scr_max:
        roc_scr_max=roc_score
        final_model = model_c
        mean_acc = cross_val_score(final_model,df_xc,yc,cv=5,scoring="accuracy").mean()
        std_dev = cross_val_score(final_model,df_xc,yc,cv=5,scoring="accuracy").std()
        cross_val = cross_val_score(final_model,df_xc,yc,cv=5,scoring="accuracy")
    print("*"*50)
    print("Results for model : ",names,'\n',
          "max roc score correspond to random state " ,roc_scr_max ,'\n',
          "Mean accuracy score is : ",mean_acc,'\n',
          "Std deviation score is : ",std_dev,'\n',
          "Cross validation scores are :  " ,cross_val) 
    print(f"roc_auc_score: {roc_score}")
    print("*"*50)
#Now by using multiple Algorithms we are calculating the best Algo which performs best for our data set 
accuracy_scr_max = []
models=[]
#accuracy=[]
std_dev=[]
roc_auc=[]
mean_acc=[]
cross_val=[]
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('Decision Tree Classifier',DecisionTreeClassifier()))
models.append(("GausianNB",GaussianNB()))

for names,model_c in models:
    max_accuracy_scr(names,model_c,df_xc,yc)

**************************************************
Results for model :  Logistic Regression 
 max roc score correspond to random state  0.727315712597147 
 Mean accuracy score is :  0.6696918411779096 
 Std deviation score is :  0.0030322593046897828 
 Cross validation scores are :   [0.67361469 0.66566588 0.66703839 0.67239974 0.66974051]
roc_auc_score: 0.727315712597147
**************************************************
**************************************************
Results for model :  Random Forest 
 max roc score correspond to random state  0.8792762631904103 
 Mean accuracy score is :  0.8117279862602139 
 Std deviation score is :  0.002031698139189051 
 Cross validation scores are :   [0.81043061 0.81162342 0.81158053 0.81115162 0.81616985]
roc_auc_score: 0.8792762631904103
**************************************************
**************************************************
Results for model :  Decision Tree Classifier 
 max roc score correspond to random state  0.7397495282209642 
 Mean accuracy score is :  0.7426399792028343 
 Std deviation score is :  0.0025271129138200485 
 Cross validation scores are :   [0.74288043 0.74162556 0.74149689 0.73870899 0.74462792]
roc_auc_score: 0.7397495282209642
**************************************************
**************************************************
Results for model :  GausianNB 
 max roc score correspond to random state  0.7956111563031266 
 Mean accuracy score is :  0.7158677336619202 
 Std deviation score is :  0.0015884106712636206 
 Cross validation scores are :   [0.71894836 0.71550504 0.71546215 0.71443277 0.71499035]
roc_auc_score: 0.7956111563031266
**************************************************

First Attempt:Random Forest Classifier

# Estimating best n_estimator using grid search for Randomforest Classifier
parameters={"n_estimators":[1,10,100]}
rf_clf=RandomForestClassifier()
clf = GridSearchCV(rf_clf, parameters, cv=5,scoring="roc_auc")
clf.fit(df_xc,yc)
print("Best parameter : ",clf.best_params_,"\nBest Estimator : ", clf.best_estimator_,"\nBest Score : ", clf.best_score_)
Best parameter :  {'n_estimators': 100} 
Best Estimator :  RandomForestClassifier() 
Best Score :  0.8810508979668068
#Again running RFC with n_estimator = 100
rf_clf=RandomForestClassifier(n_estimators=100,random_state=42)
max_accuracy_scr("RandomForest Classifier",rf_clf,df_xc,yc)
**************************************************
Results for model :  RandomForest Classifier 
 max roc score correspond to random state  0.879415808805665 
 Mean accuracy score is :  0.8115392510996895 
 Std deviation score is :  0.0008997445291505284 
 Cross validation scores are :   [0.81180305 0.81136607 0.81106584 0.81037958 0.81308171]
roc_auc_score: 0.879415808805665
**************************************************
xc_train,xc_test,yc_train,yc_test=train_test_split(df_xc, yc,random_state = 80,test_size=0.20,stratify=yc)
rf_clf.fit(xc_train,yc_train)
yc_pred=rf_clf.predict(xc_test)
plt.rcParams['figure.figsize']  = (12,8)
#  Random Forest Classifier Results

pred_pb=rf_clf.predict_proba(xc_test)[:,1]
Fpr,Tpr,thresholds = roc_curve(yc_test,pred_pb,pos_label=True)
auc = roc_auc_score(yc_test,pred_pb)

print(" ROC_AUC score is ",auc)
print("accuracy score is : ",accuracy_score(yc_test,yc_pred))
print("Precision is : " ,precision_score(yc_test, yc_pred))
print("Recall is: " ,recall_score(yc_test, yc_pred))
print("F1 Score is : " ,f1_score(yc_test, yc_pred))
print("classification report \n",classification_report(yc_test,yc_pred))

#Plotting confusion matrix
cnf = confusion_matrix(yc_test,yc_pred)
sns.heatmap(cnf, annot=True, cmap = "magma")
 ROC_AUC score is  0.8804566893762799
accuracy score is :  0.8127466117687425
Precision is :  0.8397949673811743
Recall is:  0.7729456167438669
F1 Score is :  0.8049848132928354
classification report 
               precision    recall  f1-score   support

         0.0       0.79      0.85      0.82     11658
         1.0       0.84      0.77      0.80     11658

    accuracy                           0.81     23316
   macro avg       0.81      0.81      0.81     23316
weighted avg       0.81      0.81      0.81     23316






<AxesSubplot:>

plt.rcParams['figure.figsize']  = (12,6)
#plotting the graph for area under curve for representing accuracy of data
plt.plot([0,1],[1,0],'g--')
plt.plot(Fpr,Tpr)
plt.xlabel('False_Positive_Rate')
plt.ylabel('True_Positive_Rate')
plt.title("Random Forest Classifier")
plt.show()

基于机器学习模型预测信用卡潜在用户(XGBoost、LightGBM和Random Forest)(二):https://developer.aliyun.com/article/1535286

相关文章
|
21天前
|
人工智能 JSON 算法
Qwen2.5-Coder 系列模型在 PAI-QuickStart 的训练、评测、压缩及部署实践
阿里云的人工智能平台 PAI,作为一站式、 AI Native 的大模型与 AIGC 工程平台,为开发者和企业客户提供了 Qwen2.5-Coder 系列模型的全链路最佳实践。本文以Qwen2.5-Coder-32B为例,详细介绍在 PAI-QuickStart 完成 Qwen2.5-Coder 的训练、评测和快速部署。
Qwen2.5-Coder 系列模型在 PAI-QuickStart 的训练、评测、压缩及部署实践
|
5天前
|
编解码 机器人 测试技术
技术实践 | 使用 PAI+LLaMA Factory 微调 Qwen2-VL 模型快速搭建专业领域知识问答机器人
Qwen2-VL是一款具备高级图像和视频理解能力的多模态模型,支持多种语言,适用于多模态应用开发。通过PAI和LLaMA Factory框架,用户可以轻松微调Qwen2-VL模型,快速构建文旅领域的知识问答机器人。本教程详细介绍了从模型部署、微调到对话测试的全过程,帮助开发者高效实现定制化多模态应用。
|
25天前
|
机器学习/深度学习 PyTorch API
优化注意力层提升 Transformer 模型效率:通过改进注意力机制降低机器学习成本
Transformer架构自2017年被Vaswani等人提出以来,凭借其核心的注意力机制,已成为AI领域的重大突破。该机制允许模型根据任务需求灵活聚焦于输入的不同部分,极大地增强了对复杂语言和结构的理解能力。起初主要应用于自然语言处理,Transformer迅速扩展至语音识别、计算机视觉等多领域,展现出强大的跨学科应用潜力。然而,随着模型规模的增长,注意力层的高计算复杂度成为发展瓶颈。为此,本文探讨了在PyTorch生态系统中优化注意力层的各种技术,
50 6
优化注意力层提升 Transformer 模型效率:通过改进注意力机制降低机器学习成本
|
14天前
|
机器学习/深度学习 人工智能 算法
人工智能浪潮下的编程实践:构建你的第一个机器学习模型
在人工智能的巨浪中,每个人都有机会成为弄潮儿。本文将带你一探究竟,从零基础开始,用最易懂的语言和步骤,教你如何构建属于自己的第一个机器学习模型。不需要复杂的数学公式,也不必担心编程难题,只需跟随我们的步伐,一起探索这个充满魔力的AI世界。
33 12
|
21天前
|
机器学习/深度学习 Python
机器学习中评估模型性能的重要工具——混淆矩阵和ROC曲线。混淆矩阵通过真正例、假正例等指标展示模型预测情况
本文介绍了机器学习中评估模型性能的重要工具——混淆矩阵和ROC曲线。混淆矩阵通过真正例、假正例等指标展示模型预测情况,而ROC曲线则通过假正率和真正率评估二分类模型性能。文章还提供了Python中的具体实现示例,展示了如何计算和使用这两种工具来评估模型。
41 8
|
21天前
|
机器学习/深度学习 Python
机器学习中模型选择和优化的关键技术——交叉验证与网格搜索
本文深入探讨了机器学习中模型选择和优化的关键技术——交叉验证与网格搜索。介绍了K折交叉验证、留一交叉验证等方法,以及网格搜索的原理和步骤,展示了如何结合两者在Python中实现模型参数的优化,并强调了使用时需注意的计算成本、过拟合风险等问题。
41 6
|
24天前
|
机器学习/深度学习 数据采集 算法
从零到一:构建高效机器学习模型的旅程####
在探索技术深度与广度的征途中,我深刻体会到技术创新既在于理论的飞跃,更在于实践的积累。本文将通过一个具体案例,分享我在构建高效机器学习模型过程中的实战经验,包括数据预处理、特征工程、模型选择与优化等关键环节,旨在为读者提供一个从零开始构建并优化机器学习模型的实用指南。 ####
|
28天前
|
人工智能 边缘计算 JSON
DistilQwen2 蒸馏小模型在 PAI-QuickStart 的训练、评测、压缩及部署实践
本文详细介绍在 PAI 平台使用 DistilQwen2 蒸馏小模型的全链路最佳实践。
|
25天前
|
机器学习/深度学习 人工智能 算法
探索机器学习中的线性回归模型
本文深入探讨了机器学习中广泛使用的线性回归模型,从其基本概念和数学原理出发,逐步引导读者理解模型的构建、训练及评估过程。通过实例分析与代码演示,本文旨在为初学者提供一个清晰的学习路径,帮助他们在实践中更好地应用线性回归模型解决实际问题。
|
29天前
|
机器学习/深度学习 自然语言处理 语音技术
探索机器学习中的深度学习模型:原理与应用
探索机器学习中的深度学习模型:原理与应用
35 0