Predicting Potential Credit Card Customers with Machine Learning Models (XGBoost, LightGBM, and Random Forest)
As data science and machine learning mature, more and more companies are using these techniques to improve operational efficiency. In this post, I share how to use machine learning models to predict potential credit card customers. The project is based on code and files I have organized, and it covers the full workflow: data preprocessing, data visualization, model training, prediction, and saving the results.
Project Overview
This project aims to use machine learning models to predict which customers are most likely to become potential credit card customers. We will use three main models: XGBoost, LightGBM, and Random Forest. The main steps are:
1. Data preprocessing
2. Data visualization
3. Model training
4. Model prediction
5. Model saving
1. Data Preprocessing
Data preprocessing is a crucial step in any machine learning project. By cleaning and preparing the data, we can improve the model's performance and accuracy.
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Loading the dataset
df_train = pd.read_csv("dataset/train_s3TEQDk.csv")
df_train["source"] = "train"
df_test = pd.read_csv("dataset/test_mSzZ8RL.csv")
df_test["source"] = "test"
df = pd.concat([df_train, df_test], ignore_index=True)
df.head()
Checking and Cleaning the Dataset:
# Checking the columns of the dataset
df.columns
Index(['ID', 'Gender', 'Age', 'Region_Code', 'Occupation', 'Channel_Code', 'Vintage', 'Credit_Product', 'Avg_Account_Balance', 'Is_Active', 'Is_Lead', 'source'], dtype='object')
# Checking the shape
df.shape
(351037, 12)
# Checking unique values
df.nunique()
ID                     351037
Gender                      2
Age                        63
Region_Code                35
Occupation                  4
Channel_Code                4
Vintage                    66
Credit_Product              2
Avg_Account_Balance    162137
Is_Active                   2
Is_Lead                     2
source                      2
dtype: int64
# Check for null values
df.isnull().sum()
ID                          0
Gender                      0
Age                         0
Region_Code                 0
Occupation                  0
Channel_Code                0
Vintage                     0
Credit_Product          41847
Avg_Account_Balance         0
Is_Active                   0
Is_Lead                105312
source                      0
dtype: int64
Observation:
Null values are present in the Credit_Product column. The 105,312 nulls in Is_Lead are expected: they correspond to the test rows, which carry no target label.
# Fill null values in the Credit_Product feature with a separate "NA" category
df['Credit_Product'] = df['Credit_Product'].fillna("NA")
# Check for null values again
df.isnull().sum()
ID                          0
Gender                      0
Age                         0
Region_Code                 0
Occupation                  0
Channel_Code                0
Vintage                     0
Credit_Product              0
Avg_Account_Balance         0
Is_Active                   0
Is_Lead                105312
source                      0
dtype: int64
# Checking datatypes and general info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351037 entries, 0 to 351036
Data columns (total 12 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   ID                   351037 non-null  object
 1   Gender               351037 non-null  object
 2   Age                  351037 non-null  int64
 3   Region_Code          351037 non-null  object
 4   Occupation           351037 non-null  object
 5   Channel_Code         351037 non-null  object
 6   Vintage              351037 non-null  int64
 7   Credit_Product       351037 non-null  object
 8   Avg_Account_Balance  351037 non-null  int64
 9   Is_Active            351037 non-null  object
 10  Is_Lead              245725 non-null  float64
 11  source               351037 non-null  object
dtypes: float64(1), int64(3), object(8)
memory usage: 32.1+ MB
# Changing Yes to 1 and No to 0 in the Is_Active column to convert it to numeric
df["Is_Active"] = df["Is_Active"].replace(["Yes", "No"], [1, 0]).astype(float)
df.head()
# Now converting all categorical columns to numerical form using label encoding
cat_col = ['Gender', 'Region_Code', 'Occupation', 'Channel_Code', 'Credit_Product']
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in cat_col:
    df[col] = le.fit_transform(df[col])
df_2 = df
df_2.head()
# Separating the train and test rows again (copy to avoid SettingWithCopyWarning)
df_train = df_2.loc[df_2["source"] == "train"].copy()
df_test = df_2.loc[df_2["source"] == "test"].copy()
df_1 = df_train
# We can drop these columns: they are identifiers with no predictive value
df_1.drop(columns=['ID', "source"], inplace=True)
df_1.head()
2. Data Visualization
Data visualization helps us better understand the distribution and characteristics of the data. Below are some common visualizations:
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 16
sns.set_style("whitegrid")
# Distribution of Age (distplot is deprecated in recent seaborn; histplot is the replacement)
sns.histplot(df['Age'], kde=True)
plt.show()

# Distribution of Avg_Account_Balance
sns.histplot(df['Avg_Account_Balance'], kde=True)
plt.show()
# Countplot for the Gender feature
sns.countplot(x='Gender', data=df, palette='Accent')
plt.show()
# Countplot for the target variable, 'Is_Lead'
target = 'Is_Lead'
sns.countplot(x=target, data=df, palette='hls')
print(df[target].value_counts())
0.0    187437
1.0     58288
Name: Is_Lead, dtype: int64
plt.rcParams['figure.figsize'] = (12,6)
# Lead status by customer occupation
sns.countplot(x='Occupation', hue='Is_Lead', data=df, palette='magma')
plt.show()
# Mean age by occupation, split by whether the customer was active in the last 3 months
sns.catplot(y='Age', x='Occupation', hue='Is_Active', data=df, kind='bar', palette='Oranges')
plt.show()
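One additional plot that often helps at this stage (not part of the original notebook) is a correlation heatmap of the label-encoded features; a minimal sketch, assuming df_1 from the preprocessing step above:

# Correlation heatmap of the encoded training features (illustrative addition)
plt.figure(figsize=(10, 8))
sns.heatmap(df_1.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlations")
plt.show()

Since every column of df_1 is numeric after label encoding, .corr() works directly and also shows each feature's linear relationship with Is_Lead.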
3. Model Training
We will train three models: XGBoost, LightGBM, and Random Forest. This part compares several baseline algorithms and tunes the Random Forest; XGBoost and LightGBM follow in part two (see the sketch at the end of this section). The training code is below:
# To balance the dataset, we will apply undersampling
from sklearn.utils import resample

# Separate the minority and majority classes
df_majority = df_1[df_1['Is_Lead'] == 0]
df_minority = df_1[df_1['Is_Lead'] == 1]
print("The majority class values are", len(df_majority))
print("The minority class values are", len(df_minority))
print("The ratio of both classes is", len(df_majority) / len(df_minority))
The majority class values are 187437
The minority class values are 58288
The ratio of both classes is 3.215704776283283
# Undersample the majority class (replace=False: we sample down without replacement)
df_majority_undersampled = resample(df_majority,
                                    replace=False,
                                    n_samples=len(df_minority),
                                    random_state=0)
# Combine the minority class with the undersampled majority class
df_undersampled = pd.concat([df_minority, df_majority_undersampled])
df_undersampled['Is_Lead'].value_counts()
df_1 = df_undersampled
# Display the new class balance
print("The undersampled class values count is:", len(df_undersampled))
print("The ratio of both classes is",
      len(df_undersampled[df_undersampled["Is_Lead"] == 0]) / len(df_undersampled[df_undersampled["Is_Lead"] == 1]))
The undersampled class values count is: 116576
The ratio of both classes is 1.0
# Dropping the target variable
# Assign X and y for the training and testing phase
xc = df_1.drop(columns=['Is_Lead'])
yc = df_1[["Is_Lead"]]
df_1.head()
# Importing the necessary libraries
from sklearn import metrics
from scipy.stats import zscore
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.decomposition import PCA
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             precision_score, recall_score, f1_score,
                             roc_auc_score, roc_curve, auc)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Standardizing X with StandardScaler so all features are on a comparable scale
sc = StandardScaler()
df_xc = pd.DataFrame(sc.fit_transform(xc), columns=xc.columns)
df_xc.head()
# Helper that fits a model on a stratified hold-out split and reports its scores
def max_accuracy_scr(names, model_c, df_xc, yc):
    roc_scr_max = 0
    train_xc, test_xc, train_yc, test_yc = train_test_split(
        df_xc, yc, random_state=42, test_size=0.2, stratify=yc)
    model_c.fit(train_xc, train_yc)
    pred = model_c.predict_proba(test_xc)[:, 1]
    roc_score = roc_auc_score(test_yc, pred)
    accuracy_scr = accuracy_score(test_yc, model_c.predict(test_xc))
    if roc_score > roc_scr_max:
        roc_scr_max = roc_score
        final_model = model_c
        # 5-fold cross-validated accuracy, computed once and reused
        cross_val = cross_val_score(final_model, df_xc, yc, cv=5, scoring="accuracy")
        mean_acc = cross_val.mean()
        std_dev = cross_val.std()
    print("*" * 50)
    print("Results for model : ", names, '\n',
          "Max ROC-AUC score on the hold-out split : ", roc_scr_max, '\n',
          "Mean accuracy score is : ", mean_acc, '\n',
          "Std deviation score is : ", std_dev, '\n',
          "Cross validation scores are : ", cross_val)
    print(f"roc_auc_score: {roc_score}")
    print("*" * 50)
# Comparing multiple algorithms to find the one that performs best on our dataset
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('Decision Tree Classifier', DecisionTreeClassifier()))
models.append(("GaussianNB", GaussianNB()))

for names, model_c in models:
    max_accuracy_scr(names, model_c, df_xc, yc)
**************************************************
Results for model :  Logistic Regression
 Max ROC-AUC score on the hold-out split :  0.727315712597147
 Mean accuracy score is :  0.6696918411779096
 Std deviation score is :  0.0030322593046897828
 Cross validation scores are :  [0.67361469 0.66566588 0.66703839 0.67239974 0.66974051]
roc_auc_score: 0.727315712597147
**************************************************
**************************************************
Results for model :  Random Forest
 Max ROC-AUC score on the hold-out split :  0.8792762631904103
 Mean accuracy score is :  0.8117279862602139
 Std deviation score is :  0.002031698139189051
 Cross validation scores are :  [0.81043061 0.81162342 0.81158053 0.81115162 0.81616985]
roc_auc_score: 0.8792762631904103
**************************************************
**************************************************
Results for model :  Decision Tree Classifier
 Max ROC-AUC score on the hold-out split :  0.7397495282209642
 Mean accuracy score is :  0.7426399792028343
 Std deviation score is :  0.0025271129138200485
 Cross validation scores are :  [0.74288043 0.74162556 0.74149689 0.73870899 0.74462792]
roc_auc_score: 0.7397495282209642
**************************************************
**************************************************
Results for model :  GaussianNB
 Max ROC-AUC score on the hold-out split :  0.7956111563031266
 Mean accuracy score is :  0.7158677336619202
 Std deviation score is :  0.0015884106712636206
 Cross validation scores are :  [0.71894836 0.71550504 0.71546215 0.71443277 0.71499035]
roc_auc_score: 0.7956111563031266
**************************************************
First Attempt: Random Forest Classifier
# Estimating the best n_estimators via grid search for the Random Forest classifier
parameters = {"n_estimators": [1, 10, 100]}
rf_clf = RandomForestClassifier()
clf = GridSearchCV(rf_clf, parameters, cv=5, scoring="roc_auc")
clf.fit(df_xc, yc)
print("Best parameter : ", clf.best_params_,
      "\nBest Estimator : ", clf.best_estimator_,
      "\nBest Score : ", clf.best_score_)
Best parameter :  {'n_estimators': 100}
Best Estimator :  RandomForestClassifier()
Best Score :  0.8810508979668068
# Re-running the Random Forest classifier with n_estimators=100
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
max_accuracy_scr("RandomForest Classifier", rf_clf, df_xc, yc)
**************************************************
Results for model :  RandomForest Classifier
 Max ROC-AUC score on the hold-out split :  0.879415808805665
 Mean accuracy score is :  0.8115392510996895
 Std deviation score is :  0.0008997445291505284
 Cross validation scores are :  [0.81180305 0.81136607 0.81106584 0.81037958 0.81308171]
roc_auc_score: 0.879415808805665
**************************************************
xc_train, xc_test, yc_train, yc_test = train_test_split(
    df_xc, yc, random_state=80, test_size=0.20, stratify=yc)
rf_clf.fit(xc_train, yc_train)
yc_pred = rf_clf.predict(xc_test)
plt.rcParams['figure.figsize'] = (12,8)
# Random Forest classifier results
pred_pb = rf_clf.predict_proba(xc_test)[:, 1]
Fpr, Tpr, thresholds = roc_curve(yc_test, pred_pb, pos_label=1)
roc_auc = roc_auc_score(yc_test, pred_pb)  # renamed to avoid shadowing sklearn's auc()
print("ROC_AUC score is ", roc_auc)
print("accuracy score is : ", accuracy_score(yc_test, yc_pred))
print("Precision is : ", precision_score(yc_test, yc_pred))
print("Recall is: ", recall_score(yc_test, yc_pred))
print("F1 Score is : ", f1_score(yc_test, yc_pred))
print("classification report \n", classification_report(yc_test, yc_pred))

# Plotting the confusion matrix
cnf = confusion_matrix(yc_test, yc_pred)
sns.heatmap(cnf, annot=True, cmap="magma")
ROC_AUC score is  0.8804566893762799
accuracy score is :  0.8127466117687425
Precision is :  0.8397949673811743
Recall is:  0.7729456167438669
F1 Score is :  0.8049848132928354
classification report
               precision    recall  f1-score   support

         0.0       0.79      0.85      0.82     11658
         1.0       0.84      0.77      0.80     11658

    accuracy                           0.81     23316
   macro avg       0.81      0.81      0.81     23316
weighted avg       0.81      0.81      0.81     23316
plt.rcParams['figure.figsize'] = (12,6)
# Plotting the ROC curve for the Random Forest classifier
plt.plot([0, 1], [0, 1], 'g--')  # chance line
plt.plot(Fpr, Tpr)
plt.xlabel('False_Positive_Rate')
plt.ylabel('True_Positive_Rate')
plt.title("Random Forest Classifier")
plt.show()
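The XGBoost and LightGBM runs, along with prediction and model saving, are covered in part two (linked below). As a preview, here is a minimal sketch of how both models would plug into the same max_accuracy_scr harness, assuming the xgboost and lightgbm packages are installed; the parameters shown are illustrative defaults, not the tuned values from part two:

# Minimal sketch: scoring XGBoost and LightGBM with the same helper
# (assumes `pip install xgboost lightgbm`; parameters are illustrative)
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

boosted_models = [
    ("XGBoost", XGBClassifier(n_estimators=100, eval_metric="logloss")),
    ("LightGBM", LGBMClassifier(n_estimators=100)),
]
for names, model_c in boosted_models:
    max_accuracy_scr(names, model_c, df_xc, yc)

Both classifiers expose the scikit-learn fit/predict_proba interface, so the helper defined above works unchanged.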
Predicting Potential Credit Card Customers with Machine Learning Models (XGBoost, LightGBM, and Random Forest), Part 2: https://developer.aliyun.com/article/1535286