In the world of data science, deep learning methods are undoubtedly at the cutting edge of research. New variants are invented and implemented every day, particularly in natural language processing (NLP) and computer vision (CV), where deep learning has made enormous progress in recent years. This trend can also be observed in Kaggle competitions: in NLP and CV competitions, the recent winning solutions have leveraged deep learning models.
However, are deep learning models really better than "traditional" machine learning models such as GBDT (gradient-boosted decision trees)? As mentioned above, we know deep learning models are far better in NLP and CV, but in real life we still have plenty of tabular data. Can we confirm that deep learning models also outperform GBDT models on structured datasets? To answer this question, this article compares the performance of each model on Kaggle's home insurance dataset. I know we cannot conclude which model is better from a single dataset alone, but it will be a good starting point for the comparison. In addition, I will include TabNet, a relatively new deep learning model for tabular data, in the comparison.
The notebook for this experiment can be found in my Kaggle code (https://www.kaggle.com/kyosukemorita/deep-learning-vs-gbdt-model-on-tabular-data). This article will omit the explanation of each algorithm, since there are already plenty of those :)
Code snippets
As mentioned above, this experiment uses the home insurance dataset. It contains home insurance policy data from 2007 to 2012, with more than 100 features covering household characteristics, owner demographics and so on, and more than 250,000 rows. Using this dataset, the experiment tries to predict whether a home insurance policy will lapse. Unfortunately, details of all the variables in this dataset are not provided, but it is good enough for this experiment.
Now I will walk through the code for this experiment. It is kept to a minimum and should be self-explanatory.
First, import the libraries.
```python
import os
import numpy as np
import pandas as pd
import warnings
import time
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score
import xgboost as xgb
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")
```
Then, create a timer function,
```python
def timer(myFunction):
    def functionTimer(*args, **kwargs):
        start_time = time.time()
        result = myFunction(*args, **kwargs)
        end_time = time.time()
        computation_time = round(end_time - start_time, 2)
        print("{} is executed".format(myFunction.__name__))
        print("Computation took: {:.2f} seconds".format(computation_time))
        return result
    return functionTimer
```
so that we can track how long a function takes to run.
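As a quick illustration of how the decorator is applied (`slow_sum` is a hypothetical example function, not part of the experiment; the decorator is repeated here so the snippet runs standalone):

```python
import time

def timer(myFunction):
    # same decorator as above, repeated for a self-contained snippet
    def functionTimer(*args, **kwargs):
        start_time = time.time()
        result = myFunction(*args, **kwargs)
        computation_time = round(time.time() - start_time, 2)
        print("{} is executed".format(myFunction.__name__))
        print("Computation took: {:.2f} seconds".format(computation_time))
        return result
    return functionTimer

@timer
def slow_sum(n):
    # hypothetical workload: sum the integers 0..n-1 in a Python loop
    total = 0
    for i in range(n):
        total += i
    return total

result = slow_sum(1_000_000)
```

Any function decorated with `@timer` behaves exactly as before but prints its name and elapsed wall-clock time on each call.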
Now, here is the preprocessing of the dataset. This process consists of:
- Excluding rows with missing values
- Cleaning the target variable
- Creating dummy variables for the categorical variables
- Creating age features
- Imputing the remaining missing values
```python
def prepareInputs(df: pd.DataFrame) -> pd.DataFrame:
    """Prepare the input for training

    Args:
        df (pd.DataFrame): raw data

    Process:
        1. Exclude missing values
        2. Clean the target variable
        3. Create dummy variables for categorical variables
        4. Create age features
        5. Impute missing values

    Return:
        pd.DataFrame
    """
    # 1. Exclude missing values
    df = df[df["POL_STATUS"].notnull()]
    # 2. Clean the target variable
    df = df[df["POL_STATUS"] != "Unknown"]
    df["lapse"] = np.where(df["POL_STATUS"] == "Lapsed", 1, 0)
    # 3. Create dummy variables for categorical variables
    categorical_cols = [
        "CLAIM3YEARS", "BUS_USE", "AD_BUILDINGS", "APPR_ALARM", "CONTENTS_COVER",
        "P1_SEX", "BUILDINGS_COVER", "P1_POLICY_REFUSED", "APPR_LOCKS", "FLOODING",
        "NEIGH_WATCH", "SAFE_INSTALLED", "SEC_DISC_REQ", "SUBSIDENCE",
        "LEGAL_ADDON_POST_REN", "HOME_EM_ADDON_PRE_REN", "HOME_EM_ADDON_POST_REN",
        "GARDEN_ADDON_PRE_REN", "GARDEN_ADDON_POST_REN", "KEYCARE_ADDON_PRE_REN",
        "KEYCARE_ADDON_POST_REN", "HP1_ADDON_PRE_REN", "HP1_ADDON_POST_REN",
        "HP2_ADDON_PRE_REN", "HP2_ADDON_POST_REN", "HP3_ADDON_PRE_REN",
        "HP3_ADDON_POST_REN", "MTA_FLAG", "OCC_STATUS", "OWNERSHIP_TYPE",
        "PROP_TYPE", "PAYMENT_METHOD", "P1_EMP_STATUS", "P1_MAR_STATUS",
    ]
    for col in categorical_cols:
        dummies = pd.get_dummies(df[col], drop_first=True, prefix=col)
        df = pd.concat([df, dummies], axis=1)
    # 4. Create age features
    df["age"] = (datetime.strptime("2013-01-01", "%Y-%m-%d")
                 - pd.to_datetime(df["P1_DOB"])).dt.days // 365
    df["property_age"] = 2013 - df["YEARBUILT"]
    df["cover_length"] = 2013 - pd.to_datetime(df["COVER_START"]).dt.year
    # 5. Impute missing values
    df["RISK_RATED_AREA_B_imputed"] = df["RISK_RATED_AREA_B"].fillna(df["RISK_RATED_AREA_B"].mean())
    df["RISK_RATED_AREA_C_imputed"] = df["RISK_RATED_AREA_C"].fillna(df["RISK_RATED_AREA_C"].mean())
    df["MTA_FAP_imputed"] = df["MTA_FAP"].fillna(0)
    df["MTA_APRP_imputed"] = df["MTA_APRP"].fillna(0)
    return df
```
To train and evaluate the models, I will split the dataset.
```python
# Split train and test
def splitData(df: pd.DataFrame, FEATS: list):
    """Split the dataframe into train and test

    Args:
        df: preprocessed dataframe
        FEATS: feature list

    Returns:
        X_train, y_train, X_test, y_test
    """
    train, test = train_test_split(df, test_size=0.3, random_state=42)
    train, test = prepareInputs(train), prepareInputs(test)
    return train[FEATS], train["lapse"], test[FEATS], test["lapse"]
```
Training a deep learning model on numerical features requires standardisation. This function does that job.
```python
# Standardise the datasets
def standardiseNumericalFeats(X_train, X_test):
    """Standardise the numerical features

    Returns:
        Standardised X_train and X_test
    """
    numerical_cols = [
        "age", "property_age", "cover_length",
        "RISK_RATED_AREA_B_imputed", "RISK_RATED_AREA_C_imputed",
        "MTA_FAP_imputed", "MTA_APRP_imputed",
        "SUM_INSURED_BUILDINGS", "NCD_GRANTED_YEARS_B",
        "SUM_INSURED_CONTENTS", "NCD_GRANTED_YEARS_C",
        "SPEC_SUM_INSURED", "SPEC_ITEM_PREM", "UNSPEC_HRP_PREM",
        "BEDROOMS", "MAX_DAYS_UNOCC", "LAST_ANN_PREM_GROSS",
    ]
    for col in numerical_cols:
        # fit the scaler on train only, then apply the same transform to test
        scaler = StandardScaler()
        X_train[col] = scaler.fit_transform(X_train[[col]])
        X_test[col] = scaler.transform(X_test[[col]])
    return X_train, X_test
```
Now all the preprocessing is done. Below, I show the training code for each model.
XGBoost

```python
def trainXgbModel(X_train, y_train, X_test, y_test, FEATS, ROUNDS):
    """Train XGBoost model

    Args:
        ROUNDS: Number of training rounds

    Return:
        Model object
    """
    params = {
        'eta': 0.02,
        'max_depth': 10,
        'min_child_weight': 7,
        'subsample': 0.6,
        'objective': 'binary:logistic',
        'eval_metric': 'error',
        'grow_policy': 'lossguide',
    }
    dtrain = xgb.DMatrix(X_train, y_train, feature_names=FEATS)
    dtest = xgb.DMatrix(X_test, y_test, feature_names=FEATS)
    EVAL_LIST = [(dtrain, "train"), (dtest, "test")]
    xgb_model = xgb.train(params, dtrain, ROUNDS, EVAL_LIST)
    return xgb_model
```

MLP (1D-CNN)

```python
def trainD1CnnModel(X_train, y_train, X_test, y_test):
    """Train 1D-CNN model

    Return:
        keras model obj
    """
    # X_test/y_test are passed in explicitly here
    # (the original notebook read them from the global scope)
    d1_cnn_model = keras.Sequential([
        layers.Dense(4096, activation='relu'),
        layers.Reshape((256, 16)),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        layers.Conv1D(filters=16, kernel_size=5, strides=1, activation='relu'),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(16, activation='relu'),
        layers.Dense(1, activation='sigmoid'),
    ])
    d1_cnn_model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=3e-3),
        loss='binary_crossentropy',
        metrics=[keras.metrics.BinaryCrossentropy()],
    )
    early_stopping = keras.callbacks.EarlyStopping(
        patience=25,
        min_delta=0.001,
        restore_best_weights=True,
    )
    d1_cnn_model.fit(
        X_train, y_train,
        batch_size=10000,
        epochs=5000,
        callbacks=[early_stopping],
        validation_data=(X_test, y_test),
    )
    return d1_cnn_model
```

TabNet

```python
def trainTabNetModel(X_train, y_train, X_test, y_test, pretrainer):
    """Train TabNet model

    Args:
        pretrainer: pretrained model. If not using this, use None

    Return:
        TabNet model obj
    """
    tabNet_model = TabNetClassifier(
        n_d=16, n_a=16, n_steps=4, gamma=1.9,
        n_independent=4, n_shared=5, seed=42,
        optimizer_fn=torch.optim.Adam,
        scheduler_params={"milestones": [150, 250, 300, 350, 400, 450], 'gamma': 0.2},
        scheduler_fn=torch.optim.lr_scheduler.MultiStepLR,
    )
    tabNet_model.fit(
        X_train=X_train.to_numpy(),
        y_train=y_train.to_numpy(),
        eval_set=[(X_train.to_numpy(), y_train.to_numpy()),
                  (X_test.to_numpy(), y_test.to_numpy())],
        max_epochs=100,
        batch_size=256,
        patience=10,
        from_unsupervised=pretrainer,
    )
    return tabNet_model
```
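After training, the three models can be compared on the held-out test set with the metrics imported earlier. The helper below is my own sketch, not code from the original notebook; `evaluateModel`, the threshold and the toy labels/scores are made up for illustration, and each model's predicted probabilities would be passed in place of `scores`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix

def evaluateModel(y_true, y_pred_proba, threshold=0.5):
    """Compute the comparison metrics used in this experiment."""
    # binarise predicted probabilities at the given threshold
    y_pred = (np.asarray(y_pred_proba) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_pred_proba),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }

# toy example with made-up labels and scores
y_true = np.array([0, 1, 1, 0, 1])
scores = np.array([0.2, 0.8, 0.6, 0.4, 0.9])
metrics = evaluateModel(y_true, scores)
```

The same dictionary of metrics can then be computed for the XGBoost, 1D-CNN and TabNet predictions, which, together with the timings from `@timer`, gives a like-for-like comparison.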