3.3.6 Feature Encoding
Label encoding: the encoded columns can be fed directly into tree models
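from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

# Label-encode employmentTitle, postCode, title and subGrade.
# High-cardinality categorical features need to be converted to integer codes.
for col in tqdm(['employmentTitle', 'postCode', 'title', 'subGrade']):
    le = LabelEncoder()
    # Fit on train and test together so that every category in either set gets a code
    le.fit(list(data_train[col].astype(str).values) + list(data_test_a[col].astype(str).values))
    data_train[col] = le.transform(list(data_train[col].astype(str).values))
    data_test_a[col] = le.transform(list(data_test_a[col].astype(str).values))
print('Label Encoding done')
100%|██████████| 4/4 [00:08<00:00, 2.04s/it]
Label Encoding done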
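To make the encoding step concrete, here is a minimal from-scratch sketch of what label encoding does and why the encoder is fitted on the training and test values together. The helper names and the sample values are illustrative assumptions, not part of the original notebook or of scikit-learn.

```python
def fit_label_encoder(*columns):
    """Map every distinct value (as a string) found in any column to an integer code."""
    classes = sorted({str(v) for col in columns for v in col})
    return {label: code for code, label in enumerate(classes)}


def transform(column, mapping):
    """Replace each value by its code; raises KeyError for a value unseen during fit."""
    return [mapping[str(v)] for v in column]


# Hypothetical sample values
train_grades = ["A1", "B2", "A1", "C3"]
test_grades = ["B2", "D4"]

# Fitting on both sets guarantees that "D4", which only occurs in the test set, has a code
mapping = fit_label_encoder(train_grades, test_grades)
train_codes = transform(train_grades, mapping)
test_codes = transform(test_grades, mapping)
print(mapping, train_codes, test_codes)

# Fitting on the training set alone fails on the unseen test category
train_only = fit_label_encoder(train_grades)
try:
    transform(test_grades, train_only)
    unseen_error = False
except KeyError:
    unseen_error = True
```

The integer codes carry no meaningful order, which is acceptable for tree models but usually not for linear models.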
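Additional feature engineering required by models such as logistic regression
- Normalize the features and remove highly correlated features
- Normalization helps training converge faster and more reliably, and prevents features with large numeric ranges from dominating features with small ranges
- Removing correlated features improves the interpretability of the model and speeds up prediction.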
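# Example of min-max normalization
features_to_scale = []  # fill in the names of the features to normalize
for fea in features_to_scale:
    fea_min, fea_max = np.min(data[fea]), np.max(data[fea])
    data[fea] = (data[fea] - fea_min) / (fea_max - fea_min)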
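As a small runnable illustration of the normalization formula, the sketch below applies min-max scaling to a single column. The function name, the sample values, and the choice to map a constant column to zeros (to avoid dividing by zero) are assumptions made for this example.

```python
import numpy as np


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column is mapped to all zeros."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # max - min would be zero, so the formula is undefined for a constant column
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)


scaled = min_max_scale([1000.0, 5000.0, 9000.0])  # hypothetical loan amounts
constant = min_max_scale([3.0, 3.0, 3.0])
print(scaled, constant)
```

In practice the minimum and maximum should come from the training set and be reused for the test set, so that both are scaled consistently.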
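3.3.7 Feature Selection
- Feature selection techniques prune useless features to reduce the complexity of the final model. The ultimate goal is a parsimonious model that computes faster with no loss, or only a small loss, of predictive accuracy. Feature selection is not about reducing training time (some techniques actually increase total training time); it is about reducing model scoring time.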
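Feature selection methods:
- 1 Filter
- Variance threshold
- Correlation coefficient (Pearson correlation coefficient)
- Chi-square test
- Mutual information
- 2 Wrapper (RFE)
- Recursive feature elimination
- 3 Embedded
- Penalty-based feature selection
- Tree-model-based feature selection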
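Filter
- Features are screened based on the relationships between features, and between each feature and the target
Variance threshold
- First compute the variance of each feature, then keep the features whose variance is greater than a chosen threshold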
from sklearn.feature_selection import VarianceThreshold

# threshold is the variance cutoff; features whose variance is not above it are dropped
VarianceThreshold(threshold=3).fit_transform(train, target_train)
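The rule can be sketched without scikit-learn as follows. This is only an illustration under assumed column names and values; it uses the population variance (dividing by n), which is also what NumPy's default variance computes.

```python
from statistics import pvariance


def select_by_variance(columns, threshold):
    """Return the names of columns whose population variance exceeds the threshold."""
    variances = {name: pvariance(values) for name, values in columns.items()}
    kept = [name for name, var in variances.items() if var > threshold]
    return kept, variances


# Hypothetical feature columns
columns = {
    "almost_constant": [1, 1, 1, 1, 2],
    "spread_out": [0, 5, 10, 15, 20],
}
kept, variances = select_by_variance(columns, threshold=3)
print(kept, variances)
```

Because variance depends on the scale of a feature, a single threshold is only meaningful when the features are on comparable scales.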
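Correlation coefficient
- Pearson correlation coefficient
The Pearson correlation coefficient is one of the simplest ways to understand the relationship between a feature and the response variable; it measures the linear correlation between two variables.
Its value lies in [-1, 1]: -1 means perfect negative correlation, +1 means perfect positive correlation, and 0 means no linear correlation.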
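import numpy as np
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr

# Select the K best features and return the data restricted to them.
# The first argument is the scoring function: it takes the feature matrix and the target vector
# and returns a pair of arrays (scores, p-values) whose i-th entries belong to the i-th feature.
# Here it is defined as the absolute Pearson correlation with the target.
def pearson_score(X, y):
    results = [pearsonr(X[:, i], y) for i in range(X.shape[1])]
    scores = np.array([abs(r[0]) for r in results])
    p_values = np.array([r[1] for r in results])
    return scores, p_values

# k is the number of features to keep
SelectKBest(pearson_score, k=5).fit_transform(train, target_train)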
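To show what the score means, the following sketch computes the Pearson coefficient by hand and keeps the k features with the largest absolute correlation to the target. The function names and the toy columns are assumptions for illustration only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def top_k_by_abs_corr(columns, target, k):
    """Rank features by |r| with the target and return the k strongest names."""
    scores = {name: abs(pearson_r(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical data: two features are linearly related to the target, one is not
target = [0, 0, 1, 1]
columns = {
    "rises_with_target": [1.0, 2.0, 3.0, 4.0],
    "falls_with_target": [8.0, 6.0, 4.0, 2.0],
    "unrelated": [1.0, -1.0, -1.0, 1.0],
}
selected = top_k_by_abs_corr(columns, target, k=2)
print(selected)
```

Ranking by the absolute value matters: a strongly negative correlation is as informative as a strongly positive one.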
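Chi-square test
- The classical chi-square test checks whether the dependent variable depends on an independent variable. Suppose the independent variable takes N distinct values and the dependent variable takes M distinct values. For every pair (i, j), compare the observed number of samples whose independent variable equals i and whose dependent variable equals j with the number expected under independence. The statistic is $\chi^2 = \sum \frac{(A - T)^2}{T}$, where A is the observed count and T is the expected (theoretical) count.
- (Note: chi2 can only be applied to non-negative feature values; otherwise it raises the error "Input X must be non-negative".)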
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# k is the number of features to keep
SelectKBest(chi2, k=5).fit_transform(train, target_train)
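The statistic can be worked through on a small contingency table. The sketch below is a plain implementation of the classical formula, with the expected count T taken as row total times column total divided by the grand total; the table values are made up for illustration, and this is not how scikit-learn's chi2 is implemented internally.

```python
def chi_square(observed):
    """Chi-square statistic of a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, a in enumerate(row):
            # Expected count under independence of the two variables
            t = row_totals[i] * col_totals[j] / total
            stat += (a - t) ** 2 / t
    return stat


# Rows: two values of a categorical feature; columns: isDefault = 0 / 1 (hypothetical counts)
observed = [[10, 20],
            [30, 40]]
stat = chi_square(observed)
print(stat)
```

The larger the statistic, the further the observed counts are from what independence would predict, and the more useful the feature is likely to be.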
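Mutual information
- Classical mutual information also measures how strongly the dependent variable depends on an independent variable. The SelectKBest class of the feature_selection module can be combined with the maximal information coefficient (MIC) to select features; the code is as follows: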
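import numpy as np
from sklearn.feature_selection import SelectKBest
from minepy import MINE

# MINE is not designed as a function, so the mic helper wraps it in one.
# It returns a pair whose second item is a fixed placeholder p-value of 0.5.
def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)

# Score every feature column and return (scores, p-values) as two arrays
def mic_score(X, y):
    scores, p_values = zip(*[mic(x, y) for x in X.T])
    return np.array(scores), np.array(p_values)

# k is the number of features to keep
SelectKBest(mic_score, k=2).fit_transform(train, target_train)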
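For intuition, here is a sketch of plain mutual information between two discrete variables, computed from empirical frequencies. Note that this is ordinary mutual information in nats, not the maximal information coefficient computed by minepy; the function name and the sample data are assumptions.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """I(X;Y) = sum over (a, b) of p(a, b) * log(p(a, b) / (p(a) * p(b))), in nats."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_xy = c / n
        # p(a, b) / (p(a) * p(b)) rewritten with raw counts
        mi += p_xy * log(c * n / (count_x[a] * count_y[b]))
    return mi


x = [0, 0, 1, 1]
mi_identical = mutual_information(x, x)               # y fully determined by x
mi_independent = mutual_information(x, [0, 1, 0, 1])  # y carries no information about x
print(mi_identical, mi_independent)
```

Unlike the Pearson coefficient, mutual information also detects non-linear dependence, which is why it is listed as a separate filter method.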
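Wrapper (recursive feature elimination, RFE)
- Recursive feature elimination: a base model is trained over multiple rounds. After each round, the features with the smallest weight coefficients are eliminated, and the next round is trained on the remaining features. The RFE class of the feature_selection module can be used to select features; the code is as follows (using logistic regression as an example):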
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursive feature elimination; returns the data restricted to the selected features
# estimator is the base model
# n_features_to_select is the number of features to keep
RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(train, target_train)
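The elimination loop itself can be sketched in a few lines. As a simplifying assumption, this example uses ordinary least squares as the base model instead of logistic regression and removes one feature per round; the function name and the synthetic data are illustrative only.

```python
import numpy as np


def rfe_linear(X, y, n_features_to_select):
    """Refit least squares each round and drop the feature with the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))
        remaining.pop(weakest)
    return remaining


# Synthetic data: only columns 0 and 2 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]

selected = rfe_linear(X, y, n_features_to_select=2)
print(selected)
```

Comparing coefficient magnitudes is only fair when the features share a common scale, which is one more reason to normalize before using a linear base model.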
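Embedded
- Penalty-based feature selection: using a base model with a penalty term not only selects features but also reduces dimensionality. The SelectFromModel class of the feature_selection module can be combined with a logistic regression model to select features; the code is as follows: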
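from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Feature selection with L1-penalized logistic regression as the base model.
# The L1 penalty needs a solver that supports it, such as liblinear.
SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(train, target_train)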
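- Tree-model-based feature selection: among tree models, GBDT can also serve as the base model for feature selection. The SelectFromModel class of the feature_selection module can be combined with a GBDT model to select features; the code is as follows: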
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier

# Feature selection with GBDT as the base model
SelectFromModel(GradientBoostingClassifier()).fit_transform(train, target_train)
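Both embedded methods reduce to the same final step: the fitted model assigns each feature a weight (a coefficient or an importance), and features whose weight magnitude reaches a threshold are kept. The sketch below illustrates only that step, with made-up feature names and weights; the two thresholds shown, a tiny cutoff for L1 coefficients and the mean for tree importances, mirror what is understood to be scikit-learn's default behavior and should be treated as an assumption.

```python
def select_from_scores(names, scores, threshold="mean"):
    """Keep features whose absolute score is at least the threshold ('mean' or a number)."""
    magnitudes = [abs(s) for s in scores]
    if threshold == "mean":
        cutoff = sum(magnitudes) / len(magnitudes)
    else:
        cutoff = float(threshold)
    return [name for name, m in zip(names, magnitudes) if m >= cutoff]


# Hypothetical feature names and fitted weights
names = ["feature_a", "feature_b", "feature_c", "feature_d"]
l1_coefficients = [0.0, 0.8, -0.3, 0.0]       # L1 drives some coefficients to exactly zero
tree_importances = [0.10, 0.55, 0.30, 0.05]   # importances of a tree ensemble

kept_l1 = select_from_scores(names, l1_coefficients, threshold=1e-5)
kept_tree = select_from_scores(names, tree_importances, threshold="mean")
print(kept_l1, kept_tree)
```

With an L1 penalty the selection is a by-product of training, since the penalty itself zeroes out weak coefficients; with tree models the threshold decides how aggressive the pruning is.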
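For this dataset, we drop the features that will not enter the model, fill in the missing values, compute correlation coefficients to see how the features are correlated, and then train the model.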
# Drop the columns that are not needed
for data in [data_train, data_test_a]:
    data.drop(['issueDate', 'id'], axis=1, inplace=True)
# Forward fill down each column: replace a missing value with the value in the row above it
data_train = data_train.ffill(axis=0)
# errors='ignore' because 'id' has already been dropped above
x_train = data_train.drop(['isDefault', 'id'], axis=1, errors='ignore')
# Correlation of each feature with the target
data_corr = x_train.corrwith(data_train.isDefault)
result = pd.DataFrame(columns=['features', 'corr'])
result['features'] = data_corr.index
result['corr'] = data_corr.values
# The correlations can also be inspected directly as a heatmap
data_numeric = data_train[numerical_fea]
correlation = data_numeric.corr()
f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)
[Figure: heatmap of the correlations among the numeric features (output_81_1.png)]
features = [f for f in data_train.columns if f not in ['id','issueDate','isDefault'] and '_outliers' not in f]
x_train = data_train[features]
x_test = data_test_a[features]
y_train = data_train['isDefault']
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2020
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])  # out-of-fold predictions for the training set
    test = np.zeros(test_x.shape[0])    # test predictions averaged over the folds

    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': 2020,
                'nthread': 28,
                'n_jobs':24,
                'silent': True,
                'verbose': -1,
            }

            # verbose_eval / early_stopping_rounds are lgb.train arguments in LightGBM 3.x;
            # LightGBM 4.x takes the lgb.log_evaluation and lgb.early_stopping callbacks instead
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)

            # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])

        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)  # Booster.predict expects a DMatrix, not a DataFrame

            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.04,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }

            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]

            # ntree_limit / best_ntree_limit belong to the XGBoost 1.x API
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)

        if clf_name == "cat":
            params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}

            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=500)

            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)

        train[valid_index] = val_pred
        # Accumulate so that test ends up as the average of the per-fold predictions
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))

        print(cv_scores)

    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test
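The bookkeeping that a K-fold routine such as cv_model is meant to perform can be isolated in a toy sketch: every training row receives a prediction from the one model that did not see it (the out-of-fold prediction), and the test prediction is the average over the per-fold models. The "model" below simply predicts the mean training label, and the fold splitter, data, and names are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, n_splits, seed):
    """Shuffle the row indices and deal them into n_splits validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def fit_mean_model(targets):
    """Stand-in for a real learner: always predicts the mean of its training labels."""
    mean = sum(targets) / len(targets)
    return lambda rows: [mean] * len(rows)


# Hypothetical data
train_x = [[float(i)] for i in range(10)]
train_y = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
test_x = [[0.5], [7.5]]

n_splits = 5
oof = [0.0] * len(train_x)         # out-of-fold predictions for the training rows
test_pred = [0.0] * len(test_x)    # running average of the per-fold test predictions

for valid_idx in k_fold_indices(len(train_x), n_splits, seed=2020):
    valid_set = set(valid_idx)
    train_idx = [i for i in range(len(train_x)) if i not in valid_set]
    model = fit_mean_model([train_y[i] for i in train_idx])

    # Each row is predicted only by the model that was not trained on it
    for i, p in zip(valid_idx, model([train_x[i] for i in valid_idx])):
        oof[i] = p

    # Add this fold's share of the average
    fold_pred = model(test_x)
    test_pred = [acc + p / n_splits for acc, p in zip(test_pred, fold_pred)]

print(oof, test_pred)
```

Out-of-fold predictions are useful beyond scoring: they can serve as inputs to a second-level model without leaking the training labels.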
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.749225 valid_1's auc: 0.729679
[400] training's auc: 0.765075 valid_1's auc: 0.730496
[600] training's auc: 0.778745 valid_1's auc: 0.730435
Early stopping, best iteration is:
[455] training's auc: 0.769202 valid_1's auc: 0.730686
[0.7306859913754798]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.749221 valid_1's auc: 0.731315
[400] training's auc: 0.765117 valid_1's auc: 0.731658
[600] training's auc: 0.778542 valid_1's auc: 0.731333
Early stopping, best iteration is:
[407] training's auc: 0.765671 valid_1's auc: 0.73173
[0.7306859913754798, 0.7317304414673989]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.748436 valid_1's auc: 0.732775
[400] training's auc: 0.764216 valid_1's auc: 0.733173
Early stopping, best iteration is:
[386] training's auc: 0.763261 valid_1's auc: 0.733261
[0.7306859913754798, 0.7317304414673989, 0.7332610441015461]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.749631 valid_1's auc: 0.728327
[400] training's auc: 0.765139 valid_1's auc: 0.728845
Early stopping, best iteration is:
[286] training's auc: 0.756978 valid_1's auc: 0.728976
[0.7306859913754798, 0.7317304414673989, 0.7332610441015461, 0.7289759386807912]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.748414 valid_1's auc: 0.732727
[400] training's auc: 0.763727 valid_1's auc: 0.733531
[600] training's auc: 0.777489 valid_1's auc: 0.733566
Early stopping, best iteration is:
[524] training's auc: 0.772372 valid_1's auc: 0.733772
[0.7306859913754798, 0.7317304414673989, 0.7332610441015461, 0.7289759386807912, 0.7337723979789789]
lgb_scotrainre_list: [0.7306859913754798, 0.7317304414673989, 0.7332610441015461, 0.7289759386807912, 0.7337723979789789]
lgb_score_mean: 0.7316851627208389
lgb_score_std: 0.0017424259863954693
testA_result = pd.read_csv('../testA_result.csv')
roc_auc_score(testA_result['isDefault'].values, lgb_test)
0.7290917729487896
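The number above is an AUC. As a reminder of what that metric measures, the sketch below computes it directly from its definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are assumptions; for real data roc_auc_score is far more efficient.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(labels, scores)

# AUC depends only on the ranking, so rescaling the scores does not change it
auc_rescaled = auc_by_pairs(labels, [s / 5 for s in scores])
print(auc, auc_rescaled)
```

Because only the ordering of the scores matters, AUC can be computed on raw model outputs without calibrating them into probabilities.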
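3.4 Summary
Feature engineering is the most important part of machine learning, and even of deep learning, and in practice it is often the step that takes the most time. Algorithm books tend to say very little about it, because feature engineering is tied so closely to the specific data that it is hard to cover every scenario systematically. This chapter introduces it through some commonly used methods; for example, the treatments of missing values and outliers should, we believe, apply to any dataset. For operations such as binning, the chapter offers several concrete approaches, which readers need to explore for themselves. Feature engineering in competitions also differs from feature engineering in real applications: when building an actual financial risk-control scorecard, feature interpretability is emphasized, so feature binning is especially important.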