I. Battery Data Anomaly Detection with LightGBM
1. Competition overview
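The automotive industry is undergoing a major transformation. As the new energy vehicle market keeps growing, battery safety is drawing increasing attention. Battery anomaly detection faces several problems: poor vehicle data quality, low detection rates, high false-alarm rates, and large numbers of invalid alarms that cannot be fed directly into automated operations and maintenance.
To better examine battery safety, the competition solicits anomaly detection solutions that process real-vehicle data with feature extraction, parameter optimization, anomaly comparison, and similar techniques. The aim is to improve detection results so they can be applied to scenarios such as vehicle early warning and failure-mode identification.
Fault detection for new energy vehicle batteries matters for finding vehicle problems in time, removing hidden hazards, and protecting lives and property. Battery faults come in many forms, including thermal runaway, lithium plating, and electrolyte leakage. The competition data contains several fault types, but all of them carry the same fault label "1" with no further distinction.
Fault detection usually suffers from a shortage of fault labels. In this competition the data has been filtered so that the ratio of normal to faulty samples is not extremely skewed. Even so, conventional anomaly detection algorithms remain a very good choice.
Battery data is time-series data. The 'timestamp' column holds the timestamp of each record, and understanding how to process time series may improve anomaly detection. In addition, anomalous data may look quite "normal", so more features need to be extracted for analysis. To take part, see the competition page: vLoong能源AI挑战赛 (vLoong Energy AI Challenge).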
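As an illustration of what a "conventional anomaly detection algorithm" can look like, here is a minimal sketch of a robust z-score detector based on the median and the median absolute deviation (MAD). It is not the method used in this post, which trains a supervised LightGBM classifier. The data is synthetic, and the function names and the 3.5 threshold are assumptions chosen for the example.

```python
import numpy as np

def robust_z_scores(X):
    """Per-feature robust z-scores based on the median and the MAD."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    mad = np.median(np.abs(X - median), axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # guard against constant features
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    return 0.6745 * (X - median) / mad

def flag_anomalies(X, threshold=3.5):
    """Flag a row when any feature's |robust z| exceeds the threshold."""
    return (np.abs(robust_z_scores(X)) > threshold).any(axis=1)

rng = np.random.default_rng(0)
# Synthetic (volt, current) pairs; the values are made up for the example
normal = rng.normal(loc=[160.0, -5.0], scale=[2.0, 1.0], size=(200, 2))
faulty = np.array([[120.0, -5.0], [160.0, -60.0]])  # two injected outliers
flags = flag_anomalies(np.vstack([normal, faulty]))
print("injected outliers flagged:", flags[-2:])
print("share of normal rows flagged:", flags[:-2].mean())
```

A detector like this needs no labels, which is why such methods stay useful when fault labels are scarce.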
II. Data processing
1. Unzipping the data
# ! unzip -qoa data/data168245/Train.zip
# ! unzip -qoa data/data168245/Test_A.zip
import numpy as np
from glob import glob
import pandas as pd
import random
from tqdm import tqdm
from sklearn import metrics
import pickle
import warnings
warnings.filterwarnings("ignore")
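2. Loading the pkl files
The training labels are stored inside the pkl files, which lets us tell normal segments from anomalous ones. Note that you need to supply the path to the training set.
# Get the list of training files
data_path = 'Train'  # directory that holds the data
train_pkl_files = glob(data_path + '/*.pkl')
print("Number of files", len(train_pkl_files))
print(train_pkl_files[:3])
Number of files 28389
['Train/20291.pkl', 'Train/22017.pkl', 'Train/4823.pkl']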
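3. Inspecting the data
Each file is a .pkl file whose content is a tuple of the form (data, metadata):
- data: shape (256, 8); the columns correspond to the features ['volt', 'current', 'soc', 'max_single_volt', 'min_single_volt', 'max_temp', 'min_temp', 'timestamp']
- metadata: contains label and mileage; in label, '00' denotes a normal segment and '10' an anomalous segment.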
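The layout described above can be sketched with a synthetic sample. The array values below are made up, and the sample goes through pickle in memory rather than through a real competition file. The `FEATURES` list and variable names are assumptions for this example.

```python
import pickle
import numpy as np

FEATURES = ['volt', 'current', 'soc', 'max_single_volt',
            'min_single_volt', 'max_temp', 'min_temp', 'timestamp']

# Build a synthetic sample with the documented layout (values are made up)
data = np.zeros((256, 8))
data[:, 7] = np.arange(256)  # fake timestamps
sample = (data, {'label': '10', 'mileage': 12820.5})

# Round-trip through pickle, in memory instead of a real file
restored_data, metadata = pickle.loads(pickle.dumps(sample))

# The first character of the label string carries the class: '00' -> 0, '10' -> 1
is_anomalous = int(metadata['label'][0])
timestamps = restored_data[:, FEATURES.index('timestamp')]
print(restored_data.shape, is_anomalous, timestamps[:3])
```

Looking up the column by name with `FEATURES.index` avoids hard-coding column positions such as 7.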
file_19143 = pd.read_pickle("Train/19143.pkl")
print(len(file_19143))
print(len(file_19143[0]))
print(len(file_19143[0][0]))
print(len(file_19143[1]))
print(file_19143[0])
print(file_19143[0][0])
print(file_19143[1])
2
256
8
2
[[ 1.562e+02 -4.600e+00 4.570e+01 ... 2.400e+02 2.220e+02 9.004e+03]
 [ 1.562e+02 -4.600e+00 4.570e+01 ... 2.400e+02 2.220e+02 9.005e+03]
 [ 1.562e+02 -4.600e+00 4.570e+01 ... 2.400e+02 2.220e+02 9.007e+03]
 ...
 [ 1.565e+02 -4.600e+00 4.570e+01 ... 2.400e+02 2.220e+02 9.552e+03]
 [ 1.565e+02 -4.600e+00 4.570e+01 ... 2.400e+02 2.220e+02 9.562e+03]
 [ 1.565e+02 -4.600e+00 4.570e+01 ... 2.400e+02 2.220e+02 9.573e+03]]
[ 1.562e+02 -4.600e+00 4.570e+01 1.738e+00 1.731e+00 2.400e+02 2.220e+02 9.004e+03]
{'label': '10', 'mileage': 12820.5}
# Each file holds 256 records that are nearly identical, so keep only the last one
# (columns 0 to 6, which drops the timestamp)
file_19143[0][:, 0:7][-1]
array([156.5 , -4.6 , 45.7 , 1.741, 1.735, 240. , 222. ])
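Keeping only the last row discards whatever variation the window does contain. As a hedged alternative, the sketch below summarizes a whole window with a few statistics plus the gaps between timestamps. This post does not use it; the function name `window_features` and the chosen statistics are assumptions, and the window is random data.

```python
import numpy as np

def window_features(window):
    """Summarize one (n_rows, 8) window; the last column is the timestamp."""
    values = window[:, :7]
    gaps = np.diff(window[:, 7])  # time between consecutive records
    return np.concatenate([
        values[-1],                               # last row, as used in this post
        values.mean(axis=0),                      # per-feature mean
        values.std(axis=0),                       # per-feature standard deviation
        values.max(axis=0) - values.min(axis=0),  # per-feature range
        [gaps.mean(), gaps.max()],                # sampling-gap statistics
    ])

rng = np.random.default_rng(0)
window = rng.normal(size=(256, 8))
window[:, 7] = np.cumsum(rng.uniform(1, 10, size=256))  # increasing fake timestamps
features = window_features(window)
print(features.shape)  # (30,)
```

The result has 7 × 4 + 2 = 30 values per window, compared with the 7 used in this post.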
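4. Loading the dataset
def load_data(pkl_list, label=True):
    X = []
    y = []
    for each_pkl in pkl_list:
        # Use a context manager so every file handle is closed
        with open(each_pkl, 'rb') as pic:
            item = pickle.load(pic)
        X.append(item[0][:, 0:7][-1])  # keep the last row of each sliding window
        if label:
            y.append(int(item[1]['label'][0]))  # '00' -> 0, '10' -> 1
        else:
            y.append(0)  # placeholder label for the test set
    X = np.vstack(X)
    y = np.vstack(y)
    return X, y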
# Get the list of test files
test_data_path = 'Test_A'  # directory that holds the data
test_pkl_files = glob(test_data_path + '/*.pkl')
print("Number of files", len(test_pkl_files))
print(test_pkl_files[:3])
Number of files 6234
['Test_A/10463.pkl', 'Test_A/8487.pkl', 'Test_A/4933.pkl']
X_train, y_train = load_data(train_pkl_files)
X_test, y_test = load_data(test_pkl_files, label=False)
X_train.shape, X_test.shape
((28389, 7), (6234, 7))
print(X_train[0],y_train[0])
[162.1 -3.6 48.9 1.804 1.792 204. 186. ] [0]
columns=['volt','current','soc','max_single_volt','min_single_volt','max_temp','min_temp']
with open('train.csv', 'w') as f:
    f.write('volt,current,soc,max_single_volt,min_single_volt,max_temp,min_temp,label\n')
    for i in range(len(X_train)):  # 28389 rows; use len() instead of a hard-coded count
        for j in range(7):
            f.write(str(X_train[i][j]) + ',')
        f.write(str(y_train[i][0]) + '\n')
# The test set has no labels; its label column holds the placeholder 0 from load_data
with open('test.csv', 'w') as f:
    f.write('volt,current,soc,max_single_volt,min_single_volt,max_temp,min_temp,label\n')
    for i in range(len(X_test)):  # 6234 rows
        for j in range(7):
            f.write(str(X_test[i][j]) + ',')
        f.write(str(y_test[i][0]) + '\n')
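The same CSV layout can also be produced with pandas instead of manual string concatenation. The sketch below uses two stand-in rows copied from the values shown in this post and writes to an in-memory buffer. `X_demo`, `y_demo`, and `buffer` are names introduced only for this example.

```python
import io
import numpy as np
import pandas as pd

columns = ['volt', 'current', 'soc', 'max_single_volt',
           'min_single_volt', 'max_temp', 'min_temp']

# Stand-ins for X_train / y_train with the shapes load_data returns
X_demo = np.array([[162.1, -3.6, 48.9, 1.804, 1.792, 204.0, 186.0],
                   [151.1, -43.3, 24.1, 1.684, 1.672, 180.0, 162.0]])
y_demo = np.array([[0], [1]])

frame = pd.DataFrame(X_demo, columns=columns)
frame['label'] = y_demo.ravel()  # flatten the (n, 1) label column

buffer = io.StringIO()  # pass a path such as 'train.csv' to write a real file
frame.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Building the DataFrame directly would also make the later `pd.read_csv` round trip unnecessary, since the frame can be passed straight to the training function.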
III. Prediction with LightGBM
1. Installing LightGBM
!pip install -q catboost
!pip install -q lightgbm
2. Defining the classification function
import time
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold


def train_model_classification(X, X_test, y, params, num_classes=2, folds=None, model_type='lgb',
                               eval_metric='logloss', columns=None, plot_feature_importance=False,
                               model=None, verbose=10000, early_stopping_rounds=200,
                               splits=None, n_folds=3):
    """
    Train a classification model with cross-validation.
    Returns a dict with: oof predictions, test predictions, scores and, if necessary, feature importances.
    :params: X - training data, pd.DataFrame
    :params: X_test - test data, pd.DataFrame
    :params: y - target
    :params: folds - folds to split data
    :params: model_type - type of model to train
    :params: eval_metric - evaluation metric
    :params: columns - feature columns
    :params: plot_feature_importance - whether to collect feature importances
    :params: model - sklearn model, works only for "sklearn" model type
    """
    start_time = time.time()
    columns = X.columns if columns is None else columns
    X_test = X_test[columns]
    # Decide n_splits before `splits` is reassigned; the original checked
    # `splits is None` afterwards, so it always fell back to n_folds.
    n_splits = folds.n_splits if splits is None else n_folds
    splits = folds.split(X, y) if splits is None else splits

    # to set up scoring parameters
    metrics_dict = {
        'logloss': {
            'lgb_metric_name': 'logloss',
            'xgb_metric_name': 'mlogloss',
            'catboost_metric_name': 'Logloss',
            'sklearn_scoring_function': metrics.log_loss
        },
        'lb_score_method': {
            'sklearn_scoring_f1': metrics.f1_score,  # leaderboard metric
            'sklearn_scoring_accuracy': metrics.accuracy_score,  # leaderboard metric
            'sklearn_scoring_auc': metrics.roc_auc_score
        },
    }
    result_dict = {}

    # out-of-fold predictions on train data
    oof = np.zeros(shape=(len(X), num_classes))
    # averaged predictions on test data
    prediction = np.zeros(shape=(len(X_test), num_classes))
    # lists of scores on folds
    acc_scores = []
    scores = []
    # feature importance
    feature_importance = pd.DataFrame()

    # split and train on folds
    for fold_n, (train_index, valid_index) in enumerate(splits):
        if verbose:
            print(f'Fold {fold_n + 1} started at {time.ctime()}')
        if isinstance(X, np.ndarray):
            X_train, X_valid = X[train_index], X[valid_index]
            y_train, y_valid = y[train_index], y[valid_index]
        else:
            X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

        if model_type == 'lgb':
            model = lgb.LGBMClassifier(**params)
            # NOTE: `verbose` and `early_stopping_rounds` are fit() keywords of the
            # LightGBM release used for this run; LightGBM 4.x replaced them with
            # callbacks (lgb.log_evaluation, lgb.early_stopping).
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],
                      verbose=verbose,
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)

        if model_type == 'xgb':
            model = xgb.XGBClassifier(**params)
            # NOTE: these keywords also follow the XGBoost release used at the time.
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['xgb_metric_name'],
                      verbose=bool(verbose),  # xgb expects a bool here
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, ntree_limit=model.best_ntree_limit)

        if model_type == 'sklearn':
            model.fit(X_train, y_train)
            y_pred_valid = model.predict_proba(X_valid)
            score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)
            print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')
            y_pred = model.predict_proba(X_test)

        if model_type == 'cat':
            model = CatBoostClassifier(iterations=20000,
                                       eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],
                                       **params,
                                       loss_function=metrics_dict[eval_metric]['catboost_metric_name'])
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[],
                      use_best_model=True, verbose=False)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test)

        oof[valid_index] = y_pred_valid
        # evaluation metrics: accuracy on the argmax class, AUC on the class-1 probability
        acc_scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_accuracy'](y_valid, np.argmax(y_pred_valid, axis=1)))
        scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_auc'](y_valid, y_pred_valid[:, 1]))
        print(acc_scores)
        print(scores)
        prediction += y_pred

        if model_type in ('lgb', 'xgb') and plot_feature_importance:
            # feature importance of this fold
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

    # average the accumulated test predictions over the folds
    prediction /= n_splits
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))  # AUC
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(acc_scores), np.std(acc_scores)))  # accuracy

    result_dict['oof'] = oof
    result_dict['prediction'] = prediction
    result_dict['acc_scores'] = acc_scores
    result_dict['scores'] = scores

    if model_type == 'lgb' or model_type == 'xgb':
        if plot_feature_importance:
            feature_importance["importance"] /= n_splits
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False).index
            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]
            best_features.to_csv('./feature_importance_lgb.csv', index=None)
            # plt.figure(figsize=(16, 12))
            # sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
            # plt.title('LGB Features (avg over folds)')
            # plt.show()
            result_dict['feature_importance'] = feature_importance
            # print(feature_importance)

    end_time = time.time()
    print("train_model_classification cost time:{}".format(end_time - start_time))
    return result_dict
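The core bookkeeping of the function above is that each training row gets exactly one out-of-fold (OOF) prediction, while test predictions are averaged over the folds. It can be sketched without any boosting library. The nearest-centroid "model" below is a toy stand-in, not LightGBM. The folds are random rather than stratified, and all names and data are assumptions for this example.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy stand-in model: one centroid per class."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_proba(centroids, X):
    """Turn distances to the two centroids into pseudo-probabilities."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    score = np.exp(-dist)
    return score / score.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(3, 1, (40, 3))])
y = np.array([0] * 60 + [1] * 40)
X_test = rng.normal(1.5, 2, (10, 3))

n_splits = 5
fold_of = rng.permutation(len(X)) % n_splits  # random, equally sized folds
oof = np.zeros((len(X), 2))
prediction = np.zeros((len(X_test), 2))

for fold in range(n_splits):
    valid = fold_of == fold
    model = fit_centroids(X[~valid], y[~valid])
    oof[valid] = predict_proba(model, X[valid])  # each row is predicted exactly once
    prediction += predict_proba(model, X_test)   # accumulate test predictions
prediction /= n_splits                           # average over the folds

accuracy = (oof.argmax(axis=1) == y).mean()
print(f"OOF accuracy: {accuracy:.3f}")
```

Because every OOF row comes from a model that never saw that row, metrics computed on `oof` estimate generalization, which is what the per-fold accuracy and AUC lists in the function report.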
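3. Setting the parameters
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'n_estimators': 100000,  # upper bound; early stopping ends training much sooner
    'learning_rate': 0.1,
    'random_state': 2948,
    'bagging_freq': 8,
    'bagging_fraction': 0.80718,
    # 'bagging_seed': 11,
    'feature_fraction': 0.7,  # 0.3
    'feature_fraction_seed': 11,
    'max_depth': 9,
    'min_data_in_leaf': 40,
    'min_child_weight': 0.18654,  # alias of min_sum_hessian_in_leaf; ignored (see training log)
    "min_split_gain": 0.35079,
    'min_sum_hessian_in_leaf': 1.11347,
    'num_leaves': 29,
    'num_threads': 4,
    "lambda_l1": 0.55831,
    'lambda_l2': 1.67906,
    'cat_smooth': 10.4,
    'subsample': 0.7,  # alias of bagging_fraction; ignored (see training log)
    'colsample_bytree': 0.7,  # alias of feature_fraction; ignored (see training log)
    'n_jobs': -1,  # alias of num_threads; ignored (see training log)
    'metric': 'auc'
    # 'verbosity': 1,
}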
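Several keys in this dictionary are aliases of one another, which is why LightGBM prints "will be ignored" warnings in the training log. The sketch below is a small pure-Python check for such duplicates. The alias pairs are taken from those warnings. The helper names and `demo_params` are hypothetical and not part of LightGBM.

```python
# Alias pairs taken from the warnings LightGBM prints for this parameter set;
# the first name in each pair is the one reported as taking effect.
ALIAS_GROUPS = [
    ('feature_fraction', 'colsample_bytree'),
    ('bagging_fraction', 'subsample'),
    ('bagging_freq', 'subsample_freq'),
    ('lambda_l1', 'reg_alpha'),
    ('lambda_l2', 'reg_lambda'),
    ('min_data_in_leaf', 'min_child_samples'),
    ('min_sum_hessian_in_leaf', 'min_child_weight'),
    ('num_threads', 'n_jobs'),
]

def find_alias_conflicts(params):
    """Return (kept, ignored) pairs where both aliases are set explicitly."""
    return [(kept, ignored) for kept, ignored in ALIAS_GROUPS
            if kept in params and ignored in params]

def drop_ignored_aliases(params):
    """Return a copy of params without the aliases that would be ignored."""
    ignored = {name for _, name in find_alias_conflicts(params)}
    return {key: value for key, value in params.items() if key not in ignored}

# A shortened, illustrative subset of the parameters above
demo_params = {
    'feature_fraction': 0.7, 'colsample_bytree': 0.7,
    'bagging_fraction': 0.80718, 'subsample': 0.7,
    'num_threads': 4, 'n_jobs': -1,
    'num_leaves': 29,
}
print(find_alias_conflicts(demo_params))
print(sorted(drop_ignored_aliases(demo_params)))
```

Dropping the ignored aliases leaves the effective configuration unchanged and makes it clear which values are actually in use, for example `bagging_fraction=0.80718` rather than `subsample=0.7`.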
4. Reading the data
columns = ['volt', 'current', 'soc', 'max_single_volt', 'min_single_volt', 'max_temp', 'min_temp']
import pandas as pd
train = pd.read_csv("train.csv", header=0)
test = pd.read_csv("test.csv", header=0)
print(train.columns)
Index(['volt', 'current', 'soc', 'max_single_volt', 'min_single_volt', 'max_temp', 'min_temp', 'label'], dtype='object')
print(columns)
['volt', 'current', 'soc', 'max_single_volt', 'min_single_volt', 'max_temp', 'min_temp']
train.head()
|   | volt | current | soc | max_single_volt | min_single_volt | max_temp | min_temp | label |
| 0 | 162.1 | -3.6 | 48.9 | 1.804 | 1.792 | 204.0 | 186.0 | 0 |
| 1 | 164.5 | -4.7 | 62.2 | 1.832 | 1.819 | 96.0 | 84.0 | 0 |
| 2 | 153.2 | -18.9 | 33.7 | 1.706 | 1.700 | 144.0 | 132.0 | 0 |
| 3 | 159.2 | -4.0 | 43.2 | 1.772 | 1.768 | 72.0 | 60.0 | 0 |
| 4 | 151.1 | -43.3 | 24.1 | 1.684 | 1.672 | 180.0 | 162.0 | 1 |
train['label']
0        0
1        0
2        0
3        0
4        1
        ..
28384    0
28385    0
28386    0
28387    0
28388    0
Name: label, Length: 28389, dtype: int64
5. Training
n_fold = 10
num_classes = 2
print("Number of classes num_classes:{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314, shuffle=True)
result_dict_lgb = train_model_classification(X=train[columns],
                                             X_test=test[columns],
                                             y=train.label,
                                             params=lgb_params,
                                             num_classes=num_classes,
                                             folds=folds,
                                             model_type='lgb',
                                             eval_metric='logloss',
                                             plot_feature_importance=True,
                                             verbose=200,
                                             n_folds=n_fold)
Number of classes num_classes:2
Fold 1 started at Wed Oct 5 02:55:25 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.111209  training's auc: 0.989518  valid_1's binary_logloss: 0.153545  valid_1's auc: 0.972961
[400]  training's binary_logloss: 0.0948993  training's auc: 0.993179  valid_1's binary_logloss: 0.142958  valid_1's auc: 0.976604
Early stopping, best iteration is:
[302]  training's binary_logloss: 0.0949081  training's auc: 0.993177  valid_1's binary_logloss: 0.142938  valid_1's auc: 0.976624
[0.943994364212751]
[0.9766040709840477]
[... the same eight [LightGBM] warnings repeat at the start of every fold, and the accuracy and AUC lists grow by one entry after each fold; only the final lists are shown below ...]
Fold 2 started at Wed Oct 5 02:55:26 2022
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.112528  training's auc: 0.989363  valid_1's binary_logloss: 0.148536  valid_1's auc: 0.974635
[400]  training's binary_logloss: 0.108259  training's auc: 0.990356  valid_1's binary_logloss: 0.145747  valid_1's auc: 0.975471
Early stopping, best iteration is:
[216]  training's binary_logloss: 0.108885  training's auc: 0.990224  valid_1's binary_logloss: 0.145846  valid_1's auc: 0.975544
Fold 3 started at Wed Oct 5 02:55:27 2022
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.113221  training's auc: 0.989477  valid_1's binary_logloss: 0.135198  valid_1's auc: 0.981856
[400]  training's binary_logloss: 0.0984195  training's auc: 0.992621  valid_1's binary_logloss: 0.12575  valid_1's auc: 0.984026
Early stopping, best iteration is:
[314]  training's binary_logloss: 0.0984448  training's auc: 0.992615  valid_1's binary_logloss: 0.125713  valid_1's auc: 0.984046
Fold 4 started at Wed Oct 5 02:55:29 2022
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.113117  training's auc: 0.989493  valid_1's binary_logloss: 0.142585  valid_1's auc: 0.977954
[400]  training's binary_logloss: 0.103629  training's auc: 0.991649  valid_1's binary_logloss: 0.137539  valid_1's auc: 0.979054
Early stopping, best iteration is:
[249]  training's binary_logloss: 0.103959  training's auc: 0.991585  valid_1's binary_logloss: 0.137461  valid_1's auc: 0.979104
Fold 5 started at Wed Oct 5 02:55:30 2022
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.114539  training's auc: 0.989179  valid_1's binary_logloss: 0.150543  valid_1's auc: 0.972895
[400]  training's binary_logloss: 0.0951711  training's auc: 0.99347  valid_1's binary_logloss: 0.141725  valid_1's auc: 0.975349
Early stopping, best iteration is:
[316]  training's binary_logloss: 0.0961998  training's auc: 0.993262  valid_1's binary_logloss: 0.141733  valid_1's auc: 0.97543
Fold 6 started at Wed Oct 5 02:55:31 2022
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.113193  training's auc: 0.989564  valid_1's binary_logloss: 0.140258  valid_1's auc: 0.977863
[400]  training's binary_logloss: 0.0941066  training's auc: 0.993575  valid_1's binary_logloss: 0.130028  valid_1's auc: 0.980503
Early stopping, best iteration is:
[330]  training's binary_logloss: 0.0942212  training's auc: 0.993544  valid_1's binary_logloss: 0.129913  valid_1's auc: 0.980509
Fold 7 started at Wed Oct 5 02:55:32 2022
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.114392  training's auc: 0.989254  valid_1's binary_logloss: 0.141363  valid_1's auc: 0.977499
[400]  training's binary_logloss: 0.104329  training's auc: 0.991732  valid_1's binary_logloss: 0.137656  valid_1's auc: 0.978442
Early stopping, best iteration is:
[268]  training's binary_logloss: 0.104419  training's auc: 0.991717  valid_1's binary_logloss: 0.137522  valid_1's auc: 0.978487
Fold 8 started at Wed Oct 5 02:55:34 2022
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.112407  training's auc: 0.989382  valid_1's binary_logloss: 0.144779  valid_1's auc: 0.976164
[400]  training's binary_logloss: 0.101891  training's auc: 0.991921  valid_1's binary_logloss: 0.139636  valid_1's auc: 0.977802
Early stopping, best iteration is:
[262]  training's binary_logloss: 0.101891  training's auc: 0.991921  valid_1's binary_logloss: 0.139636  valid_1's auc: 0.977802
Fold 9 started at Wed Oct 5 02:55:35 2022
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.113429  training's auc: 0.989124  valid_1's binary_logloss: 0.146888  valid_1's auc: 0.974755
[400]  training's binary_logloss: 0.101734  training's auc: 0.991863  valid_1's binary_logloss: 0.139987  valid_1's auc: 0.976694
Early stopping, best iteration is:
[283]  training's binary_logloss: 0.10179  training's auc: 0.991847  valid_1's binary_logloss: 0.139981  valid_1's auc: 0.976687
Fold 10 started at Wed Oct 5 02:55:36 2022
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.113542  training's auc: 0.989398  valid_1's binary_logloss: 0.135824  valid_1's auc: 0.980073
[400]  training's binary_logloss: 0.101569  training's auc: 0.992151  valid_1's binary_logloss: 0.128997  valid_1's auc: 0.98184
Early stopping, best iteration is:
[259]  training's binary_logloss: 0.10162  training's auc: 0.992135  valid_1's binary_logloss: 0.128966  valid_1's auc: 0.981838
[0.943994364212751, 0.9468122578372666, 0.9499823881648468, 0.9461077844311377, 0.9482212046495245, 0.952448045086298, 0.9443466009158155, 0.9496301514617823, 0.9496301514617823, 0.9513742071881607]
[0.9766040709840477, 0.9755442019729869, 0.9840294951581199, 0.9791042748050114, 0.9754300622333343, 0.9804958863031711, 0.9784286383473592, 0.977801952943432, 0.9766869412507302, 0.9818377898309386]
CV mean score: 0.9786, std: 0.0027.
CV mean score: 0.9483, std: 0.0027.
train_model_classification cost time:12.621715545654297
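The two "CV mean score" lines are easy to misread because they share a label. In the training function the first is computed from the AUC list and the second from the accuracy list. The sketch below recomputes both from the per-fold values in the log, using only the standard library.

```python
import statistics

# Per-fold scores copied from the final lists in the training log
auc_scores = [0.9766040709840477, 0.9755442019729869, 0.9840294951581199,
              0.9791042748050114, 0.9754300622333343, 0.9804958863031711,
              0.9784286383473592, 0.977801952943432, 0.9766869412507302,
              0.9818377898309386]
acc_scores = [0.943994364212751, 0.9468122578372666, 0.9499823881648468,
              0.9461077844311377, 0.9482212046495245, 0.952448045086298,
              0.9443466009158155, 0.9496301514617823, 0.9496301514617823,
              0.9513742071881607]

for name, values in (('AUC', auc_scores), ('accuracy', acc_scores)):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, the same default as np.std
    print(f'{name}: mean {mean:.4f}, std {std:.4f}')
# AUC: mean 0.9786, std 0.0027
# accuracy: mean 0.9483, std 0.0027
```

The standard deviation of about 0.003 across ten folds suggests the score is stable with respect to how the data is split.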
6. Saving the predictions
# Keep only the file name of each test path; the order matches the rows of test.csv
filenames = []
for filename in test_pkl_files:
    filename = filename.split('/')[-1]
    filenames.append(filename)
acc_score = np.mean(result_dict_lgb['acc_scores'])
score = np.mean(result_dict_lgb['scores'])
print(score)
# The submission must have exactly these two column names
result = pd.DataFrame({'file_name': filenames, 'score': result_dict_lgb['prediction'][:, 1]})
result.to_csv('lgb.csv', index=False)
# Save the probability files
pd.DataFrame(result_dict_lgb['oof']).to_csv('lgb_acc{}auc{}trainoof.csv'.format(acc_score, score), index=False, header=False)
pd.DataFrame(result_dict_lgb['prediction']).to_csv('lgb_acc{}auc{}testoof.csv'.format(acc_score, score), index=False, header=False)
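Before uploading, it can help to check that file names and scores line up and look sane. The sketch below uses `os.path.basename`, which follows the platform's path rules instead of splitting on '/', together with two simple checks. The scores are made up, and the checks are suggestions rather than competition requirements.

```python
import os.path

# Example paths in the same form that glob returned earlier in this post
test_paths = ['Test_A/10463.pkl', 'Test_A/8487.pkl', 'Test_A/4933.pkl']
probabilities = [0.02, 0.91, 0.40]  # made-up scores for illustration

file_names = [os.path.basename(path) for path in test_paths]
rows = list(zip(file_names, probabilities))

# Sanity checks before writing the submission file
names_are_unique = len(set(file_names)) == len(file_names)
scores_in_range = all(0.0 <= score <= 1.0 for _, score in rows)
print(rows[0], names_are_unique, scores_in_range)
```

Pairing names and scores with `zip` relies on both lists being in the same order, which holds here because the predictions were produced in the order of `test_pkl_files`.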
0.978596331382913
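IV. Submission
The submission result is as follows: [screenshot of the submission result]
The planned next step is to ensemble LightGBM, XGBoost, and CatBoost models.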