I. Battery Data Anomaly Detection with LightGBM

1. Competition overview

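The automotive industry is undergoing a major transformation. As the new-energy vehicle market keeps growing, battery safety is drawing more and more attention. Battery anomaly detection suffers from poor vehicle data quality, low detection rates, high false-alarm rates, and a large volume of spurious alarms that cannot be fed directly into automated operations.

To examine battery safety more effectively, the competition solicits strong anomaly detection solutions that process real-vehicle data through feature extraction, parameter optimization, anomaly comparison, and similar techniques, improving the detection results so that they can be better applied to vehicle early warning, failure-mode identification, and other scenarios.

Fault detection for new-energy vehicle batteries matters for finding vehicle problems promptly, eliminating hazards, and protecting lives and property. Battery faults take many forms, including thermal runaway, lithium plating, and electrolyte leakage. The competition data contains several fault types, but all of them carry the single fault label "1" and are not distinguished further.

Fault detection usually suffers from a shortage of fault labels. For this competition the data has been filtered so that the ratio of normal to faulty samples is not extremely skewed; even so, conventional anomaly detection algorithms remain a very good choice.

Battery data is time-series data, and the 'timestamp' column holds the timestamp of each record. Understanding how time-series data is handled may improve anomaly detection. In addition, abnormal data can look quite "normal", so more features need to be extracted for analysis. To take part, follow the competition link: vLoong Energy AI Challenge (vLoong能源AI挑战赛).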

II. Data Processing

1. Unpacking the data

# ! unzip -qoa data/data168245/Train.zip
# ! unzip -qoa  data/data168245/Test_A.zip

import numpy as np
from glob import glob
import pandas as pd
import random
from tqdm import tqdm
from sklearn import metrics
import pickle
import warnings
warnings.filterwarnings("ignore")

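2. Loading the pkl files

The training labels are stored inside the pkl files and can be used to tell normal segments from abnormal ones. Note that you need to supply the path to the training set.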

# List the training-set files
data_path='Train'  # directory that holds the data
train_pkl_files = glob(data_path+'/*.pkl')
print("Number of files:",len(train_pkl_files))
print(train_pkl_files[:3])

Number of files: 28389
['Train/20291.pkl', 'Train/22017.pkl', 'Train/4823.pkl']

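3. Inspecting the data

Each file is a .pkl file whose content is a tuple, (data, metadata):

data: an array of shape (256, 8); the columns correspond to the features ['volt', 'current', 'soc', 'max_single_volt', 'min_single_volt', 'max_temp', 'min_temp', 'timestamp'].
metadata: holds the label and mileage fields; a label of '00' marks a normal segment and '10' an abnormal segment.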
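The (data, metadata) layout described above can be unpacked into a labelled table, which makes the columns easier to inspect by name. The following is only a minimal sketch: `load_segment`, `FEATURE_NAMES`, and the synthetic demo file are illustrative names introduced here, and the handling of a missing label is an assumption rather than something the competition data is known to need.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Column order as listed in the data description above.
FEATURE_NAMES = ['volt', 'current', 'soc', 'max_single_volt',
                 'min_single_volt', 'max_temp', 'min_temp', 'timestamp']


def load_segment(path):
    """Read one (data, metadata) pickle.

    Returns a DataFrame with named columns, a 0/1 label (None when the
    metadata carries no label) and the mileage value.
    """
    with open(path, 'rb') as f:
        data, metadata = pickle.load(f)
    frame = pd.DataFrame(np.asarray(data), columns=FEATURE_NAMES)
    label = metadata.get('label')
    # '00' marks a normal segment and '10' an abnormal one,
    # so the first character is enough to build a 0/1 target.
    is_abnormal = int(label[0]) if label is not None else None
    return frame, is_abnormal, metadata.get('mileage')


# Synthetic stand-in for a real file such as Train/19143.pkl.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')
with open(demo_path, 'wb') as f:
    pickle.dump((np.zeros((256, 8)), {'label': '10', 'mileage': 12820.5}), f)

frame, is_abnormal, mileage = load_segment(demo_path)
print(frame.shape, is_abnormal, mileage)
```

With a real file, `frame.describe()` gives a quick view of how much each channel varies inside one 256-row window.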

file_19143=pd.read_pickle("Train/19143.pkl")
print(len(file_19143))
print(len(file_19143[0]))
print(len(file_19143[0][0]))
print(len(file_19143[1]))
print(file_19143[0])
print(file_19143[0][0])
print(file_19143[1])

2
256
8
2
[[ 1.562e+02 -4.600e+00  4.570e+01 ...  2.400e+02  2.220e+02  9.004e+03]
 [ 1.562e+02 -4.600e+00  4.570e+01 ...  2.400e+02  2.220e+02  9.005e+03]
 [ 1.562e+02 -4.600e+00  4.570e+01 ...  2.400e+02  2.220e+02  9.007e+03]
 ...
 [ 1.565e+02 -4.600e+00  4.570e+01 ...  2.400e+02  2.220e+02  9.552e+03]
 [ 1.565e+02 -4.600e+00  4.570e+01 ...  2.400e+02  2.220e+02  9.562e+03]
 [ 1.565e+02 -4.600e+00  4.570e+01 ...  2.400e+02  2.220e+02  9.573e+03]]
[ 1.562e+02 -4.600e+00  4.570e+01  1.738e+00  1.731e+00  2.400e+02  2.220e+02  9.004e+03]
{'label': '10', 'mileage': 12820.5}

# Each file holds 256 records that are nearly identical, so keep only the last one
file_19143[0][:,0:7][-1]

array([156.5  ,  -4.6  ,  45.7  ,   1.741,   1.735, 240.   , 222.   ])
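Keeping only the last row discards whatever variation exists inside a window. Since the competition notes suggest that extra features may help, one possible alternative is to summarise each window with simple per-channel statistics. This is a sketch under assumptions: `window_features`, `window_feature_names`, and the choice of statistics are illustrative and are not the approach used in the rest of this notebook.

```python
import numpy as np

CHANNELS = ['volt', 'current', 'soc', 'max_single_volt',
            'min_single_volt', 'max_temp', 'min_temp']


def window_features(data):
    """Summarise one (256, 8) window as a flat feature vector.

    The timestamp column (index 7) is dropped; every remaining channel
    contributes its mean, standard deviation, minimum, maximum and last value.
    """
    values = np.asarray(data, dtype=float)[:, :7]
    stats = [values.mean(axis=0), values.std(axis=0),
             values.min(axis=0), values.max(axis=0), values[-1]]
    return np.concatenate(stats)


def window_feature_names():
    """Names matching the layout produced by window_features."""
    return [f'{channel}_{stat}'
            for stat in ('mean', 'std', 'min', 'max', 'last')
            for channel in CHANNELS]


# Synthetic window standing in for item[0] of a real pickle.
rng = np.random.default_rng(0)
demo_window = rng.normal(size=(256, 8))
features = window_features(demo_window)
print(features.shape)  # 5 statistics x 7 channels
```

The last seven entries of the vector equal the single row selected above, so a model trained on these features sees everything the last-row approach sees plus the within-window statistics.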

4. Reading the dataset

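def load_data(pkl_list, label=True):
    """Return (X, y): the last row of each window (timestamp dropped) and its label."""
    X = []
    y = []
    for each_pkl in pkl_list:
        # Use a context manager so every file handle is closed after reading
        with open(each_pkl, 'rb') as pic:
            item = pickle.load(pic)
        X.append(item[0][:, 0:7][-1])  # keep the last row of each sliding window
        if label:
            y.append(int(item[1]['label'][0]))  # '00' -> 0 (normal), '10' -> 1 (abnormal)
        else:
            y.append(0)  # placeholder label for the unlabelled test set
    X = np.vstack(X)
    y = np.vstack(y)
    return X, y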

# List the test-set files
test_data_path='Test_A'  # directory that holds the data
test_pkl_files = glob(test_data_path+'/*.pkl')
print("Number of files:",len(test_pkl_files))
print(test_pkl_files[:3])

Number of files: 6234
['Test_A/10463.pkl', 'Test_A/8487.pkl', 'Test_A/4933.pkl']

X_train,y_train=load_data(train_pkl_files)
X_test,y_test=load_data(test_pkl_files,label=False)
X_train.shape,X_test.shape

((28389, 7), (6234, 7))

print(X_train[0],y_train[0])

[162.1    -3.6    48.9     1.804   1.792 204.    186.   ] [0]

columns=['volt','current','soc','max_single_volt','min_single_volt','max_temp','min_temp']

with open('train.csv','w') as f:
    f.write('volt,current,soc,max_single_volt,min_single_volt,max_temp,min_temp,label\n')
    for i in range(len(X_train)):  # one CSV row per training segment
        for j in range(7):
            f.write(str(X_train[i][j])+',')
        f.write(str(y_train[i][0]) +'\n')

with open('test.csv','w') as f:
    f.write('volt,current,soc,max_single_volt,min_single_volt,max_temp,min_temp,label\n')
    for i in range(len(X_test)):  # test labels are placeholder zeros
        for j in range(7):
            f.write(str(X_test[i][j])+',')
        f.write(str(y_test[i][0]) +'\n')
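The same files could also be produced with pandas instead of manual string concatenation. The sketch below assumes arrays shaped like `X_train` and `y_train` above; `to_frame` is a hypothetical helper and the two demo rows are stand-ins, not a replacement for the real arrays.

```python
import numpy as np
import pandas as pd

columns = ['volt', 'current', 'soc', 'max_single_volt',
           'min_single_volt', 'max_temp', 'min_temp']


def to_frame(X, y):
    """Combine a feature matrix and an (n, 1) label array into one DataFrame."""
    frame = pd.DataFrame(X, columns=columns)
    frame['label'] = np.asarray(y).ravel()
    return frame


# Two demo rows with the same layout as X_train / y_train.
X_demo = np.array([[162.1, -3.6, 48.9, 1.804, 1.792, 204.0, 186.0],
                   [151.1, -43.3, 24.1, 1.684, 1.672, 180.0, 162.0]])
y_demo = np.array([[0], [1]])

demo = to_frame(X_demo, y_demo)
csv_text = demo.to_csv(index=False)  # returns a string when no path is given
print(csv_text)
```

Calling `to_frame(X_train, y_train).to_csv('train.csv', index=False)` would write the file directly, and the intermediate DataFrame can be passed to the model without the round trip through disk.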

III. Prediction with LightGBM

1. Installing LightGBM

!pip install -q catboost

!pip install -q lightgbm

2. Defining the classification function

import time
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
def train_model_classification(X, X_test, y, params, num_classes=2,
                               folds=None, model_type='lgb',
                               eval_metric='logloss', columns=None,
                               plot_feature_importance=False,
                               model=None, verbose=10000,
                               early_stopping_rounds=200,
                               splits=None, n_folds=3):
    """
    Train a classification model with cross-validation.
    Returns a dict with the OOF predictions, test predictions, scores and, if requested, feature importances.
    :params: X - training data, pd.DataFrame
    :params: X_test - test data, pd.DataFrame
    :params: y - target
    :params: folds - folds to split data
    :params: model_type - model type ('lgb', 'xgb', 'sklearn' or 'cat')
    :params: eval_metric - evaluation metric
    :params: columns - feature columns
    :params: plot_feature_importance - whether to collect feature importances
    :params: model - sklearn model, works only for "sklearn" model type
    """
    start_time = time.time()
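    columns = X.columns if columns is None else columns
    X_test = X_test[columns]
    # Resolve the fold count before `splits` is overwritten; otherwise the
    # `splits is None` test is always False and n_folds is always used.
    n_splits = folds.n_splits if splits is None else n_folds
    splits = folds.split(X, y) if splits is None else splits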
    # to set up scoring parameters
    metrics_dict = {
        'logloss': {
            'lgb_metric_name': 'logloss',
            'xgb_metric_name': 'mlogloss',
            'catboost_metric_name': 'Logloss',
            'sklearn_scoring_function': metrics.log_loss
        },
        'lb_score_method': {
            'sklearn_scoring_f1': metrics.f1_score,  # leaderboard metric
            'sklearn_scoring_accuracy': metrics.accuracy_score,  # leaderboard metric
            'sklearn_scoring_auc': metrics.roc_auc_score
        },
    }
    result_dict = {}
    # out-of-fold predictions on train data
    oof = np.zeros(shape=(len(X), num_classes))
    # averaged predictions on test data
    prediction = np.zeros(shape=(len(X_test), num_classes))
    # list of scores on folds
    acc_scores = []
    scores = []
    # feature importance
    feature_importance = pd.DataFrame()
    # split and train on folds
    for fold_n, (train_index, valid_index) in enumerate(splits):
        if verbose:
            print(f'Fold {fold_n + 1} started at {time.ctime()}')
        if type(X) == np.ndarray:
            X_train, X_valid = X[train_index], X[valid_index]
            y_train, y_valid = y[train_index], y[valid_index]
        else:
            X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
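        if model_type == 'lgb':
            model = lgb.LGBMClassifier(**params)
            # Note: fit() accepts `verbose` and `early_stopping_rounds` only in
            # LightGBM < 4.0; later releases expect callbacks such as
            # lgb.early_stopping(...) and lgb.log_evaluation(...) instead.
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],
                      verbose=verbose,
                      early_stopping_rounds=early_stopping_rounds)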
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)
        if model_type == 'xgb':
            model = xgb.XGBClassifier(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['xgb_metric_name'],
                      verbose=bool(verbose),  # xgb verbose bool
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, ntree_limit=model.best_ntree_limit)
        if model_type == 'sklearn':
            model = model
            model.fit(X_train, y_train)
            y_pred_valid = model.predict_proba(X_valid)
            score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)
            print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')
            y_pred = model.predict_proba(X_test)
        if model_type == 'cat':
            model = CatBoostClassifier(iterations=20000, eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],
                                       **params,
                                       loss_function=metrics_dict[eval_metric]['catboost_metric_name'])
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True,
                      verbose=False)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test)
        oof[valid_index] = y_pred_valid
        # evaluation metrics: accuracy and AUC on the validation fold
        acc_scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_accuracy'](y_valid, np.argmax(y_pred_valid, axis=1)))
        scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_auc'](y_valid, y_pred_valid[:, 1]))
        print(acc_scores)
        print(scores)
        prediction += y_pred
        if model_type == 'lgb' and plot_feature_importance:
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
        if model_type == 'xgb' and plot_feature_importance:
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
    prediction /= n_splits
    # The first summary line reports AUC (scores), the second accuracy (acc_scores)
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(acc_scores), np.std(acc_scores)))
    result_dict['oof'] = oof
    result_dict['prediction'] = prediction
    result_dict['acc_scores'] = acc_scores
    result_dict['scores'] = scores
    if model_type == 'lgb' or model_type == 'xgb':
        if plot_feature_importance:
            feature_importance["importance"] /= n_splits
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False).index
            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]
            best_features.to_csv('./feature_importance_lgb.csv', index=None)
            # plt.figure(figsize=(16, 12))
            # sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
            # plt.title('LGB Features (avg over folds)')
            # plt.show()
            result_dict['feature_importance'] = feature_importance
            # print(feature_importance)
    end_time = time.time()
    print("train_model_classification cost time:{}".format(end_time - start_time))
    return result_dict
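Stripped of its model-specific branches, the function above follows one pattern: each training row is predicted by the fold model that did not see it (the out-of-fold, or OOF, prediction), while test rows are predicted by every fold model and averaged. The sketch below illustrates only that pattern. It swaps in scikit-learn's LogisticRegression as a lightweight stand-in for the boosted model and uses synthetic data; `oof_predict` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def oof_predict(X, y, X_test, n_splits=5, seed=0):
    """Return out-of-fold probabilities for X and fold-averaged ones for X_test."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros((len(X), 2))
    prediction = np.zeros((len(X_test), 2))
    scores = []
    for train_index, valid_index in folds.split(X, y):
        # LogisticRegression is only a stand-in for the boosted model.
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_index], y[train_index])
        # Each training row is scored by the one model that never saw it.
        oof[valid_index] = model.predict_proba(X[valid_index])
        scores.append(roc_auc_score(y[valid_index], oof[valid_index, 1]))
        # Test rows are scored by every fold model and averaged.
        prediction += model.predict_proba(X_test) / n_splits
    return oof, prediction, scores


# Synthetic data; none of this comes from the competition files.
X_all, y_all = make_classification(n_samples=600, n_features=7, random_state=0)
X_demo, y_demo, X_holdout = X_all[:500], y_all[:500], X_all[500:]
oof, prediction, scores = oof_predict(X_demo, y_demo, X_holdout)
print(np.mean(scores), roc_auc_score(y_demo, oof[:, 1]))
```

The two printed numbers are the mean of the per-fold AUCs and the AUC computed once over all OOF predictions; they are usually close but not identical, which is worth keeping in mind when comparing against a leaderboard score.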

3. Setting the parameters

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'n_estimators': 100000,
    'learning_rate': 0.1,
    'random_state': 2948,
    'bagging_freq': 8,
    'bagging_fraction': 0.80718,
    # 'bagging_seed': 11,
    'feature_fraction': 0.7,  # 0.3
    'feature_fraction_seed': 11,
    'max_depth': 9,
    'min_data_in_leaf': 40,
    'min_child_weight': 0.18654,
    "min_split_gain": 0.35079,
    'min_sum_hessian_in_leaf': 1.11347,
    'num_leaves': 29,
    'num_threads': 4,
    "lambda_l1": 0.55831,
    'lambda_l2': 1.67906,
    'cat_smooth': 10.4,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'n_jobs': -1,
    'metric': 'auc'
    # 'verbosity': 1,
}
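Several keys in this dictionary are aliases of one another, which is why the training output in this notebook contains warnings such as "feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored". A small helper can make the dictionary unambiguous by dropping the alias that LightGBM reports as ignored. This is a sketch: `ALIAS_GROUPS` and `drop_shadowed_aliases` are hypothetical names, and the groups cover only the pairs named in those warnings, not LightGBM's full alias table.

```python
# Alias pairs taken from the warnings in this notebook's training log:
# the first name in each pair is the one reported as taking effect.
ALIAS_GROUPS = [
    ('feature_fraction', 'colsample_bytree'),
    ('bagging_fraction', 'subsample'),
    ('num_threads', 'n_jobs'),
    ('min_sum_hessian_in_leaf', 'min_child_weight'),
]


def drop_shadowed_aliases(params):
    """Return a copy of params without aliases that a sibling key overrides."""
    cleaned = dict(params)
    for winner, *losers in ALIAS_GROUPS:
        if winner in cleaned:
            for loser in losers:
                cleaned.pop(loser, None)
    return cleaned


demo_params = {
    'feature_fraction': 0.7, 'colsample_bytree': 0.7,
    'bagging_fraction': 0.80718, 'subsample': 0.7,
    'num_threads': 4, 'n_jobs': -1,
    'min_sum_hessian_in_leaf': 1.11347, 'min_child_weight': 0.18654,
    'num_leaves': 29,
}
cleaned_params = drop_shadowed_aliases(demo_params)
print(cleaned_params)
```

Removing the shadowed keys should not change the trained model, because the warnings state that those values were already ignored. It may not silence every warning, though: the same log shows warnings for defaults such as reg_alpha=0.0 that the scikit-learn wrapper supplies on its own.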

4. Reading the data

columns=['volt','current','soc','max_single_volt','min_single_volt','max_temp','min_temp']

import pandas as pd
train=pd.read_csv("train.csv",header=0)
test=pd.read_csv("test.csv",header=0)

print(train.columns)

Index(['volt', 'current', 'soc', 'max_single_volt', 'min_single_volt',
       'max_temp', 'min_temp', 'label'],
      dtype='object')

print(columns)

['volt', 'current', 'soc', 'max_single_volt', 'min_single_volt', 'max_temp', 'min_temp']

train.head()

	volt	current	soc	max_single_volt	min_single_volt	max_temp	min_temp	label
0	162.1	-3.6	48.9	1.804	1.792	204.0	186.0	0
1	164.5	-4.7	62.2	1.832	1.819	96.0	84.0	0
2	153.2	-18.9	33.7	1.706	1.700	144.0	132.0	0
3	159.2	-4.0	43.2	1.772	1.768	72.0	60.0	0
4	151.1	-43.3	24.1	1.684	1.672	180.0	162.0	1

train['label']

0        0
1        0
2        0
3        0
4        1
        ..
28384    0
28385    0
28386    0
28387    0
28388    0
Name: label, Length: 28389, dtype: int64
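Before training it can be useful to check how the two classes are balanced and to confirm that stratified folds keep that balance in every validation split. The sketch below uses synthetic labels with an assumed abnormal rate of about 20%; that rate is for illustration only and is not the competition's actual class ratio.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic labels; the 0.2 abnormal rate is an assumption for this demo.
rng = np.random.default_rng(0)
labels = pd.Series((rng.random(1000) < 0.2).astype(int), name='label')
print(labels.value_counts(normalize=True))

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1314)
overall_rate = labels.mean()
fold_rates = []
# StratifiedKFold only needs X for its length, so a placeholder array is enough.
for _, valid_index in folds.split(np.zeros(len(labels)), labels):
    # Share of abnormal samples inside this validation fold.
    fold_rates.append(labels.iloc[valid_index].mean())

print(overall_rate, min(fold_rates), max(fold_rates))
```

On the real data, `train['label'].value_counts(normalize=True)` gives the same overview.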

5. Training

n_fold = 10
num_classes = 2
print("Number of classes num_classes:{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314,shuffle=True)
result_dict_lgb = train_model_classification(X=train[columns],
                                             X_test=test[columns],
                                             y=train.label,
                                             params=lgb_params,
                                             num_classes=num_classes,
                                             folds=folds,
                                             model_type='lgb',
                                             eval_metric='logloss',
                                             plot_feature_importance=True,
                                             verbose=200,
                                             n_folds=n_fold
                                             )

Number of classes num_classes:2
Fold 1 started at Wed Oct  5 02:55:25 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.111209 training's auc: 0.989518  valid_1's binary_logloss: 0.153545  valid_1's auc: 0.972961
[400] training's binary_logloss: 0.0948993  training's auc: 0.993179  valid_1's binary_logloss: 0.142958  valid_1's auc: 0.976604
Early stopping, best iteration is:
[302] training's binary_logloss: 0.0949081  training's auc: 0.993177  valid_1's binary_logloss: 0.142938  valid_1's auc: 0.976624
[0.943994364212751]
[0.9766040709840477]
Fold 2 started at Wed Oct  5 02:55:26 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.112528 training's auc: 0.989363  valid_1's binary_logloss: 0.148536  valid_1's auc: 0.974635
[400] training's binary_logloss: 0.108259 training's auc: 0.990356  valid_1's binary_logloss: 0.145747  valid_1's auc: 0.975471
Early stopping, best iteration is:
[216] training's binary_logloss: 0.108885 training's auc: 0.990224  valid_1's binary_logloss: 0.145846  valid_1's auc: 0.975544
[0.943994364212751, 0.9468122578372666]
[0.9766040709840477, 0.9755442019729869]
Fold 3 started at Wed Oct  5 02:55:27 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.113221 training's auc: 0.989477  valid_1's binary_logloss: 0.135198  valid_1's auc: 0.981856
[400] training's binary_logloss: 0.0984195  training's auc: 0.992621  valid_1's binary_logloss: 0.12575 valid_1's auc: 0.984026
Early stopping, best iteration is:
[314] training's binary_logloss: 0.0984448  training's auc: 0.992615  valid_1's binary_logloss: 0.125713  valid_1's auc: 0.984046
[0.943994364212751, 0.9468122578372666, 0.9499823881648468]
[0.9766040709840477, 0.9755442019729869, 0.9840294951581199]
Fold 4 started at Wed Oct  5 02:55:29 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.113117 training's auc: 0.989493  valid_1's binary_logloss: 0.142585  valid_1's auc: 0.977954
[400] training's binary_logloss: 0.103629 training's auc: 0.991649  valid_1's binary_logloss: 0.137539  valid_1's auc: 0.979054
Early stopping, best iteration is:
[249] training's binary_logloss: 0.103959 training's auc: 0.991585  valid_1's binary_logloss: 0.137461  valid_1's auc: 0.979104
[0.943994364212751, 0.9468122578372666, 0.9499823881648468, 0.9461077844311377]
[0.9766040709840477, 0.9755442019729869, 0.9840294951581199, 0.9791042748050114]
Fold 5 started at Wed Oct  5 02:55:30 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.114539 training's auc: 0.989179  valid_1's binary_logloss: 0.150543  valid_1's auc: 0.972895
[400] training's binary_logloss: 0.0951711  training's auc: 0.99347 valid_1's binary_logloss: 0.141725  valid_1's auc: 0.975349
Early stopping, best iteration is:
[316] training's binary_logloss: 0.0961998  training's auc: 0.993262  valid_1's binary_logloss: 0.141733  valid_1's auc: 0.97543
[0.943994364212751, 0.9468122578372666, 0.9499823881648468, 0.9461077844311377, 0.9482212046495245]
[0.9766040709840477, 0.9755442019729869, 0.9840294951581199, 0.9791042748050114, 0.9754300622333343]
Fold 6 started at Wed Oct  5 02:55:31 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.113193 training's auc: 0.989564  valid_1's binary_logloss: 0.140258  valid_1's auc: 0.977863
[400] training's binary_logloss: 0.0941066  training's auc: 0.993575  valid_1's binary_logloss: 0.130028  valid_1's auc: 0.980503
Early stopping, best iteration is:
[330] training's binary_logloss: 0.0942212  training's auc: 0.993544  valid_1's binary_logloss: 0.129913  valid_1's auc: 0.980509
[0.943994364212751, 0.9468122578372666, 0.9499823881648468, 0.9461077844311377, 0.9482212046495245, 0.952448045086298]
[0.9766040709840477, 0.9755442019729869, 0.9840294951581199, 0.9791042748050114, 0.9754300622333343, 0.9804958863031711]
Fold 7 started at Wed Oct  5 02:55:32 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.114392 training's auc: 0.989254  valid_1's binary_logloss: 0.141363  valid_1's auc: 0.977499
[400] training's binary_logloss: 0.104329 training's auc: 0.991732  valid_1's binary_logloss: 0.137656  valid_1's auc: 0.978442
Early stopping, best iteration is:
[268] training's binary_logloss: 0.104419 training's auc: 0.991717  valid_1's binary_logloss: 0.137522  valid_1's auc: 0.978487
[0.943994364212751, 0.9468122578372666, 0.9499823881648468, 0.9461077844311377, 0.9482212046495245, 0.952448045086298, 0.9443466009158155]
[0.9766040709840477, 0.9755442019729869, 0.9840294951581199, 0.9791042748050114, 0.9754300622333343, 0.9804958863031711, 0.9784286383473592]
Fold 8 started at Wed Oct  5 02:55:34 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.112407 training's auc: 0.989382  valid_1's binary_logloss: 0.144779  valid_1's auc: 0.976164
[400] training's binary_logloss: 0.101891 training's auc: 0.991921  valid_1's binary_logloss: 0.139636  valid_1's auc: 0.977802
Early stopping, best iteration is:
[262] training's binary_logloss: 0.101891 training's auc: 0.991921  valid_1's binary_logloss: 0.139636  valid_1's auc: 0.977802
[0.943994364212751, 0.9468122578372666, 0.9499823881648468, 0.9461077844311377, 0.9482212046495245, 0.952448045086298, 0.9443466009158155, 0.9496301514617823]
[0.9766040709840477, 0.9755442019729869, 0.9840294951581199, 0.9791042748050114, 0.9754300622333343, 0.9804958863031711, 0.9784286383473592, 0.977801952943432]
Fold 9 started at Wed Oct  5 02:55:35 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.113429 training's auc: 0.989124  valid_1's binary_logloss: 0.146888  valid_1's auc: 0.974755
[400] training's binary_logloss: 0.101734 training's auc: 0.991863  valid_1's binary_logloss: 0.139987  valid_1's auc: 0.976694
Early stopping, best iteration is:
[283] training's binary_logloss: 0.10179  training's auc: 0.991847  valid_1's binary_logloss: 0.139981  valid_1's auc: 0.976687
[0.943994364212751, 0.9468122578372666, 0.9499823881648468, 0.9461077844311377, 0.9482212046495245, 0.952448045086298, 0.9443466009158155, 0.9496301514617823, 0.9496301514617823]
[0.9766040709840477, 0.9755442019729869, 0.9840294951581199, 0.9791042748050114, 0.9754300622333343, 0.9804958863031711, 0.9784286383473592, 0.977801952943432, 0.9766869412507302]
Fold 10 started at Wed Oct  5 02:55:36 2022
[LightGBM] [Warning] feature_fraction is set=0.7, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.7
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.113542 training's auc: 0.989398  valid_1's binary_logloss: 0.135824  valid_1's auc: 0.980073
[400] training's binary_logloss: 0.101569 training's auc: 0.992151  valid_1's binary_logloss: 0.128997  valid_1's auc: 0.98184
Early stopping, best iteration is:
[259] training's binary_logloss: 0.10162  training's auc: 0.992135  valid_1's binary_logloss: 0.128966  valid_1's auc: 0.981838
[0.943994364212751, 0.9468122578372666, 0.9499823881648468, 0.9461077844311377, 0.9482212046495245, 0.952448045086298, 0.9443466009158155, 0.9496301514617823, 0.9496301514617823, 0.9513742071881607]
[0.9766040709840477, 0.9755442019729869, 0.9840294951581199, 0.9791042748050114, 0.9754300622333343, 0.9804958863031711, 0.9784286383473592, 0.977801952943432, 0.9766869412507302, 0.9818377898309386]
CV mean score: 0.9786, std: 0.0027.
CV mean score: 0.9483, std: 0.0027.
train_model_classification cost time:12.621715545654297
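Both summary lines in the log are labelled "CV mean score", so it is easy to confuse them. Recomputing them from the two per-fold lists printed after fold 10 shows which is which: the first list holds accuracies and the second holds AUCs. The snippet below simply repeats that arithmetic on the values copied from the log; the variable names are illustrative.

```python
import numpy as np

# Per-fold values copied from the last two lists in the training log above.
acc_scores = [0.943994364212751, 0.9468122578372666, 0.9499823881648468,
              0.9461077844311377, 0.9482212046495245, 0.952448045086298,
              0.9443466009158155, 0.9496301514617823, 0.9496301514617823,
              0.9513742071881607]
auc_scores = [0.9766040709840477, 0.9755442019729869, 0.9840294951581199,
              0.9791042748050114, 0.9754300622333343, 0.9804958863031711,
              0.9784286383473592, 0.977801952943432, 0.9766869412507302,
              0.9818377898309386]

for name, values in (('AUC', auc_scores), ('accuracy', acc_scores)):
    # np.std defaults to the population standard deviation (ddof=0),
    # the same call the training function uses.
    print('{0}: mean {1:.4f}, std {2:.4f}'.format(name, np.mean(values), np.std(values)))
```

The AUC line corresponds to the first "CV mean score" in the log (0.9786) and the accuracy line to the second (0.9483).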

6. Saving the predictions

filenames=[]
for filename in test_pkl_files:
    filename=filename.split('/')[-1]
    filenames.append(filename)

acc_score = np.mean(result_dict_lgb['acc_scores'])
score = np.mean(result_dict_lgb['scores'])
print(score)
result=pd.DataFrame({'file_name':filenames,'score': result_dict_lgb['prediction'][:, 1]})  # the column names must be exactly these two
result.to_csv('lgb.csv', index=False)
# Save the probability files
pd.DataFrame(result_dict_lgb['oof']).to_csv('lgb_acc{}auc{}trainoof.csv'.format(acc_score, score),
                                            index=False, header=False)
pd.DataFrame(result_dict_lgb['prediction']).to_csv('lgb_acc{}auc{}testoof.csv'.format(acc_score, score), index=False, header=False)

0.978596331382913

IV. Submission

The submission result is shown below:

The planned next step is to ensemble LightGBM, XGBoost, and CatBoost models.
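One simple form such an ensemble could take is a weighted average of the positive-class probabilities from each model. The sketch below is an assumption about how the blending might be done, not the author's plan in detail: `blend_probabilities` is a hypothetical helper and the three score arrays are made-up values.

```python
import numpy as np


def blend_probabilities(predictions, weights=None):
    """Weighted average of positive-class probabilities from several models."""
    stacked = np.vstack([np.asarray(p, dtype=float) for p in predictions])
    if weights is None:
        weights = np.ones(len(predictions))
    weights = np.asarray(weights, dtype=float)
    # Normalise so the blended values stay inside [0, 1].
    weights = weights / weights.sum()
    return weights @ stacked


# Hypothetical scores for three test segments from three models.
lgb_scores = np.array([0.10, 0.80, 0.40])
xgb_scores = np.array([0.20, 0.70, 0.50])
cat_scores = np.array([0.30, 0.90, 0.30])

blended = blend_probabilities([lgb_scores, xgb_scores, cat_scores])
weighted = blend_probabilities([lgb_scores, xgb_scores, cat_scores],
                               weights=[2, 1, 1])
print(blended, weighted)
```

In practice the weights would typically be chosen from out-of-fold scores rather than by hand, so that the blend is not tuned on the test set.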
