Article Description
- Machine Learning in Practice: Industrial Steam Prediction (1) - Data analysis: examine correlations between variables and identify the key variables.
- Machine Learning in Practice: Industrial Steam Prediction (2) - Feature engineering: refine the data through outlier handling, normalization, and feature dimensionality reduction.
- Machine Learning in Practice: Industrial Steam Prediction (3) - Model training (covering mainstream ML models): decision trees, random forests, LightGBM, etc.
- Machine Learning in Practice: Industrial Steam Prediction (4) - Model validation: evaluation metrics, cross-validation, etc.
- Machine Learning in Practice: Industrial Steam Prediction (5) - Feature optimization: use LightGBM to optimize the features.
- Machine Learning in Practice: Industrial Steam Prediction (6) - Model fusion: stacking-based model fusion.
Background
- Background introduction
The basic principle of thermal power generation is this: burning fuel heats water to produce steam, the steam pressure drives a turbine, and the turbine in turn drives a generator to produce electricity. In this chain of energy conversions, the core factor affecting generation efficiency is the boiler's combustion efficiency, i.e., how effectively the fuel burns to heat water into high-temperature, high-pressure steam. Many factors influence the boiler's combustion efficiency, including adjustable boiler parameters such as fuel feed rate, primary and secondary air, induced draft, material-return air, and feed-water flow, as well as the boiler's operating conditions, such as bed temperature and pressure, furnace temperature and pressure, and superheater temperature.
- Task description
Given de-identified data collected from boiler sensors (sampled at minute-level frequency), predict the amount of steam produced from the boiler's operating conditions.
- Evaluation
Predictions are judged by mean squared error (MSE).
Data Description
The data is split into training data (train.txt) and test data (test.txt). The 38 fields "V0"-"V37" serve as the feature variables, and "target" is the target variable. Participants train a model on the training data, predict the target variable of the test data, and are ranked by the MSE (mean squared error) of their predictions.
Data Source
http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_test.txt
http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_train.txt
Hands-on Content
4. Model Validation
4.1 Model Evaluation Concepts and Regularization
4.1.1 Overfitting and Underfitting
Generate and plot the dataset
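The original post does not reproduce the data-generation code; a minimal sketch, assuming the usual synthetic quadratic dataset that the later snippets (x, X, y) rely on:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)                        # assumed seed, for reproducibility
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)                       # sklearn expects a 2-D feature matrix
y = 0.5 * x ** 2 + x + 2 + np.random.normal(0, 1, size=100)  # quadratic signal plus noise

plt.scatter(x, y)
plt.show()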
Fit the data with linear regression
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.score(X, y)
The R² score is about 0.495, which is low: a straight line fits this data poorly.
Use mean squared error to judge the goodness of fit
from sklearn.metrics import mean_squared_error

y_predict = lin_reg.predict(X)
mean_squared_error(y, y_predict)
Plot the fitted result
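The plotting snippet is missing here; a minimal sketch in the same style as the later plots (assuming x, y, and y_predict from above):

plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()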
4.1.2 Evaluation Metrics for Regression Models and How to Call Them
Fit with polynomial regression
- Wrap the steps in a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomialRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('lin_reg', LinearRegression())
    ])
Fit the data with the Pipeline: degree = 2
poly2_reg = PolynomialRegression(degree=2)
poly2_reg.fit(X, y)
y2_predict = poly2_reg.predict(X)
# compare the true values with the predictions via mean squared error
mean_squared_error(y, y2_predict)
Plot the fitted result
plt.scatter(x, y)
plt.plot(np.sort(x), y2_predict[np.argsort(x)], color='r')
plt.show()
Increase to degree = 10
poly10_reg = PolynomialRegression(degree=10)
poly10_reg.fit(X, y)
y10_predict = poly10_reg.predict(X)
mean_squared_error(y, y10_predict)
# 1.0508466763764164

plt.scatter(x, y)
plt.plot(np.sort(x), y10_predict[np.argsort(x)], color='r')
plt.show()
Increase to degree = 100
poly100_reg = PolynomialRegression(degree=100)
poly100_reg.fit(X, y)
y100_predict = poly100_reg.predict(X)
mean_squared_error(y, y100_predict)
# 0.6874357783433694

plt.scatter(x, y)
plt.plot(np.sort(x), y100_predict[np.argsort(x)], color='r')
plt.show()
Analysis
- degree=2: MSE is 1.0987392142417856;
- degree=10: MSE is 1.0508466763764164;
- degree=100: MSE is 0.6874357783433694;
- The larger the degree, the better the curve fits the training samples: since the sample points are fixed, we can always find a curve that passes through every one of them, driving the overall training MSE to 0. But this is precisely overfitting; such a curve no longer captures the underlying pattern.
- The red curve drawn here is not the fitted curve itself; it is merely the predicted y values at the existing data points connected in order. Where there are no data points, the connected segments deviate from the true fitted curve.
4.1.3 Cross-Validation
Cross-validation iterators
K-fold cross-validation: KFold divides all the samples into k groups of (if possible) equal size, called folds (if k = n, this is equivalent to the Leave One Out strategy). The prediction function is learned using k - 1 folds, and the remaining fold is used for testing.
Repeated K-fold: RepeatedKFold repeats K-Fold n times. It can be used when KFold needs to be run n times, producing different splits in each repetition.
Leave-one-out cross-validation: LeaveOneOut (or LOO) is a simple cross-validation. Each learning set is created by taking all samples except one, and the test set is the sample left out. Thus, for n samples, we have n different training sets and n different test sets. This cross-validation procedure does not waste much data, since only one sample is removed from each training set.
Leave-P-out cross-validation: LeavePOut is very similar to LeaveOneOut: it creates all possible training/test sets by removing p samples from the complete set. For n samples, this produces $\binom{n}{p}$ train-test pairs. Unlike LeaveOneOut and KFold, the test sets overlap when p > 1.
Random permutation splits: ShuffleSplit generates a user-defined number of independent train/test splits. Samples are first shuffled and then split into a pair of train and test sets.
Making each run generate the same random splits: setting an explicit random_state makes the results of the pseudo-random generator reproducible, as the sketch after this list shows.
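A minimal sketch of these basic iterators (an illustration added here, not from the original post; the toy array X is hypothetical):

import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(10)

# KFold: every sample lands in the test fold exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    print("KFold   train:", train_idx, "test:", test_idx)

# ShuffleSplit: three independent random 80/20 splits, reproducible via random_state
ss = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in ss.split(X):
    print("Shuffle train:", train_idx, "test:", test_idx)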
Stratified cross-validation iterators based on class labels
How do we handle class imbalance? Use stratified sampling with StratifiedKFold and StratifiedShuffleSplit. Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance, there may be several times more negative samples than positive ones. In such cases it is recommended to use the stratified sampling implemented in StratifiedKFold and StratifiedShuffleSplit, which ensures that the relative class frequencies are approximately preserved in each training and validation fold.
StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same proportion of samples of each class as the complete dataset.
StratifiedShuffleSplit is a variation of ShuffleSplit which returns stratified splits, i.e. it creates splits in which the proportion of each class matches that of the complete dataset, as in the sketch below.
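A minimal stratification sketch (an added illustration; the imbalanced labels are hypothetical):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# hypothetical imbalanced labels: 8 negatives, 2 positives
X = np.zeros((10, 1))
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X, y):
    # each test fold keeps the 4:1 negative/positive ratio
    print("test labels:", y[test_idx])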
Cross-validation iterators for grouped data
How can we further test a model's ability to generalize? Hold out a specific set of groups that belongs to neither the test set nor the training set. Sometimes we want to know whether a model trained on one particular set of groups generalizes well to unseen groups. To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.
GroupKFold is a variation of k-fold which ensures that the same group is represented in neither the test nor the training set. For example, if the data is obtained from different subjects with several samples per subject, and if the model is flexible enough to learn from highly subject-specific features, it may fail to generalize to new subjects. GroupKFold makes this kind of overfitting detectable.
LeaveOneGroupOut is a cross-validation scheme which holds out samples according to a third-party-provided array of integer groups. This group information can be used to encode arbitrary domain-specific pre-defined cross-validation folds.
Each training set is then made up of all the samples except those belonging to one specific group.
LeavePGroupsOut is similar to LeaveOneGroupOut, but removes the samples related to P groups for each training/test set.
GroupShuffleSplit is a combination of ShuffleSplit and LeavePGroupsOut: it generates a sequence of randomized partitions in which a subset of groups is held out for each split. A minimal GroupKFold sketch follows.
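An added illustration of group-aware splitting (the group labels are hypothetical, e.g. one id per subject):

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(8).reshape(-1, 1)
y = np.arange(8)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # no group ever appears in both the training and the test indices
    print("test groups:", groups[test_idx])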
Time series splits
TimeSeriesSplit is a variation of k-fold which returns the first k folds as the training set and the (k+1)-th fold as the test set. Note that, unlike standard cross-validation methods, successive training sets are supersets of the ones that come before them. It also adds all surplus data to the first training partition, which is always used to train the model. See the sketch below.
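An added illustration on six ordered samples:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # the training window only ever grows; the test fold always lies in the future
    print("train:", train_idx, "test:", test_idx)
# train: [0 1 2]     test: [3]
# train: [0 1 2 3]   test: [4]
# train: [0 1 2 3 4] test: [5]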
4.2 Grid Search
Grid Search is a hyperparameter-tuning technique based on exhaustive search: it loops over every candidate combination of parameter values, tries each one, and takes the best-performing combination as the final result. The principle is like finding the maximum value in an array.
4.2.1 A simple grid search
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print("Size of training set:{} size of testing set:{}".format(X_train.shape[0], X_test.shape[0]))

#### grid search start
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)  # train once for each possible parameter combination
        svm.fit(X_train, y_train)
        score = svm.score(X_test, y_test)
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}
#### grid search end

print("Best score:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
4.2.2 Grid Search with Cross Validation
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=1)
print("Size of training set:{} size of validation set:{} size of testing set:{}".format(
    X_train.shape[0], X_val.shape[0], X_test.shape[0]))

best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        score = svm.score(X_val, y_val)  # evaluate on the validation set, not the test set
        if score > best_score:
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}

svm = SVC(**best_parameters)       # rebuild the model with the best parameters
svm.fit(X_trainval, y_trainval)    # retrain on train + validation: more data usually helps
test_score = svm.score(X_test, y_test)  # final evaluation on the held-out test set
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Best score on test set:{:.2f}".format(test_score))
from sklearn.model_selection import cross_val_score

best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)  # 5-fold cross-validation
        score = scores.mean()  # average the fold scores
        if score > best_score:
            best_score = score
            best_parameters = {"gamma": gamma, "C": C}

svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))
Cross-validation is often combined with grid search as a way of evaluating parameters; the combination is called grid search with cross validation. sklearn provides the GridSearchCV class for exactly this purpose. It implements fit, predict, score, and other methods, so it can be used like any estimator. During a call to fit, it (1) searches for the best parameters and (2) fits a new estimator configured with those best parameters.
from sklearn.model_selection import GridSearchCV

# list the parameters to tune along with their candidate values
param_grid = {"gamma": [0.001, 0.01, 0.1, 1, 10, 100],
              "C": [0.001, 0.01, 0.1, 1, 10, 100]}
print("Parameters:{}".format(param_grid))

grid_search = GridSearchCV(SVC(), param_grid, cv=5)  # instantiate a GridSearchCV object
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=10)
grid_search.fit(X_train, y_train)  # find the best parameters, then refit a new SVC with them
print("Test set score:{:.2f}".format(grid_search.score(X_test, y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best score on train set:{:.2f}".format(grid_search.best_score_))
4.2.3 Learning Curves
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt
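The original stops at the function definition; a usage sketch in the spirit of the classic sklearn learning-curve example (load_digits, GaussianNB, SVC, and ShuffleSplit are already imported above; the parameter values are assumptions):

digits = load_digits()
X, y = digits.data, digits.target

# 100 random 80/20 splits smooth out the curves
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

plot_learning_curve(GaussianNB(), "Learning Curves (Naive Bayes)",
                    X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=1)
plot_learning_curve(SVC(gamma=0.001), "Learning Curves (SVM, RBF kernel)",
                    X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=1)
plt.show()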
4.2.4 Validation Curves
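This subsection has no code in the original. Where a learning curve varies the training-set size, a validation curve fixes it and sweeps a single hyperparameter instead; a minimal sketch with sklearn's validation_curve (the SVC/gamma example is an assumption, not the original's code):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

digits = load_digits()
param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(), digits.data, digits.target,
    param_name="gamma", param_range=param_range, cv=5)

plt.title("Validation Curve with SVM")
plt.xlabel("gamma")
plt.ylabel("Score")
plt.semilogx(param_range, np.mean(train_scores, axis=1), 'o-', color="r",
             label="Training score")
plt.semilogx(param_range, np.mean(test_scores, axis=1), 'o-', color="g",
             label="Cross-validation score")
plt.legend(loc="best")
plt.show()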
4.3 Model Validation for the Industrial Steam Competition
4.3.1 Overfitting and Underfitting
import pandas as pd
from sklearn.decomposition import PCA  # principal component analysis

# dimensionality reduction with PCA: keep 16 principal components
pca = PCA(n_components=16)
new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:, 0:-1])
new_test_pca_16 = pca.transform(test_data_scaler)
new_train_pca_16 = pd.DataFrame(new_train_pca_16)
new_test_pca_16 = pd.DataFrame(new_test_pca_16)
new_train_pca_16['target'] = train_data_scaler['target']
from sklearn.model_selection import train_test_split

# use the data with the 16 PCA features retained
new_train_pca_16 = new_train_pca_16.fillna(0)
train = new_train_pca_16[new_test_pca_16.columns]
target = new_train_pca_16['target']

# split the data: 80% for training, 20% for validation
train_data, test_data, train_target, test_target = train_test_split(
    train, target, test_size=0.2, random_state=0)
#### underfitting
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

clf = SGDRegressor(max_iter=500, tol=1e-2)
clf.fit(train_data, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data))
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("SGDRegressor train MSE: ", score_train)
print("SGDRegressor test MSE: ", score_test)
#### overfitting
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(5)  # degree-5 features make the model too flexible
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE: ", score_train)
print("SGDRegressor test MSE: ", score_test)
#### a good fit
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(3)  # degree-3 features strike a reasonable balance
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE: ", score_train)
print("SGDRegressor test MSE: ", score_test)
4.3.2 Model Regularization
L2-norm regularization
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3, penalty='l2', alpha=0.0001)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE: ", score_train)
print("SGDRegressor test MSE: ", score_test)
L1-norm regularization
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3, penalty='l1', alpha=0.00001)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE: ", score_train)
print("SGDRegressor test MSE: ", score_test)
ElasticNet: weighted combination of L1 and L2 regularization
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3, penalty='elasticnet', l1_ratio=0.9, alpha=0.00001)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE: ", score_train)
print("SGDRegressor test MSE: ", score_test)
4.3.3 Model Cross-Validation
Simple cross-validation (hold-out method)
# simple hold-out cross-validation
from sklearn.model_selection import train_test_split

# split the data: 80% for training, 20% for validation
train_data, test_data, train_target, test_target = train_test_split(
    train, target, test_size=0.2, random_state=0)

clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(train_data, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data))
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("SGDRegressor train MSE: ", score_train)
print("SGDRegressor test MSE: ", score_test)
K-fold cross-validation (K-fold CV)
# 5-fold cross-validation
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
for k, (train_index, test_index) in enumerate(kf.split(train)):
    train_data, test_data, train_target, test_target = (
        train.values[train_index], train.values[test_index],
        target[train_index], target[test_index])
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print(k, "fold", "SGDRegressor train MSE: ", score_train)
    print(k, "fold", "SGDRegressor test MSE: ", score_test, '\n')
Leave-one-out (LOO CV)
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for k, (train_index, test_index) in enumerate(loo.split(train)):
    train_data, test_data, train_target, test_target = (
        train.values[train_index], train.values[test_index],
        target[train_index], target[test_index])
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print(k, "sample", "SGDRegressor train MSE: ", score_train)
    print(k, "sample", "SGDRegressor test MSE: ", score_test, '\n')
    if k >= 9:
        break  # leave-one-out is expensive; only demonstrate the first 10 splits
Leave-P-out (LPO CV)
from sklearn.model_selection import LeavePOut

lpo = LeavePOut(p=10)
for k, (train_index, test_index) in enumerate(lpo.split(train)):
    train_data, test_data, train_target, test_target = (
        train.values[train_index], train.values[test_index],
        target[train_index], target[test_index])
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print(k, "(10 held out)", "SGDRegressor train MSE: ", score_train)
    print(k, "(10 held out)", "SGDRegressor test MSE: ", score_test, '\n')
    if k >= 9:
        break  # the number of leave-P-out combinations explodes; only show the first 10
4.3.4 Hyperparameter Space and Tuning
Exhaustive grid search
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# split the data: 80% for training, 20% for validation
train_data, test_data, train_target, test_target = train_test_split(
    train, target, test_size=0.2, random_state=0)

randomForestRegressor = RandomForestRegressor()
parameters = {
    'n_estimators': [50, 100, 200],
    'max_depth': [1, 2, 3]
}
clf = GridSearchCV(randomForestRegressor, parameters, cv=5)
clf.fit(train_data, train_target)

score_test = mean_squared_error(test_target, clf.predict(test_data))
print("RandomForestRegressor GridSearchCV test MSE: ", score_test)
sorted(clf.cv_results_.keys())
Randomized parameter optimization
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# split the data: 80% for training, 20% for validation
train_data, test_data, train_target, test_target = train_test_split(
    train, target, test_size=0.2, random_state=0)

randomForestRegressor = RandomForestRegressor()
parameters = {
    'n_estimators': [10, 50],
    'max_depth': [1, 2, 5]
}
clf = RandomizedSearchCV(randomForestRegressor, parameters, cv=5)
clf.fit(train_data, train_target)

score_test = mean_squared_error(test_target, clf.predict(test_data))
print("RandomForestRegressor RandomizedSearchCV test MSE: ", score_test)
sorted(clf.cv_results_.keys())
LightGBM tuning
import lightgbm as lgb

clf = lgb.LGBMRegressor(num_leaves=21)  # the LightGBM default is num_leaves=31
parameters = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [20, 40]
}
clf = GridSearchCV(clf, parameters, cv=5)
clf.fit(train_data, train_target)

print('Best parameters found by grid search are:', clf.best_params_)
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("LGBMRegressor GridSearchCV test MSE: ", score_test)
LightGBM offline validation
train_data2 = pd.read_csv('./zhengqi_train.txt', sep='\t')
test_data2 = pd.read_csv('./zhengqi_test.txt', sep='\t')

train_data2_f = train_data2[test_data2.columns].values
train_data2_target = train_data2['target'].values
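The original stops after loading the raw data; a minimal sketch of the offline validation it sets up, scoring an LGBMRegressor with 5-fold CV (the model parameters and random_state are illustrative assumptions):

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=5, shuffle=True, random_state=2019)
mse_scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(train_data2_f)):
    X_tr, X_val = train_data2_f[train_idx], train_data2_f[val_idx]
    y_tr, y_val = train_data2_target[train_idx], train_data2_target[val_idx]
    model = lgb.LGBMRegressor(learning_rate=0.05, n_estimators=200)  # assumed values
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_val, model.predict(X_val))
    mse_scores.append(mse)
    print("fold", fold, "offline validation MSE:", mse)
print("mean offline MSE:", np.mean(mse_scores))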
4.3.5 Learning Curves and Validation Curves
Learning curve
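No code accompanies this heading in the original; a minimal sketch that reuses plot_learning_curve from 4.2.3 on the steam data (the SGDRegressor settings and the cv choice are assumptions):

from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import ShuffleSplit

X = train_data.values          # PCA features from 4.3.1
y = train_target.values

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
plot_learning_curve(SGDRegressor(max_iter=1000, tol=1e-3),
                    "Learning Curve (SGDRegressor)", X, y, cv=cv, n_jobs=1)
plt.show()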
Validation curve
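Similarly, a hedged sketch of a validation curve sweeping the regularization strength alpha of SGDRegressor on the same data (the parameter range is an assumption):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve

X = train_data.values
y = train_target.values

param_range = [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]
train_scores, test_scores = validation_curve(
    SGDRegressor(max_iter=1000, tol=1e-3, penalty='l2'), X, y,
    param_name='alpha', param_range=param_range, cv=10)

plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.semilogx(param_range, np.mean(train_scores, axis=1), 'o-', color='r',
             label='Training score')
plt.semilogx(param_range, np.mean(test_scores, axis=1), 'o-', color='g',
             label='Cross-validation score')
plt.legend(loc='best')
plt.show()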