③ Machine Learning Classification Algorithms: Random Forest (Ensemble Learning)

2022-07-07


Summary: Random forest as a classification algorithm (ensemble learning).

Tuning min_samples_split

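# Tuning min_samples_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Assumes X_train, X_test, y_train, y_test come from an earlier train_test_split
split_range = range(2, 20)
scorel = []
for i in split_range:
    RFC = RandomForestClassifier(max_depth=20, n_estimators=51, min_samples_leaf=1, min_samples_split=i,
                                 n_jobs=-1,
                                 random_state=90).fit(X_train, y_train)
    score = RFC.score(X_test, y_test)  # accuracy on the held-out test set
    scorel.append(score)
# Best test accuracy and the min_samples_split value that produced it
print(max(scorel), split_range[scorel.index(max(scorel))])
plt.figure(figsize=[20, 5])
plt.plot(split_range, scorel)
plt.show()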
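
A single train/test split can make a sweep like this noisy. As a hedged alternative, the sketch below repeats the same idea with cross-validated accuracy. The synthetic data from `make_classification`, the coarser grid, and the names `X_demo`, `y_demo`, `split_values` and `cv_means` are all assumptions for illustration, not the article's dataset or settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the article's data (assumption: binary classification)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

split_values = list(range(2, 20, 4))  # coarser than range(2, 20) to keep the demo fast
cv_means = []
for m in split_values:
    clf = RandomForestClassifier(n_estimators=25, min_samples_split=m, random_state=90)
    # Mean accuracy over 5 folds instead of a single train/test split
    cv_means.append(cross_val_score(clf, X_demo, y_demo, cv=5).mean())

best_split = split_values[int(np.argmax(cv_means))]
print(best_split, max(cv_means))
```

If the cross-validated means differ by less than their fold-to-fold spread, the parameter probably has little effect on the data at hand.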

Tuning max_features and other parameters

## Tuning max_features
import numpy as np
from sklearn.model_selection import GridSearchCV

# 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
param_grid = {'max_features': ['sqrt', 'log2']}
RFC = RandomForestClassifier(max_depth=20, n_estimators=51, min_samples_leaf=1, min_samples_split=2)
GS = GridSearchCV(RFC, param_grid, cv=10)
GS.fit(X, y)
print(GS.best_params_)  # the best max_features setting was log2
print(GS.best_score_)
## Tuning criterion
param_grid = {'criterion': ['gini', 'entropy']}
RFC = RandomForestClassifier(max_depth=20, n_estimators=51, min_samples_leaf=1, min_samples_split=2, max_features='log2')
GS = GridSearchCV(RFC, param_grid, cv=10)
GS.fit(X, y)
print(GS.best_params_)
print(GS.best_score_)
## Tuning min_samples_leaf
param_grid = {'min_samples_leaf': np.arange(1, 11, 1)}
RFC = RandomForestClassifier(max_depth=20, n_estimators=51, min_samples_leaf=1, min_samples_split=2, max_features='log2', criterion='gini')
GS = GridSearchCV(RFC, param_grid, cv=10)
GS.fit(X, y)
print(GS.best_params_)
print(GS.best_score_)
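## Re-checking min_samples_split with the tuned max_features and criterion
# Assumption: the loop variable i is meant to drive min_samples_split, as in the
# earlier sweep; with a fixed value every iteration would fit the same model
split_range = range(2, 20)
scorel = []
for i in split_range:
    RFC = RandomForestClassifier(max_depth=20, n_estimators=51, min_samples_leaf=1, min_samples_split=i, max_features='log2', criterion='gini',
                                 n_jobs=-1,
                                 random_state=90).fit(X_train, y_train)
    score = RFC.score(X_test, y_test)
    scorel.append(score)
# Best test accuracy and the min_samples_split value that produced it
print(max(scorel), split_range[scorel.index(max(scorel))])
plt.figure(figsize=[20, 5])
plt.plot(split_range, scorel)
plt.show()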
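
Tuning one parameter at a time ignores interactions between parameters. As a rough sketch of a middle ground, the block below searches a deliberately small joint grid and then reads the per-candidate scores from `cv_results_`. The synthetic data, the grid values, and the names `small_grid`, `search` and `results` are assumptions chosen to keep the example quick, not the article's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the article's data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# 2 * 2 * 3 = 12 candidates, small enough to run in seconds
small_grid = {
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=90),
    small_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per candidate, with its mean cross-validated score and rank
results = pd.DataFrame(search.cv_results_)
cols = ["param_max_features", "param_criterion", "param_min_samples_leaf",
        "mean_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
print(search.best_params_)
```

Looking at the whole table rather than only `best_params_` shows whether the top candidates are really distinguishable from each other.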

Grid search (requires a powerful machine)

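# Hyperparameter grid for the random forest
param_rf = {
    'n_estimators': list(range(3, 100, 1)),
    'max_depth': list(range(3, 30, 1)),
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
    # Assumption: sweep min_samples_split over the same 2..19 range used earlier (it must be >= 2)
    'min_samples_split': list(range(2, 20)),
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': list(range(1, 11))
}
# Base estimator to tune (assumed here, since `model` is not defined before this point)
model = RandomForestClassifier(random_state=90)
gsearch = GridSearchCV(model, param_grid=param_rf)
gsearch.fit(X_train, y_train)
print(gsearch.best_params_)
print(gsearch.best_score_)
best_ = gsearch.best_estimator_
print(best_)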
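
To get a feel for why a search of this size is so expensive, the following sketch only counts the candidates without fitting anything. The grid is an assumption modeled on the ranges above (with `min_samples_split` as one of the keys and without `'auto'`); `full_grid`, `n_candidates` and `n_folds` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

full_grid = {
    "n_estimators": list(range(3, 100)),      # 97 values
    "max_depth": list(range(3, 30)),          # 27 values
    "max_features": ["sqrt", "log2"],         # 2 values
    "min_samples_split": list(range(2, 20)),  # 18 values
    "criterion": ["gini", "entropy"],         # 2 values
    "min_samples_leaf": list(range(1, 11)),   # 10 values
}

# ParameterGrid enumerates the Cartesian product of all value lists
n_candidates = len(ParameterGrid(full_grid))
n_folds = 5  # GridSearchCV's default when cv is not given

print(n_candidates)            # 1885680
print(n_candidates * n_folds)  # 9428400 model fits in total
```

With several million forests to fit, narrowing each range first (for example from the one-parameter sweeps) is usually the practical choice.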

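After the manual tuning above, I wanted to try a full grid search, but my machine could not finish it; the cause turned out to be insufficient memory. I also looked into this further: a random forest generally does not need cross-validation, because the out-of-bag samples left out of each bootstrap draw already provide a validation estimate. Running k-fold cross-validation over a grid this large is very demanding on CPU and memory (scikit-learn's random forest does not use the GPU). If you think your machine is up to it, feel free to try this code.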
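
As a minimal sketch of the out-of-bag idea mentioned here, scikit-learn exposes it through the `oob_score` parameter of `RandomForestClassifier`. The synthetic data and the name `forest_oob` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the article's data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

forest_oob = RandomForestClassifier(
    n_estimators=51,
    oob_score=True,   # score each sample using only the trees that did not train on it
    bootstrap=True,   # OOB estimates need bootstrap sampling (this is the default)
    random_state=90,
)
forest_oob.fit(X_demo, y_demo)

# Accuracy estimated from out-of-bag samples, obtained from a single fit
print(forest_oob.oob_score_)
```

Because the estimate comes from one fit, it costs far less than k-fold cross-validation, although the two numbers are not guaranteed to agree exactly.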

Feature selection with random forest (visualization)

import seaborn as sns

feat_labels = df.columns[:-1]
# n_jobs: int, optional (default None, meaning 1). Number of jobs to run in parallel
# for fit and predict; -1 uses all available cores
forest = RandomForestClassifier(max_depth=20, n_estimators=51, min_samples_leaf=1, min_samples_split=2, max_features='log2', criterion='gini',
                                random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
labe_name = []
imports = []
# Evaluate feature importance on the trained forest
# feature_importances_ holds the importance of each feature
importances = forest.feature_importances_
print("Importances:", importances)
# np.argsort sorts in ascending order, so [::-1] reverses the indices
# to list the most important feature first
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
    labe_name.append(feat_labels[indices[f]])
    imports.append(importances[indices[f]])
# # Feature selection (keep the features with higher importance)
# threshold = 0.03
# x_selected = X_train.loc[:, importances > threshold]
plt.figure(figsize=(10, 8))
sns.barplot(x=imports, y=labe_name, orient='h')
plt.show()

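Although the features selected here did not improve our random forest model, you can use the random forest's feature ranking to select inputs for other models, such as a support vector machine.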
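
One way this hand-off to another model could look is sketched below with scikit-learn's `SelectFromModel`, which keeps the features whose importance exceeds a threshold, followed by an SVM. The synthetic data, the scaling step, and the names `selector` and `svm_pipeline` are assumptions; only the 0.03 cutoff is borrowed from the commented-out line in the code above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 20 features, only a few of them informative
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1)

# Keep features whose random forest importance is at least 0.03
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=51, random_state=0),
    threshold=0.03,
)
# Selection, scaling and the SVM are fitted together on the training split only
svm_pipeline = make_pipeline(selector, StandardScaler(), SVC())
svm_pipeline.fit(X_tr, y_tr)

n_kept = int(svm_pipeline[0].get_support().sum())
svm_accuracy = svm_pipeline.score(X_te, y_te)
print(n_kept, svm_accuracy)
```

Putting the selector inside the pipeline keeps the test split out of the feature selection step, so the reported accuracy is not inflated by leakage.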

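Conclusion: on this dataset, tuning n_estimators alone is enough for the random forest to reach its best result. I tried optimizing every parameter with learning curves, but the results were about the same as with the default values. Using the random forest for feature selection and feeding only the high-importance features back into the model did not work well either. Finally, I noticed one more issue: because a random forest relies on random sampling (bootstrap sampling with replacement), the order of the features also affects the model's results. This showed up when we arranged the features in order of importance.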
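
The feature-order effect can be probed with a small experiment. A plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is that under a fixed `random_state` the candidate features at each split are drawn by column position, so reordering the columns changes which features each split considers. The synthetic data and the names `perm`, `proba_original` and `proba_permuted` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the article's data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

rng = np.random.default_rng(0)
perm = rng.permutation(X_demo.shape[1])  # a hypothetical reordering of the columns

params = dict(n_estimators=51, max_features="sqrt", random_state=90)

# Same data, same seed; the only difference is the column order
proba_original = (RandomForestClassifier(**params)
                  .fit(X_demo, y_demo)
                  .predict_proba(X_demo))
proba_permuted = (RandomForestClassifier(**params)
                  .fit(X_demo[:, perm], y_demo)
                  .predict_proba(X_demo[:, perm]))

# Largest change in any predicted probability caused by column order alone
max_diff = float(np.abs(proba_original - proba_permuted).max())
print(max_diff)
```

A nonzero difference indicates that the two forests are not identical; averaging scores over several seeds is one way to tell such order effects apart from real improvements.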

Best model code

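# Final model and code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load and split the data
df = pd.read_csv(r"\数据.csv")
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
# Build the model
model = RandomForestClassifier(n_estimators=51, max_depth=20, max_features='sqrt', criterion='gini', n_jobs=-1, random_state=90)
# Train the model
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)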
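# Evaluation metrics
# Count the predictions that match the true labels
true = np.sum(y_pred == y_test)
print('Number of correct predictions:', true)
print('Number of incorrect predictions:', y_test.shape[0] - true)
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             cohen_kappa_score, classification_report)
print('Accuracy on the test data: {:.4}%'.format(accuracy_score(y_test, y_pred) * 100))
# precision, recall and F1 use the default pos_label=1; pass pos_label explicitly
# if the positive class in your data has a different label
print('Precision on the test data: {:.4}%'.format(
      precision_score(y_test, y_pred) * 100))
print('Recall on the test data: {:.4}%'.format(
      recall_score(y_test, y_pred) * 100))
# print("F1 score on the training data:", f1score_train)
print('F1 score on the test data:',
      f1_score(y_test, y_pred))
print("Cohen's Kappa on the test data:",
      cohen_kappa_score(y_test, y_pred))
# Print the classification report
print('Classification report for the test data:', '\n',
      classification_report(y_test, y_pred))
# This line would not run for me in Jupyter Notebook (out of memory); a local PyCharm run may be needed, and even that did not seem to work on my machine
# score_pre = cross_val_score(model, X, y, cv=10).mean()  # cross-validating on all the data gave a score worse than 0.976
# print("Mean ten-fold cross-validation score: {}".format(score_pre))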
# ROC curve and AUC
from sklearn import metrics
# Predicted probability of the positive class
# predict_proba returns one column per class in model.classes_ order; column 1 is the second class
y_pred_prob = model.predict_proba(X_test)[:, 1]
# https://blog.csdn.net/dream6104/article/details/89218239
# pos_label is the label treated as the positive class; model.classes_[1] matches
# column 1 above whether the target labels are 0/1 or 1/2
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob, pos_label=model.classes_[1])
roc_auc = metrics.auc(fpr, tpr)  # AUC is the area under the ROC curve
# print(roc_auc)
plt.figure(figsize=(8, 6))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.plot(fpr, tpr, 'r', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
# plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1.1])
plt.ylim([0, 1.1])
plt.xlabel('False Positive Rate')  # x-axis is fpr
plt.ylabel('True Positive Rate')   # y-axis is tpr
plt.title('Receiver operating characteristic example')
plt.show()

The results are quite good, and every evaluation metric looks favorable.

Closing thoughts:

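Random forest generally performs well. As for parameter tuning, try several variants: whether you use grid search, a custom iterative sweep, or prior knowledge, step outside conventional thinking and compare and test more. The result that survives this process is the one you actually need. Do not shy away from the effort; good models are tuned step by step.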
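
For readers whose machine cannot complete an exhaustive grid, one more variant worth comparing is a randomized search, which samples a fixed number of candidates from the same ranges. This swaps in scikit-learn's `RandomizedSearchCV`, which the article itself does not use; the synthetic data, `n_iter=10`, and the names `param_distributions` and `random_search` are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the article's data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# randint(low, high) samples integers in [low, high), like range(low, high)
param_distributions = {
    "n_estimators": randint(3, 100),
    "max_depth": randint(3, 30),
    "max_features": ["sqrt", "log2"],
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 11),
    "criterion": ["gini", "entropy"],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=90),
    param_distributions,
    n_iter=10,       # number of sampled candidates, fixed regardless of grid size
    cv=3,
    random_state=0,  # makes the sampled candidates reproducible
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_, random_search.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of the ranges, so the search budget can be matched to the hardware.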
