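10 Ensemble Learning

10.1 Random Forest Algorithm (RandomForest)

10.1.1 Concepts

In 2001, Breiman combined classification trees into a random forest (Breiman 2001a): the variables (columns) and the data (rows) used are both randomized, many classification trees are grown, and their results are then aggregated. Random forests improve prediction accuracy without a significant increase in computation.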

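Algorithm flow:

Given: the number of decision trees to build t, the number of features used by a single tree f, and a dataset with m samples and n features.

1 Training a single decision tree

1.1 Draw m times from the original dataset with replacement to obtain a dataset of m samples (duplicates are possible)

1.2 From the n features, select f features by sampling without replacement and use them as the input features

1.3 Build a decision tree on the new dataset (m samples, f features)

1.4 Repeat the steps above t times to build t decision trees

2 Prediction of the random forest

Once the t decision trees are built, the predictions of the individual trees are combined for each new test example to give the prediction of the random forest.

Regression: the forest's prediction is the mean of the predictions of the t decision trees

Classification: majority vote; the forest's prediction is the class predicted by the largest number of trees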
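The steps above can be sketched in a few lines. This is a minimal illustration of the procedure as described in this section, not scikit-learn's implementation (scikit-learn draws the candidate features at every split rather than once per tree). The helper names `fit_simple_forest` and `predict_simple_forest` are hypothetical, and the iris dataset is used only as an example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_simple_forest(X, y, t=10, f=2, seed=0):
    """Train t trees, each on a bootstrap sample and f randomly chosen features."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    forest = []
    for _ in range(t):                                # step 1.4: repeat t times
        rows = rng.integers(0, m, size=m)             # step 1.1: m draws with replacement
        cols = rng.choice(n, size=f, replace=False)   # step 1.2: f features without replacement
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[rows][:, cols], y[rows])           # step 1.3: fit on m samples, f features
        forest.append((tree, cols))
    return forest


def predict_simple_forest(forest, X):
    """Classification: majority vote over the trees (labels must be non-negative ints)."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest]).astype(int)
    # votes has shape (t, n_samples); pick the most frequent label per sample
    return np.array([np.bincount(sample_votes).argmax() for sample_votes in votes.T])


X, y = load_iris(return_X_y=True)
forest = fit_simple_forest(X, y, t=10, f=2, seed=0)
pred = predict_simple_forest(forest, X)
print("training accuracy: {:.2%}".format((pred == y).mean()))
```

For a regression forest, the vote in `predict_simple_forest` would be replaced by the mean of the trees' predictions.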

In sklearn, RandomForestClassifier and RandomForestRegressor implement the random forest algorithm for classification and regression, respectively.

10.1.2 Random Forest Classification

Class parameters, attributes, and methods

Class

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

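Parameters

Parameter	Type	Description
n_estimators	int, default=100	The number of trees in the forest.
random_state	int, RandomState instance or None, default=None	Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features).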
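As a small illustration of these two parameters, the sketch below fits two forests with the same `n_estimators` and `random_state` and checks that they produce identical predictions. The iris dataset and the variable names are chosen only for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Same seed: the bootstrap samples and feature draws are repeated exactly
a = RandomForestClassifier(n_estimators=5, random_state=2).fit(X, y)
b = RandomForestClassifier(n_estimators=5, random_state=2).fit(X, y)

same = np.array_equal(a.predict_proba(X), b.predict_proba(X))
print("trees in the forest:", len(a.estimators_))
print("identical predictions with the same random_state:", same)
```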

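Attributes

Attribute	Type	Description
base_estimator_	DecisionTreeClassifier	The child estimator template used to create the collection of fitted sub-estimators.
estimators_	list of DecisionTreeClassifier	The collection of fitted sub-estimators.
classes_	ndarray of shape (n_classes,) or a list of such arrays	The class labels (single-output problem), or a list of arrays of class labels (multi-output problem).
n_classes_	int or list	The number of classes (single-output problem), or a list containing the number of classes for each output (multi-output problem).
n_features_	int	The number of features when fit is performed.
n_outputs_	int	The number of outputs when fit is performed.
feature_importances_	ndarray of shape (n_features,)	The impurity-based feature importances.
oob_score_	float	Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
oob_decision_function_	ndarray of shape (n_samples, n_classes)	Decision function computed with the out-of-bag estimate on the training set. If n_estimators is small, a data point may never have been left out during the bootstrap; in that case oob_decision_function_ may contain NaN. This attribute exists only when oob_score is True.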
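The following sketch reads several of these attributes from a fitted forest. It uses the wine dataset as an arbitrary example and sets `oob_score=True` so that the out-of-bag attributes exist. It deliberately avoids `n_features_` and `base_estimator_`, which were renamed in newer scikit-learn releases (to `n_features_in_` and `estimator_`) and may be missing depending on the installed version.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# oob_score=True is required for oob_score_ and oob_decision_function_
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

print("classes_:", clf.classes_)
print("n_classes_:", clf.n_classes_, " n_outputs_:", clf.n_outputs_)
print("number of fitted trees:", len(clf.estimators_))
print("feature_importances_ shape:", clf.feature_importances_.shape)
print("feature_importances_ sum:", clf.feature_importances_.sum())
print("oob_score_: {:.2%}".format(clf.oob_score_))
print("oob_decision_function_ shape:", clf.oob_decision_function_.shape)
```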

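Methods

Method	Description
apply(X)	Apply the trees in the forest to X and return the leaf indices.
decision_path(X)	Return the decision path in the forest.
fit(X, y[, sample_weight])	Build a forest of trees from the training set (X, y).
get_params([deep])	Get the parameters of this estimator.
predict(X)	Predict the class of X.
predict_log_proba(X)	Predict the class log-probabilities of X.
predict_proba(X)	Predict the class probabilities of X.
score(X, y[, sample_weight])	Return the mean accuracy on the given test data and labels.
set_params(**params)	Set the parameters of this estimator.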
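A short sketch of how the main methods fit together, again on the iris dataset as an assumed example; the shapes in the comments follow from the method descriptions above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)                           # build the forest

labels = clf.predict(X_test)                        # one class label per sample
proba = clf.predict_proba(X_test)                   # (n_samples, n_classes), rows sum to 1
leaves = clf.apply(X_test)                          # (n_samples, n_estimators) leaf indices
indicator, n_nodes_ptr = clf.decision_path(X_test)  # sparse node indicator + per-tree offsets
accuracy = clf.score(X_test, y_test)                # mean accuracy

print(labels[:5])
print(proba.shape, leaves.shape, indicator.shape, n_nodes_ptr.shape)
print("test accuracy: {:.2%}".format(accuracy))
print(clf.get_params()["n_estimators"])
```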

Random forest classification parameters

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# util is the author's own helper class (it provides print_scores and show_pic); its import is not part of this listing

def base_of_decision_tree_forest(n_estimator, random_state, X, y, title):
    myutil = util()
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    # n_jobs: number of parallel jobs (set it to the number of CPUs)
    clf = RandomForestClassifier(n_estimators=n_estimator, random_state=random_state, n_jobs=2)
    # Fit on the training set
    clf.fit(X_train, y_train)
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
    # Use the two features of the samples as the horizontal and vertical axes
    x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
    y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                         np.arange(y_min, y_max, .02))
    # Predict the class of every grid point so that each region gets its own colour
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
    # Draw the samples as a scatter plot
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=20, edgecolors='k')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    title = title + " data, random forest scores (n_estimators:" + str(n_estimator) + ",random_state:" + str(random_state) + ")"
    myutil.print_scores(clf, X_train, y_train, X_test, y_test, title)

def tree_forest():
    myutil = util()
    title = ["Iris", "Wine", "Breast cancer"]
    j = 0
    for datas in [datasets.load_iris(), datasets.load_wine(), datasets.load_breast_cancer()]:
        # One 4x4 grid of subplots per dataset
        figure, axes = plt.subplots(4, 4, figsize=(100, 10))
        plt.subplots_adjust(hspace=0.95)
        i = 0
        # Use only the first two features
        X = datas.data[:, :2]
        y = datas.target
        mytitle = title[j]
        for n_estimator in range(4, 8):
            for random_state in range(2, 6):
                plt.subplot(4, 4, i + 1)
                plt.title("n_estimator:" + str(n_estimator) + "random_state:" + str(random_state))
                plt.suptitle(u"Random forest classification")
                base_of_decision_tree_forest(n_estimator, random_state, X, y, mytitle)
                i = i + 1
        myutil.show_pic(mytitle)
        j = j + 1

Output (training-set and test-set accuracy for each combination of n_estimators and random_state)

Iris

	n_estimators=4		n_estimators=5		n_estimators=6		n_estimators=7
random_state	Training	Test	Training	Test	Training	Test	Training	Test
2	89.29%	71.05%	91.07%	71.05%	91.07%	78.95%	91.96%	76.32%
3	91.07%	68.42%	89.29%	76.32%	94.64%	60.53%	91.07%	76.32%
4	91.07%	71.05%	93.75%	78.95%	93.75%	68.42%	91.07%	71.05%
5	90.18%	68.42%	92.86%	78.95%	91.96%	60.53%	92.86%	76.32%

Wine

	n_estimators=4		n_estimators=5		n_estimators=6		n_estimators=7
random_state	Training	Test	Training	Test	Training	Test	Training	Test
2	96.24%	80.00%	98.50%	68.89%	96.99%	82.22%	97.74%	82.22%
3	92.48%	86.67%	97.74%	80.00%	96.24%	73.33%	98.50%	68.89%
4	95.49%	75.56%	93.98%	84.44%	96.99%	84.44%	98.50%	91.11%
5	96.24%	68.89%	96.99%	82.22%	96.99%	71.11%	98.50%	75.56%

Breast cancer

	n_estimators=4		n_estimators=5		n_estimators=6		n_estimators=7
random_state	Training	Test	Training	Test	Training	Test	Training	Test
2	97.18%	81.82%	97.65%	90.91%	98.83%	89.51%	99.30%	85.31%
3	97.89%	85.31%	97.65%	80.42%	97.89%	86.71%	98.59%	84.62%
4	97.42%	82.52%	97.89%	88.11%	97.42%	89.51%	97.89%	90.91%
5	97.18%	87.41%	98.83%	88.11%	98.12%	88.81%	98.83%	88.81%

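import matplotlib.pyplot as plt
import mglearn
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def my_RandomForet():
    # Generate a two-dimensional toy dataset
    X, y = datasets.make_moons(n_samples=100, noise=0.25, random_state=3)
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
    # Initialise a random forest classifier with 5 decision trees
    forest = RandomForestClassifier(n_estimators=5, random_state=2)
    # Fit on the training set
    forest.fit(X_train, y_train)
    # Visualise the decision boundary of each decision tree
    fig, axes = plt.subplots(2, 3, figsize=(20, 10))
    for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
        ax.set_title('Tree {}'.format(i))
        mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)
        print("Decision tree " + str(i) + " training score: {:.2%}".format(tree.score(X_train, y_train)))
        print("Decision tree " + str(i) + " test score: {:.2%}".format(tree.score(X_test, y_test)))
    # Visualise the decision boundary of the ensemble (once, after the loop over the trees)
    print("Random forest training score: {:.2%}".format(forest.score(X_train, y_train)))
    print("Random forest test score: {:.2%}".format(forest.score(X_test, y_test)))
    mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1, -1], alpha=0.4)
    axes[-1, -1].set_title('Random Forest')
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
    plt.show()

Decision tree 0 training score: 89.33%
Decision tree 0 test score: 84.00%
Decision tree 1 training score: 96.00%
Decision tree 1 test score: 88.00%
Decision tree 2 training score: 97.33%
Decision tree 2 test score: 80.00%
Decision tree 3 training score: 89.33%
Decision tree 3 test score: 92.00%
Decision tree 4 training score: 92.00%
Decision tree 4 test score: 88.00%
Random forest training score: 96.00%
Random forest test score: 92.00%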

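Decision tree 3 shows no overfitting, and decision tree 4 has the same train-test gap as the random forest, yet the random forest scores at least as high as both of them on the training set and on the test set.

Model	Training set	Test set	Gap (percentage points, rounded)
Random forest	96.00%	92.00%	4
Decision tree 0	89.33%	84.00%	5
Decision tree 1	96.00%	88.00%	8
Decision tree 2	97.33%	80.00%	17
Decision tree 3	89.33%	92.00%	-3
Decision tree 4	92.00%	88.00%	4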
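The gap column can be recomputed from the printed scores. The sketch below simply hard-codes the percentages reported above (the dictionary `scores` is only an illustration) and shows the unrounded train-test gaps.

```python
# Training and test scores (in percent) copied from the output above
scores = {
    "Random forest": (96.00, 92.00),
    "Decision tree 0": (89.33, 84.00),
    "Decision tree 1": (96.00, 88.00),
    "Decision tree 2": (97.33, 80.00),
    "Decision tree 3": (89.33, 92.00),
    "Decision tree 4": (92.00, 88.00),
}

# Gap = training score minus test score, in percentage points
gaps = {name: round(train - test, 2) for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print("{:<16}{:+.2f}".format(name, gap))
```

A large positive gap (decision tree 2) points to overfitting; a negative gap (decision tree 3) means the tree happened to score better on the test set than on the training set.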

Random forest classification parameters: scatter-plot analysis example

Download the adult.data file from http://archive.ics.uci.edu/ml/machine-learning-databases/adult/ and rename it to adult.csv.

import pandas as pd

def income_forecast():
    data = pd.read_csv('adult.csv', header=None, index_col=False,
                       names=['age', 'employer_type', 'weight', 'education', 'years_of_education',
                              'marital_status', 'occupation', 'family_status', 'race', 'sex',
                              'capital_gain', 'capital_loss', 'weekly_hours', 'native_country',
                              'income'])
    # For ease of presentation, keep only part of the columns
    data_title = data[['age', 'employer_type', 'education', 'sex', 'weekly_hours', 'occupation', 'income']]
    print(data_title.head())
    # The shape attribute gives the size of the dataset
    print("data_title.shape:\n", data_title.shape)
    data_title.info()
    return data_title

data_title = income_forecast()

employer_type, education, sex, occupation, and income are not numeric columns.

Output

   age      employer_type  education  ...  weekly_hours         occupation  income
0   39          State-gov  Bachelors  ...            40       Adm-clerical   <=50K
1   50   Self-emp-not-inc  Bachelors  ...            13    Exec-managerial   <=50K
2   38            Private    HS-grad  ...            40  Handlers-cleaners   <=50K
3   53            Private       11th  ...            40  Handlers-cleaners   <=50K
4   28            Private  Bachelors  ...            40     Prof-specialty   <=50K
data_title.shape:
 (32561, 7)
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   age            32561 non-null  int64
 1   employer_type  32561 non-null  object
 2   education      32561 non-null  object
 3   sex            32561 non-null  object
 4   weekly_hours   32561 non-null  int64
 5   occupation     32561 non-null  object
 6   income         32561 non-null  object
dtypes: int64(2), object(5)
memory usage: 1.7+ MB

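from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

## 1 - Data preparation
# 1.2 Data preprocessing
# Use get_dummies to convert the text columns into numeric dummy variables
data_dummies = pd.get_dummies(data_title)
print("data_dummies.shape:\n", data_dummies.shape)
# Compare the original features with the dummy-variable features (df.columns returns the header)
print('Original features:\n', list(data_title.columns), '\n')
print('Dummy-variable features:\n', list(data_dummies.columns))
# Separate features and label. Assumption: the label is the ">50K" income dummy column
# and the features are all remaining non-income columns (44 columns, matching the output below)
label_columns = [c for c in data_dummies.columns if c.endswith('50K')]
x = data_dummies.drop(columns=label_columns).values
y = data_dummies[[c for c in label_columns if '>' in c][0]].values
print('Feature shape:{} Label shape:{}'.format(x.shape, y.shape))
## 2 - Modelling: split the dataset / train the model / test
# 2.1 Split the data into a training set and a test set with train_test_split(); by default 75% of the data goes to the training set and 25% to the test set
# Both x and y must be split; random_state=0 fixes the random seed so that the split is the same on every run
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
# Check the sizes of the resulting datasets
print('x_train_shape:{}'.format(x_train.shape))
print('x_test_shape:{}'.format(x_test.shape))
print('y_train_shape:{}'.format(y_train.shape))
print('y_test_shape:{}'.format(y_test.shape))
## 2 - Modelling: train the model (decision tree algorithm)
# 2.2 Model training: estimator.fit(x_train, y_train)
tree = DecisionTreeClassifier(max_depth=5)  # the maximum depth max_depth is set to 5
# Fit the training data
tree.fit(x_train, y_train)
## 2 - Modelling: test the model (decision tree algorithm)
# 2.3 Model testing: estimator.score(x_test, y_test)
score_test = tree.score(x_test, y_test)
score_train = tree.score(x_train, y_train)
print('test_score:{:.2%}'.format(score_test))
print('train_score:{:.2%}'.format(score_train))
## 3 - Applying the model: estimator.predict(x_new) (decision tree algorithm)
# The data to predict can be a new data point or any point taken from the original dataset, but it must have the same structure as the training data
x_new = [[37,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]]
# 37 years old, government employee, master's degree, male, 40 working hours per week, clerk
prediction = tree.predict(x_new)
print('Data to predict:{}'.format(x_new))
print('Prediction:{}'.format(prediction))

Output

Feature shape:(32561, 44) Label shape:(32561,)
x_train_shape:(24420, 44)
x_test_shape:(8141, 44)
y_train_shape:(24420,)
y_test_shape:(8141,)
test_score:79.62%
train_score:80.34%
Data to predict:[[37, 40, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Prediction:[0]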
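Writing a 44-element vector by hand is error-prone, because every position has to match the column order produced by get_dummies. One possible alternative, shown here on a tiny made-up table rather than on the adult dataset, is to encode the new sample with get_dummies as well and align it to the training columns with `reindex`. The column names and values below are illustrative assumptions.

```python
import pandas as pd

# A tiny stand-in for the training table (values are illustrative only)
train = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"],
    "hours_per_week": [40, 13, 40, 40],
})
train_dummies = pd.get_dummies(train, dtype=int)

# Describe the new sample with the original, human-readable columns
new = pd.DataFrame({"age": [37], "workclass": ["Private"], "hours_per_week": [40]})

# Encode it, then align to the training columns; categories absent from the sample become 0
new_dummies = pd.get_dummies(new, dtype=int).reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(new_dummies.values.tolist())
```

The aligned row can then be passed to `predict` in place of a hand-written list.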
