Quick Start to Python Machine Learning (19) - Alibaba Cloud Developer Community

2023-02-15

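9.4 Decision Tree Regression (Decision Tree Regressor)

9.4.1 Class, parameters, attributes, and methods

Class

class sklearn.tree.DecisionTreeRegressor(*, criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, ccp_alpha=0.0)

This is the signature as of scikit-learn 0.24. In scikit-learn 1.0 and later, 'mse' and 'mae' are named 'squared_error' and 'absolute_error' (the old names were removed in 1.2), and min_impurity_split no longer exists.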

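Parameters

Parameter	Type	Description
max_depth	int, default=None	Maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
criterion	{'mse', 'friedman_mse', 'mae', 'poisson'}, default='mse'	The function used to measure the quality of a split. 'mse' is the mean squared error, which is equal to variance reduction as a feature selection criterion and minimizes the L2 loss using the mean of each terminal node. 'friedman_mse' uses mean squared error with Friedman's improvement score for potential splits. 'mae' is the mean absolute error, which minimizes the L1 loss using the median of each terminal node. 'poisson' uses the reduction in Poisson deviance to find splits.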
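The statement that the squared-error criterion "minimizes the L2 loss using the mean of each terminal node" can be checked directly. The following is a minimal sketch, not part of the original article: the synthetic data and the variable names are assumptions made for illustration, and the criterion is left at its default so the code does not depend on whether the installed version calls it 'mse' or 'squared_error'.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data (an assumption, not from the article).
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(40, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=40)

# max_depth=2 allows at most 2**2 = 4 leaves.
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

leaf_ids = reg.apply(X)   # index of the leaf each sample falls into
pred = reg.predict(X)     # prediction for each sample

# With the default squared-error criterion, every leaf predicts the mean
# target value of the training samples that ended up in it.
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    print("leaf", leaf, "samples", mask.sum(),
          "mean y", round(y[mask].mean(), 4),
          "prediction", round(pred[mask][0], 4))

# Per-sample mean of the leaf that the sample belongs to.
leaf_means = np.array([y[leaf_ids == leaf].mean() for leaf in leaf_ids])
```

Comparing `pred` with `leaf_means` shows why a shallow regression tree produces a step function: it can output only as many distinct values as it has leaves.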

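Attributes

Attribute	Type	Description
feature_importances_	ndarray of shape (n_features,)	The feature importances.
max_features_	int	The inferred value of max_features.
n_features_	int	The number of features when fit is performed (replaced by n_features_in_ in scikit-learn 1.0 and later).
n_outputs_	int	The number of outputs when fit is performed.
tree_	Tree instance	The underlying Tree object. See help(sklearn.tree._tree.Tree) for the attributes of the Tree object, and "Understanding the decision tree structure" in the scikit-learn documentation for basic usage of these attributes.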

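Methods

Method	Description
apply(X[, check_input])	Return the index of the leaf that each sample is predicted as.
cost_complexity_pruning_path(X, y[, …])	Compute the pruning path during minimal cost-complexity pruning.
decision_path(X[, check_input])	Return the decision path in the tree.
fit(X, y[, sample_weight, check_input, …])	Build a decision tree regressor from the training set (X, y).
get_depth()	Return the depth of the decision tree.
get_n_leaves()	Return the number of leaves of the decision tree.
get_params([deep])	Get the parameters of this estimator.
predict(X[, check_input])	Predict the class or regression value for X.
score(X, y[, sample_weight])	Return the coefficient of determination R² of the prediction.
set_params(**params)	Set the parameters of this estimator.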
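As a quick tour of the attributes and methods listed above, here is a small sketch on the diabetes data set that ships with scikit-learn. The choice of max_depth=3 and the variable names are assumptions for illustration only; the article's own experiments follow in the next sections.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

depth = reg.get_depth()                  # actual depth, at most max_depth
n_leaves = reg.get_n_leaves()            # at most 2**max_depth leaves
importances = reg.feature_importances_   # one value per feature
train_r2 = reg.score(X_train, y_train)   # R² on the training split
test_r2 = reg.score(X_test, y_test)      # R² on the test split
path = reg.decision_path(X_test[:1])     # sparse matrix: samples x tree nodes

print("depth:", depth, "leaves:", n_leaves)
print("feature importances:", importances.round(3))
print("train R2: {:.2%}  test R2: {:.2%}".format(train_r2, test_r2))
print("nodes visited by the first test sample:", path.indices.tolist())
```

The row returned by `decision_path` marks the nodes a sample passes through on its way from the root to its leaf, so its length is bounded by the tree depth plus one.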

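9.4.2 Analyzing noisy make_regression data

# util is the author's helper class from earlier parts of this series (not shown here)
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def DecisionTreeRegressor_for_make_regression_add_noise():
    myutil = util()
    X, y = make_regression(n_samples=100, n_features=1, n_informative=2, noise=50, random_state=8)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    # Caution: the model is fitted on all of (X, y), so the test samples are seen during training
    clf = DecisionTreeRegressor().fit(X, y)
    title = "make_regression DecisionTreeRegressor() regression line (with noise)"
    myutil.print_scores(clf, X_train, y_train, X_test, y_test, title)
    myutil.draw_line(X[:, 0], y, clf, title)
    myutil.plot_learning_curve(DecisionTreeRegressor(), X, y, title)
    myutil.show_pic(title)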

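Output (judging by the argument order of print_scores, the first score is for the training set and the second for the test set)

make_regression DecisionTreeRegressor() regression line (with noise):
100.00%
make_regression DecisionTreeRegressor() regression line (with noise):
100.00%

Both scores are 100%, but this should not be read as a good result. The tree was fitted on all of X and y, so the test samples were part of the training data, and a tree with no depth limit simply memorizes every point it was trained on. Fitting on X_train and y_train only would likely give a much lower test score on data this noisy.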
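The listing in 9.4.2 calls fit(X, y) on the full data set before scoring on a test split taken from that same data. For comparison, the following is a minimal sketch that fits on the training split only. It does not use the author's util helper and prints the R² scores directly; n_informative=1 and the variable names are choices made for this sketch.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=1, n_informative=1,
                       noise=50, random_state=8)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=8, test_size=0.3)

# Fit on the training split only, so the test split stays unseen.
reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = reg.score(X_train, y_train)
test_r2 = reg.score(X_test, y_test)
print("train R2: {:.2%}".format(train_r2))
print("test  R2: {:.2%}".format(test_r2))
```

An unrestricted tree still reaches a training score of 100% because each training point ends up in its own leaf, but the test score now reflects how well the tree generalizes to noisy data it has not seen.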

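9.4.3 Analyzing the Boston housing data

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def DecisionTreeRegressor_for_boston():
    myutil = util()  # the author's helper class, not shown here
    # load_boston was removed in scikit-learn 1.2; this listing needs an older version
    boston = datasets.load_boston()
    X, y = boston.data, boston.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)
    # Compare trees of increasing depth
    for max_depth in [1, 3, 5, 7]:
        clf = DecisionTreeRegressor(max_depth=max_depth)
        clf.fit(X_train, y_train)
        title = "Boston data set (max_depth=" + str(max_depth) + ")"
        myutil.print_scores(clf, X_train, y_train, X_test, y_test, title)
        myutil.plot_learning_curve(DecisionTreeRegressor(max_depth=max_depth), X, y, title)
        myutil.show_pic(title)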

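Output

Boston data set (max_depth=1):
45.95%
Boston data set (max_depth=1):
35.44%
Boston data set (max_depth=3):
83.84%
Boston data set (max_depth=3):
62.87%
Boston data set (max_depth=5):
93.82%
Boston data set (max_depth=5):
69.38%
Boston data set (max_depth=7):
97.31%
Boston data set (max_depth=7):
79.19%

max_depth=7 gives the best test score (79.19%). However, the training score exceeds the test score in every case, and the gap is large from max_depth=3 onward, which indicates overfitting.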

9.4.4 Analyzing the diabetes data

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def DecisionTreeRegressor_for_diabetes():
    myutil = util()  # the author's helper class, not shown here
    diabetes = datasets.load_diabetes()
    X, y = diabetes.data, diabetes.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)
    # Compare trees of increasing depth
    for max_depth in [1, 3, 5, 7]:
        clf = DecisionTreeRegressor(max_depth=max_depth)
        clf.fit(X_train, y_train)
        title = "Diabetes data set (max_depth=" + str(max_depth) + ")"
        myutil.print_scores(clf, X_train, y_train, X_test, y_test, title)
        myutil.plot_learning_curve(DecisionTreeRegressor(max_depth=max_depth), X, y, title)
        myutil.show_pic(title)

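Output

Diabetes data set (max_depth=1):
30.44%
Diabetes data set (max_depth=1):
15.21%
Diabetes data set (max_depth=3):
55.64%
Diabetes data set (max_depth=3):
28.37%
Diabetes data set (max_depth=5):
71.81%
Diabetes data set (max_depth=5):
18.06%
Diabetes data set (max_depth=7):
84.30%
Diabetes data set (max_depth=7):
-1.26%

Overfitting is severe here and gets worse as max_depth grows. At max_depth=7 the training score reaches 84.30% while the test R² is negative (-1.26%), meaning the tree does worse on the test set than a constant prediction of the mean target would.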
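A single train/test split can make the choice of max_depth look more or less favorable than it is. One common way to compare depths more robustly is k-fold cross-validation. The sketch below is an illustration, not the author's method: it uses cross_val_score with 5 folds (an assumed setting) on the same diabetes data, and no expected scores are shown because they were not part of the original article.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

depths = [1, 3, 5, 7]
cv_scores = {}
for depth in depths:
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # For regressors the default scoring is R², the same metric as score().
    scores = cross_val_score(reg, X, y, cv=5)
    cv_scores[depth] = scores.mean()
    print("max_depth={}: mean CV R2 = {:.2%} (std {:.2%})".format(
        depth, scores.mean(), scores.std()))

# Depth with the highest mean cross-validated R².
best_depth = max(cv_scores, key=cv_scores.get)
print("best max_depth by cross-validation:", best_depth)
```

Each depth is scored on five held-out folds instead of one test split, so the comparison depends less on which samples happened to land in the test set.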

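9.5 Decision tree pruning

Overfitting is the main weakness of decision trees, for classification and regression alike. Trees can fit the training data very closely (as the 100% scores in "9.4.2 Analyzing noisy make_regression data" show), which makes them powerful but prone to memorizing noise. There are two common remedies:

Pruning
Random forests

Random forests are a form of ensemble learning and are covered in the next chapter. This section covers pruning.

Pre-pruning: stop growing the tree early, for example with max_depth, min_samples_leaf, or max_leaf_nodes in sklearn.
Post-pruning: grow the full tree first, then prune it back. sklearn supports this through minimal cost-complexity pruning (the ccp_alpha parameter).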

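from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def decision_tree_pruning():
    myutil = util()  # the author's helper class, not shown here
    cancer = datasets.load_breast_cancer()
    # stratify keeps the class proportions the same in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, stratify=cancer.target, random_state=42)
    # Build the tree without pruning
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train, y_train)
    title = "Unpruned tree accuracy"
    myutil.print_scores(tree, X_train, y_train, X_test, y_test, title)
    print("Unpruned, tree depth:{}".format(tree.get_depth()))
    # Build the tree with pre-pruning (depth limited to 4)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X_train, y_train)
    title = "Pruned tree accuracy"
    myutil.print_scores(tree, X_train, y_train, X_test, y_test, title)
    print("Pruned, tree depth:{}".format(tree.get_depth()))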

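Output

Unpruned tree accuracy:
100.00%
Unpruned tree accuracy:
93.71%
Unpruned, tree depth:7
Pruned tree accuracy:
98.83%
Pruned tree accuracy:
95.10%
Pruned, tree depth:4

Limiting the depth to 4 lowers the training accuracy from 100.00% to 98.83% but raises the test accuracy from 93.71% to 95.10%.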
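The listing above prunes by limiting max_depth, which is pre-pruning. Post-pruning can be sketched with the cost_complexity_pruning_path method and the ccp_alpha parameter that appear in 9.4.1. This is a hedged illustration on the same breast cancer split, not code from the original article; the loop and variable names are assumptions made here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Grow the full tree, then ask for the sequence of effective alphas at which
# successive subtrees would be pruned away.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Guard against tiny negative values caused by floating-point error.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

results = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    results.append((alpha, tree.get_n_leaves(),
                    tree.score(X_train, y_train), tree.score(X_test, y_test)))

for alpha, n_leaves, train_acc, test_acc in results:
    print("ccp_alpha={:.5f} leaves={:2d} train={:.2%} test={:.2%}".format(
        alpha, n_leaves, train_acc, test_acc))
```

Larger values of ccp_alpha remove more of the tree. In practice the value would be chosen on a validation set or by cross-validation rather than by reading the test scores printed here.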

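9.6 Decision tree visualization

# pip3 install graphviz
# Graphviz is an open-source graph visualization tool from AT&T Research and Lucent Bell Labs

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

def show_tree():
    wine = datasets.load_wine()
    # Use only the first two features
    X = wine.data[:, :2]
    y = wine.target
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    # max_depth=3 keeps the picture from getting too large
    clf = DecisionTreeClassifier(max_depth=3)
    clf.fit(X_train, y_train)
    export_graphviz(clf, out_file="wine.dot", class_names=wine.target_names,
                    feature_names=wine.feature_names[:2], impurity=False, filled=True)
    # Read the dot file back and return the graph so that a notebook can render it
    with open("wine.dot") as f:
        dot_graph = f.read()
    return graphviz.Source(dot_graph)

To view the result, install the Graphviz software and open wine.dot with it.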
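When installing Graphviz is not an option, the same tree can be inspected as plain text. The sketch below uses sklearn.tree.export_text as an alternative to the Graphviz route described above; it is an addition for illustration, and random_state=0 is an assumed setting so that the printed tree is reproducible.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
X = wine.data[:, :2]   # only the first two features, as in the listing above
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# export_text renders the split rules as an indented text tree.
text = export_text(clf, feature_names=list(wine.feature_names[:2]))
print(text)
```

Each indented line is one split rule, and the lines ending in a class label are the leaves, so the output can be read top-down just like the rendered graph.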
