(3)、详细文档
class RandomForestRegressor Found at: sklearn.ensemble.forest
class RandomForestRegressor(ForestRegressor):
Examples
--------
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.datasets import make_regression
>>>
>>> X, y = make_regression(n_features=4, n_informative=2,
... random_state=0, shuffle=False)
>>> regr = RandomForestRegressor(max_depth=2, random_state=0)
>>> regr.fit(X, y)
RandomForestRegressor(bootstrap=True, criterion='mse',
max_depth=2,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=0, verbose=0, warm_start=False)
>>> print(regr.feature_importances_)
[ 0.17339552 0.81594114 0. 0.01066333]
>>> print(regr.predict([[0, 0, 0, 0]]))
[-2.50699856]
Notes
-----
The default values for the parameters controlling the size of the trees
(e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data sets.
To
reduce memory consumption, the complexity and size of the trees
should be
controlled by setting those parameter values.
The features are always randomly permuted at each split. Therefore,
the best found split may vary, even with the same training data,
``max_features=n_features`` and ``bootstrap=False``, if the improvement
of the criterion is identical for several splits enumerated during the
search of the best split. To obtain a deterministic behaviour during
fitting, ``random_state`` has to be fixed.
References
----------
.. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32,
2001.
See also
--------
DecisionTreeRegressor, ExtraTreesRegressor
"""
def __init__(self,
n_estimators=10,
criterion="mse",
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.,
max_features="auto",
max_leaf_nodes=None,
min_impurity_decrease=0.,
min_impurity_split=None,
bootstrap=True,
oob_score=False,
n_jobs=1,
random_state=None,
verbose=0,
warm_start=False):
super(RandomForestRegressor, self).__init__
(base_estimator=DecisionTreeRegressor(), n_estimators=n_estimators,
estimator_params=("criterion", "max_depth", "min_samples_split",
"min_samples_leaf", "min_weight_fraction_leaf", "max_features",
"max_leaf_nodes", "min_impurity_decrease", "min_impurity_split",
"random_state"), bootstrap=bootstrap, oob_score=oob_score,
n_jobs=n_jobs, random_state=random_state, verbose=verbose,
warm_start=warm_start)
self.criterion = criterion
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_fraction_leaf = min_weight_fraction_leaf
self.max_features = max_features
self.max_leaf_nodes = max_leaf_nodes
self.min_impurity_decrease = min_impurity_decrease
self.min_impurity_split = min_impurity_split
2、RF用于分类
0、RF用于分类与用于回归的两点区别
(1)、第一个不同是用于判断分割点质量的标准。DT的训练过程,需要尝试所有可能的属性,针对每个属性尝试所有可能的分割点,然后从中找出最佳的属性及其分割点。对于回归决策树,分割点的质量是由平方误差和( sum squared error)决定的。但是平方误差和对于分类问题就不起作用了,需要类似误分类率的指标来描述。
(2)、
1、sklearn.ensemble.RandomForestClassifier类构造函数
(1)、重点参数解释
Criterion:字符串,可选(缺省值为“gini”)。可能的取值如下。
Gini :利用基尼不纯度(Gini impurity)。
Entropy :利用基于熵的信息增益。
当训练数据终结于决策树的叶子节点时,叶子节点含有属于不同类别的数据,则根据叶子节点中不同类别的数据所占的百分比,分类决策树自然就可以得到数据属于某个类别的概率。依赖于具体的应用,可能想直接获得上述的概率,或者想直接将叶子节点中所占数据最多的类别作为预测值返回。如果在获得预测结果的同时想要调整阈值,则需要获得概率值。为了生成曲线下面积(area under the curve,AUC),可能想获得接收者操作特征曲线(receiver operating curve,ROC) 及其概率以保证精确度。如果想计算误分类率,则需要将概率转换为类别的预测。
(2)、重点方法解释
Fit( X, y, sample_weight=None):随机森林分类版本惟一的不同在于标签y 的特征。对于分类问题,标签的取值为0 到类别数减1 的整数。对于二分类问题,标签的取值是0 或1。对于有nClass 个不同类别的多类别分类问题,标签是从0 到nClass-1 的整数。
Predict(X):对于属性矩阵(二维的numpy 数组)X,此函数产生所属类别的预测。它生成一个单列的数组,行数等于X 的行数。每个元素是预测的所属类别,不管问题是二分类问题还是多类别分类问题,都是一样的。
Predict_proba(X):这个版本的预测函数产生一个二维数组。行数等于X 的行数。列数就是预测的类别数(对于二分类问题就是2 列)。每行的元素就是对应类别的概率。
Predict_log_proba(X):这个版本的预测函数产生一个与predict_proba 相似的二维数组。但是显示的不是所属类别的概率,而是概率的log 值。
(3)、详细文档
class RandomForestClassifier Found at: sklearn.ensemble.forest
class RandomForestClassifier(ForestClassifier):
Examples
--------
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>>
>>> X, y = make_classification(n_samples=1000, n_features=4,
... n_informative=2, n_redundant=0,
... random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(max_depth=2, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='gini',
max_depth=2, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=0, verbose=0, warm_start=False)
>>> print(clf.feature_importances_)
[ 0.17287856 0.80608704 0.01884792 0.00218648]
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
Notes
-----
The default values for the parameters controlling the size of the trees
(e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data
sets. To
reduce memory consumption, the complexity and size of the trees
should be
controlled by setting those parameter values.
The features are always randomly permuted at each split. Therefore,
the best found split may vary, even with the same training data,
``max_features=n_features`` and ``bootstrap=False``, if the
improvement
of the criterion is identical for several splits enumerated during the
search of the best split. To obtain a deterministic behaviour during
fitting, ``random_state`` has to be fixed.
References
----------
.. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32,
2001.
See also
--------
DecisionTreeClassifier, ExtraTreesClassifier
"""
def __init__(self,
n_estimators=10,
criterion="gini",
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.,
max_features="auto",
max_leaf_nodes=None,
min_impurity_decrease=0.,
min_impurity_split=None,
bootstrap=True,
oob_score=False,
n_jobs=1,
random_state=None,
verbose=0,
warm_start=False,
class_weight=None):
super(RandomForestClassifier, self).__init__
(base_estimator=DecisionTreeClassifier(), n_estimators=n_estimators,
estimator_params=("criterion", "max_depth", "min_samples_split",
"min_samples_leaf", "min_weight_fraction_leaf", "max_features",
"max_leaf_nodes", "min_impurity_decrease", "min_impurity_split",
"random_state"), bootstrap=bootstrap, oob_score=oob_score,
n_jobs=n_jobs, random_state=random_state, verbose=verbose,
warm_start=warm_start, class_weight=class_weight)
self.criterion = criterion
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_fraction_leaf = min_weight_fraction_leaf
self.max_features = max_features
self.max_leaf_nodes = max_leaf_nodes
self.min_impurity_decrease = min_impurity_decrease
self.min_impurity_split = min_impurity_split