随机森林算法demo python spark

简介:

关键参数

最重要的,常常需要调试以提高算法效果的有两个参数:numTrees,maxDepth。

  • numTrees(决策树的个数):增加决策树的个数会降低预测结果的方差,这样在测试时会有更高的accuracy。训练时间大致与numTrees呈线性增长关系。
  • maxDepth:是指森林中每一棵决策树最大可能depth,在决策树中提到了这个参数。更深的一棵树意味模型预测更有力,但同时训练时间更长,也更倾向于过拟合。但是值得注意的是,随机森林算法和单一决策树算法对这个参数的要求是不一样的。随机森林由于是多个的决策树预测结果的投票或平均而降低而预测结果的方差,因此相对于单一决策树而言,不容易出现过拟合的情况。所以随机森林可以选择比决策树模型中更大的maxDepth。 
    甚至有的文献说,随机森林的每棵决策树都最大可能地进行生长而不进行剪枝。但是不管怎样,还是建议对maxDepth参数进行一定的实验,看看是否可以提高预测的效果。 
    另外还有两个参数,subsamplingRate,featureSubsetStrategy一般不需要调试,但是这两个参数也可以重新设置以加快训练,但是值得注意的是可能会影响模型的预测效果(如果需要调试的仔细读下面英文吧)。

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide. 
The first two parameters we mention are the most important, and tuning them can often improve performance: 
(1)numTrees: Number of trees in the forest. 
Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy. 
Training time increases roughly linearly in the number of trees. 
(2)maxDepth: Maximum depth of each tree in the forest. 
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting. 
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest). 
The next two parameters generally do not require tuning. However, they can be tuned to speed up training. 
(3)subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training. 
(4)featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low. 
We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.

复制代码
"""
Random Forest Classification Example.
"""
from __future__ import print_function

from pyspark import SparkContext
# $example on$
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
# $example off$

if __name__ == "__main__":
    sc = SparkContext(appName="PythonRandomForestClassificationExample")
    # $example on$
    # Load and parse the data file into an RDD of LabeledPoint.
    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a RandomForest model.
    #  Empty categoricalFeaturesInfo indicates all features are continuous.
    #  Note: Use larger numTrees in practice.
    #  Setting featureSubsetStrategy="auto" lets the algorithm choose.
    model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                         numTrees=3, featureSubsetStrategy="auto",
                                         impurity='gini', maxDepth=4, maxBins=32)

    # Evaluate model on test instances and compute test error
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
    print('Test Error = ' + str(testErr))
    print('Learned classification forest model:')
    print(model.toDebugString())

    # Save and load model
    model.save(sc, "target/tmp/myRandomForestClassificationModel")
    sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
    # $example off$
复制代码

 模型样子:

复制代码
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
复制代码

 
















本文转自张昺华-sky博客园博客,原文链接:http://www.cnblogs.com/bonelee/p/7204096.html,如需转载请自行联系原作者

相关文章
|
5月前
|
算法 搜索推荐 JavaScript
基于python智能推荐算法的全屋定制系统
本研究聚焦基于智能推荐算法的全屋定制平台网站设计,旨在解决消费者在个性化定制中面临的选择难题。通过整合Django、Vue、Python与MySQL等技术,构建集家装设计、材料推荐、家具搭配于一体的一站式智能服务平台,提升用户体验与行业数字化水平。
|
6月前
|
存储 算法 调度
【复现】【遗传算法】考虑储能和可再生能源消纳责任制的售电公司购售电策略(Python代码实现)
【复现】【遗传算法】考虑储能和可再生能源消纳责任制的售电公司购售电策略(Python代码实现)
298 26
|
6月前
|
机器学习/深度学习 算法 机器人
【机器人路径规划】基于D*算法的机器人路径规划(Python代码实现)
【机器人路径规划】基于D*算法的机器人路径规划(Python代码实现)
313 0
|
6月前
|
机器学习/深度学习 算法 机器人
【机器人路径规划】基于改进型A*算法的机器人路径规划(Python代码实现)
【机器人路径规划】基于改进型A*算法的机器人路径规划(Python代码实现)
444 0
|
6月前
|
机器学习/深度学习 编解码 算法
【机器人路径规划】基于迪杰斯特拉算法(Dijkstra)的机器人路径规划(Python代码实现)
【机器人路径规划】基于迪杰斯特拉算法(Dijkstra)的机器人路径规划(Python代码实现)
515 4
|
6月前
|
机器学习/深度学习 算法 机器人
【机器人路径规划】基于A*算法的机器人路径规划研究(Python代码实现)
【机器人路径规划】基于A*算法的机器人路径规划研究(Python代码实现)
737 4
|
6月前
|
机器学习/深度学习 算法 机器人
【机器人路径规划】基于深度优先搜索(Depth-First-Search,DFS)算法的机器人路径规划(Python代码实现)
【机器人路径规划】基于深度优先搜索(Depth-First-Search,DFS)算法的机器人路径规划(Python代码实现)
330 3
|
6月前
|
算法 机器人 定位技术
【机器人路径规划】基于流场寻路算法(Flow Field Pathfinding)的机器人路径规划(Python代码实现)
【机器人路径规划】基于流场寻路算法(Flow Field Pathfinding)的机器人路径规划(Python代码实现)
340 4
机器学习/深度学习 算法 自动驾驶
1120 0
|
6月前
|
算法 定位技术 调度
基于蚂蚁优化算法的柔性车间调度研究(Python代码实现)
基于蚂蚁优化算法的柔性车间调度研究(Python代码实现)
271 0

热门文章

最新文章

推荐镜像

更多