Python Spark random forest: an introductory demo

2017-11-16

Summary:

class pyspark.mllib.tree.RandomForest

Learning algorithm for a random forest model for classification or regression.

New in version 1.2.0.

supportedFeatureSubsetStrategies = ('auto', 'all', 'sqrt', 'log2', 'onethird')

classmethod trainClassifier(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None)

Train a random forest model for binary or multiclass classification.

Parameters:

data – Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
numClasses – Number of classes for classification.
categoricalFeaturesInfo – Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
numTrees – Number of trees in the random forest.
featureSubsetStrategy – Number of features to consider for splits at each node. Supported values: “auto”, “all”, “sqrt”, “log2”, “onethird”. If “auto” is set, this parameter is set based on numTrees: if numTrees == 1, set to “all”; if numTrees > 1 (forest) set to “sqrt”. (default: “auto”)
impurity – Criterion used for information gain calculation. Supported values: “gini” or “entropy”. (default: “gini”)
maxDepth – Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 4)
maxBins – Maximum number of bins used for splitting features. (default: 32)
seed – Random seed for bootstrapping and choosing feature subsets. Set as None to generate seed based on system time. (default: None)
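
The featureSubsetStrategy rules above can be made concrete with a small pure-Python sketch. The helper `features_per_node` is hypothetical and is not part of PySpark; the "auto" branch follows the rule quoted above for trainClassifier, while rounding fractional results up is an assumption made for illustration.

```python
import math


def features_per_node(strategy, num_features, num_trees):
    """Hypothetical helper: how many features each split may consider."""
    if strategy == "auto":
        # Rule quoted in the parameter list: a single tree uses every
        # feature, a forest (numTrees > 1) falls back to "sqrt".
        strategy = "all" if num_trees == 1 else "sqrt"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        # Assumption: fractional results are rounded up.
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        # Assumption: rounded up, and never fewer than one feature.
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3)
    raise ValueError("unsupported featureSubsetStrategy: %r" % strategy)


if __name__ == "__main__":
    for name in ("auto", "all", "sqrt", "log2", "onethird"):
        print(name, features_per_node(name, num_features=16, num_trees=3))
```

Considering fewer features per split makes the individual trees less correlated with each other, which is why a forest does not default to "all".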

Returns:

RandomForestModel that can be used for prediction.
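
To illustrate what the impurity parameter measures, here is a minimal sketch of the two supported criteria in plain Python. The functions `gini`, `entropy`, and `information_gain` are illustrative names rather than PySpark APIs, and the base-2 logarithm in `entropy` is an assumption.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy of the class distribution (base-2 logarithm assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


if __name__ == "__main__":
    # Labels of a four-point data set with two points per class.
    parent = [0.0, 0.0, 1.0, 1.0]
    # A split that separates the two classes perfectly.
    left, right = [0.0, 0.0], [1.0, 1.0]
    print(gini(parent), entropy(parent))
    print(information_gain(parent, left, right))
```

A split that produces two pure children removes all impurity, so its gain equals the impurity of the parent node.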

Example usage:

    >>> from pyspark.mllib.regression import LabeledPoint
    >>> from pyspark.mllib.tree import RandomForest
    >>>
    >>> data = [
    ...     LabeledPoint(0.0, [0.0]),
    ...     LabeledPoint(0.0, [1.0]),
    ...     LabeledPoint(1.0, [2.0]),
    ...     LabeledPoint(1.0, [3.0])
    ... ]
    >>> model = RandomForest.trainClassifier(sc.parallelize(data), 2, {}, 3, seed=42)
    >>> model.numTrees()
    3
    >>> model.totalNumNodes()
    7
    >>> print(model)
    TreeEnsembleModel classifier with 3 trees
    >>> print(model.toDebugString())
    TreeEnsembleModel classifier with 3 trees
      Tree 0:
        Predict: 1.0
      Tree 1:
        If (feature 0 <= 1.0)
         Predict: 0.0
        Else (feature 0 > 1.0)
         Predict: 1.0
      Tree 2:
        If (feature 0 <= 1.0)
         Predict: 0.0
        Else (feature 0 > 1.0)
         Predict: 1.0
    >>> model.predict([2.0])
    1.0
    >>> model.predict([0.0])
    0.0
    >>> rdd = sc.parallelize([[3.0], [1.0]])
    >>> model.predict(rdd).collect()
    [1.0, 0.0]

New in version 1.2.0.
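
The predictions in the example can be reproduced by hand from the debug string. The sketch below hard-codes the three printed trees as plain Python functions and combines them by majority vote, on the assumption that a classification forest takes the most common tree prediction. The names `tree_constant`, `tree_split`, and `predict` are illustrative and do not come from PySpark.

```python
from collections import Counter


def tree_constant(features):
    # Tree 0 in the debug string: a single leaf that always predicts 1.0.
    return 1.0


def tree_split(features):
    # Trees 1 and 2: one split on feature 0 at threshold 1.0.
    return 0.0 if features[0] <= 1.0 else 1.0


# One leaf plus two three-node trees gives 7 nodes in total,
# matching model.totalNumNodes() in the example.
FOREST = [tree_constant, tree_split, tree_split]


def predict(features):
    """Majority vote over the hand-written trees (assumed combination rule)."""
    votes = Counter(tree(features) for tree in FOREST)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(predict([2.0]))
    print(predict([0.0]))
    print([predict(row) for row in [[3.0], [1.0]]])
```

For the input [0.0], Tree 0 votes 1.0 but is outvoted by the two split trees, so the forest returns 0.0, which agrees with the output shown in the example.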

Excerpted from: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree

This article is reposted from 张昺华-sky's cnblogs blog. Original link: http://www.cnblogs.com/bonelee/p/7150484.html. To republish it, please contact the original author.
