# PySpark MLlib pitfall: model.predict with rdd.map and zip (be especially careful with zip!)

When using the Scala API, the recommended way of getting predictions for an `RDD[LabeledPoint]` with a `DecisionTreeModel` is to simply map over the RDD:

```scala
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
```

Unfortunately, the analogous approach in PySpark doesn't work so well:

```python
labelsAndPredictions = testData.map(
    lambda lp: (lp.label, model.predict(lp.features)))
labelsAndPredictions.first()
```

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Instead, the official documentation recommends something like this:

```python
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
```

This appears to imply that even the trivial `a.map(f).zip(a)` is not guaranteed to be equivalent to `a.map(x => (f(x), x))`. So under what circumstances are `zip()` results reproducible?

`zip` is, generally speaking, a tricky operation. It requires both RDDs to have not only the same number of partitions but also the same number of elements per partition.

Excluding some special cases, this is guaranteed only if both RDDs have the same ancestor and there are no shuffles or operations that can change the number of elements (`filter`, `flatMap`) between the common ancestor and the current state. Typically this means only `map` (1-to-1) transformations.
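To see why the per-partition element counts matter, here is a toy plain-Python model of `zip`'s partition-wise semantics (a sketch for intuition, not Spark's actual implementation; the function names are mine):

```python
# A toy model: an "RDD" is a list of partitions, each partition a list.
# Like Spark's RDD.zip, rdd_zip pairs elements within each partition pair,
# so both sides must agree on partition count AND per-partition element count.

def rdd_zip(a, b):
    assert len(a) == len(b), "same number of partitions required"
    out = []
    for pa, pb in zip(a, b):
        if len(pa) != len(pb):
            raise ValueError("partitions have different element counts")
        out.append(list(zip(pa, pb)))
    return out

def rdd_map(rdd, f):
    # map is 1-to-1: per-partition counts are preserved, so zip stays safe
    return [[f(x) for x in part] for part in rdd]

def rdd_filter(rdd, pred):
    # filter can change per-partition counts, so zipping with the parent may fail
    return [[x for x in part if pred(x)] for part in rdd]

data = [[1, 2], [3, 4]]  # two partitions

# Safe: map preserves the partition layout exactly
safe = rdd_zip(rdd_map(data, lambda x: x * 10), data)

# Unsafe: filter drops elements, breaking the per-partition alignment
# rdd_zip(rdd_filter(data, lambda x: x % 2 == 0), data)  # raises ValueError
```

This mirrors the rule above: only 1-to-1 transformations between the common ancestor and both zip operands keep the pairing well-defined.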

1. Simply union the RDD with an empty RDD, i.e. `rdd = rdd.union(sc.parallelize([]))`; after that, `map` followed by `zip` produces the correct result!

2. Alternatively, collect the RDD to be predicted onto the driver and call `model.predict` there. It works, but it is an ugly approach!
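Both workarounds can be sketched as helper functions. This is an illustrative sketch, not a tested recipe: the names `predict_with_union` and `predict_on_driver` are mine, and it assumes a trained MLlib `model` and an RDD `testData` of `LabeledPoint` objects.

```python
# Hypothetical helpers for the two workarounds above.
# Assumed inputs: sc (SparkContext), model (a trained MLlib model whose
# predict() accepts a feature vector), testData (RDD of LabeledPoint).

def predict_with_union(sc, model, testData):
    # Workaround 1: union with an empty RDD, then map + zip as recommended.
    fixed = testData.union(sc.parallelize([]))
    predictions = model.predict(fixed.map(lambda lp: lp.features))
    return fixed.map(lambda lp: lp.label).zip(predictions)

def predict_on_driver(model, testData):
    # Workaround 2 (ugly): pull everything to the driver and predict locally,
    # one feature vector at a time. Only viable for small datasets.
    labels = testData.map(lambda lp: lp.label).collect()
    features = testData.map(lambda lp: lp.features).collect()
    return [(label, model.predict(f)) for label, f in zip(labels, features)]
```

Workaround 2 loses all parallelism and requires the data to fit in driver memory, which is why the map + zip pattern remains the recommended path.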
