>>> v=sc.parallelize(["one", "two", "two", "three", "three", "three"])
>>> v2=v.map(lambda x: (x,1))
>>> v2.collect()
[('one', 1), ('two', 1), ('two', 1), ('three', 1), ('three', 1), ('three', 1)]
>>> v3=v2.groupByKey()
>>> v3.collect()
[('one', <pyspark.resultiterable.ResultIterable object at 0x7fd3c7850e90>), ('two', <pyspark.resultiterable.ResultIterable object at 0x7fd3c7850f10>), ('three', <pyspark.resultiterable.ResultIterable object at 0x7fd3c6dc83d0>)]
>>> v4=v3.filter(lambda x:len(x[1].data)>2)
>>> v4.collect()
[('three', <pyspark.resultiterable.ResultIterable object at 0x7fd3c6dc8510>)]
The filter keeps only the keys that occur more than 2 times, so only 'three' remains.
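To check the logic without a Spark context, the same three steps (map to pairs, group by key, filter on group size) can be sketched in plain Python. This is only an illustration of what the session above computes, not how Spark runs it; the function name `filter_frequent` and the parameter `min_count` are made up for this example.

```python
from collections import defaultdict


def filter_frequent(words, min_count=2):
    """Return (word, values) pairs for words seen more than min_count times.

    Hypothetical helper: mirrors the map / groupByKey / filter steps
    of the PySpark session on a plain Python list.
    """
    # Step 1: like v.map(lambda x: (x, 1)), pair every word with a 1.
    pairs = [(word, 1) for word in words]

    # Step 2: like groupByKey(), collect all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Step 3: like the filter, keep keys whose group has more than min_count items.
    return [(key, values) for key, values in groups.items() if len(values) > min_count]


result = filter_frequent(["one", "two", "two", "three", "three", "three"])
print(result)  # [('three', [1, 1, 1])]
```

In the PySpark version, `len(x[1])` should work in place of `len(x[1].data)`, since `ResultIterable` defines `__len__`; that avoids depending on the `data` attribute.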
This article is reposted from 张昺华-sky's blog on cnblogs (博客园). Original link: http://www.cnblogs.com/bonelee/p/7764934.html. To republish it, please contact the original author.