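I have a set of JSON messages streamed from Kafka, each describing a website user. Using pyspark, I need to count the users per country in each streaming window and return the countries with the largest and smallest number of users.

Here is a sample of the streamed JSON messages: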

{"id":1,"first_name":"Barthel","last_name":"Kittel","email":"bkittel0@printfriendly.com","gender":"Male","ip_address":"130.187.82.195","date":"06/05/2018","country":"France"}
Here is my code:

from pyspark.sql.types import StructField, StructType, StringType
from pyspark.sql import Row
from pyspark import SparkContext
from pyspark.sql import SQLContext

fields = ['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address', 'date', 'country']
schema = StructType([
    StructField(field, StringType(), True) for field in fields
])

def parse(s, fields):
    try:
        d = json.loads(s[0])
        return [tuple(d.get(field) for field in fields)]
    except:
        return []

array_of_users = parsed.SQLContext.createDataFrame(parsed.flatMap(lambda s: parse(s, fields)), schema)

rdd = sc.parallelize(array_of_users)

# group by country, then replace each country's list of messages with its length,
# resulting in an RDD of (country, length) tuples
country_count = rdd.groupBy(lambda user: user['country']).mapValues(len)

# identify the min and max, using the second element of the (country, length) tuple as the comparison key
country_min = country_count.min(key = lambda grp: grp[1])
country_max = country_count.max(key = lambda grp: grp[1])
When I run it, I get the following message:

AttributeError                            Traceback (most recent call last)
in ()
     16         return []
     17
---> 18 array_of_users = parsed.SQLContext.createDataFrame(parsed.flatMap(lambda s: parse(s, fields)), schema)
     19
     20 rdd = sc.parallelize(array_of_users)

AttributeError: 'TransformedDStream' object has no attribute 'SQLContext'
How can I fix this?

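If I understand correctly, you need to group the list of messages by country, count the messages in each group, and then pick the groups with the smallest and largest counts.

Off the top of my head, the code would look something like this: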

# assuming array_of_users is your array of messages
rdd = sc.parallelize(array_of_users)

# group by country, then replace each country's list of messages with its length,
# resulting in an RDD of (country, length) tuples
country_count = rdd.groupBy(lambda user: user['country']).mapValues(len)

# identify the min and max, using the second element of the (country, length) tuple as the comparison key
country_min = country_count.min(key = lambda grp: grp[1])
country_max = country_count.max(key = lambda grp: grp[1])
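
The traceback in the question says that a DStream has no SQLContext attribute, so one way to apply the same grouping to streaming data might be to run it on each micro-batch through foreachRDD, which hands you an ordinary RDD. The following is only a minimal sketch under assumptions: `lines` is a hypothetical DStream whose elements are the raw JSON strings, and `parse` and `report_min_max` are illustrative names, not part of the original code. It also swaps groupBy plus len for map plus reduceByKey, which gives the same (country, count) pairs without building a list per country.

```python
import json
from operator import add


def parse(text):
    """Parse one JSON string into a user dict.

    Returns a list (empty on bad input) so it can be used with flatMap.
    """
    try:
        return [json.loads(text)]
    except (ValueError, TypeError):
        return []


def report_min_max(time, rdd):
    """Called once per micro-batch with that batch's RDD of user dicts."""
    if rdd.isEmpty():
        return  # nothing arrived in this batch
    # (country, 1) pairs summed per country -> (country, count)
    counts = rdd.map(lambda user: (user.get('country'), 1)).reduceByKey(add)
    # compare on the count, i.e. the second element of each tuple
    country_min = counts.min(key=lambda pair: pair[1])
    country_max = counts.max(key=lambda pair: pair[1])
    print(time, 'min:', country_min, 'max:', country_max)


# Hypothetical wiring, assuming `lines` is a DStream of JSON strings:
# users = lines.flatMap(parse)
# users.foreachRDD(report_min_max)
```

If the Kafka records arrive as tuples rather than plain strings, as the `s[0]` in the question suggests, the string would have to be picked out before calling `parse`.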

pyspark - Finding max and min in JSON streaming data using createDataFrame
