How do I find the common elements in two array columns?

I have two columns of comma-separated strings (sourceAuthors and targetAuthors).

val df = Seq(
("Author1,Author2,Author3","Author2,Author3,Author1")
).toDF("source","target")
I want to add another column, nCommonAuthors, that holds the number of authors the two columns have in common.

This is what I tried:

def myUDF = udf { (s1: String, s2: String) =>
s1.split(",")
s2.split(",")
s1.intersect(s2).length
}
val newDF = myDF.withColumn("nCommonAuthors", myUDF($"source", $"target"))
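I get the following error:

Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported

Any idea why I get this error? How can I find the common elements in the two columns?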


Asked by 社区小助手, 2018-12-21 13:49:59, 2182 views

2 answers


1565966273186108

The gson package has a utility class that can do this, so there is no need to write the code yourself.

2019-07-17 23:23:25

社区小助手


Building on SCouto's answer, here is a complete solution that works for me:

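import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Split both strings, then count the elements the two arrays share.
// The last expression (an Int) is the UDF's return value.
def myUDF: UserDefinedFunction = udf(
  (s1: String, s2: String) => {
    val splitted1 = s1.split(",")
    val splitted2 = s2.split(",")
    splitted1.intersect(splitted2).length
  })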
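One difference from the attempt in the question is worth spelling out: there the results of split are never stored, so intersect runs on the original strings and compares them character by character. The following plain-Scala sketch (no Spark needed) contrasts the two versions. The object and method names are made up for illustration.

```scala
object IntersectDemo {
  // Mirrors the attempt in the question: the split results are thrown away,
  // so intersect compares the two strings character by character.
  def countWithoutKeepingSplit(s1: String, s2: String): Int = {
    s1.split(",")
    s2.split(",")
    s1.intersect(s2).length
  }

  // Mirrors the answer: keep the split arrays and intersect those.
  def countCommon(s1: String, s2: String): Int = {
    val splitted1 = s1.split(",")
    val splitted2 = s2.split(",")
    splitted1.intersect(splitted2).length
  }

  def main(args: Array[String]): Unit = {
    val source = "Author1,Author2,Author3"
    val target = "Author2,Author3,Author1"
    println(countWithoutKeepingSplit(source, target)) // 23: every character is shared
    println(countCommon(source, target))              // 3: three shared author names
  }
}
```

Both strings contain the same 23 characters in a different order, which is why the character-level version reports 23 instead of 3.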

val spark = SparkSession.builder().master("local").getOrCreate()

import spark.implicits._

val df = Seq(("Author1,Author2,Author3","Author2,Author3,Author1")).toDF("source","target")

df.show(false)

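+-----------------------+-----------------------+
|source                 |target                 |
+-----------------------+-----------------------+
|Author1,Author2,Author3|Author2,Author3,Author1|
+-----------------------+-----------------------+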

val newDF: DataFrame = df.withColumn("nCommonAuthors", myUDF('source,'target))

newDF.show(false)

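+-----------------------+-----------------------+--------------+
|source                 |target                 |nCommonAuthors|
+-----------------------+-----------------------+--------------+
|Author1,Author2,Author3|Author2,Author3,Author1|3             |
+-----------------------+-----------------------+--------------+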
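As an alternative to the UDF, the same count can probably be obtained with Spark's built-in column functions. This is a sketch rather than part of the original answer. It assumes Spark 2.4 or later, where array_intersect was added to org.apache.spark.sql.functions. The object name BuiltinCommonAuthors and the helper withCommonAuthors are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_intersect, col, size, split}

object BuiltinCommonAuthors {
  // Hypothetical helper: split both columns on commas, intersect the
  // resulting arrays, and store the size of the intersection.
  def withCommonAuthors(df: DataFrame): DataFrame =
    df.withColumn(
      "nCommonAuthors",
      size(array_intersect(split(col("source"), ","), split(col("target"), ",")))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._

    val df = Seq(("Author1,Author2,Author3", "Author2,Author3,Author1"))
      .toDF("source", "target")

    withCommonAuthors(df).show(false)
    spark.stop()
  }
}
```

One behavioral difference to keep in mind: array_intersect returns each shared value once, while Scala's intersect on arrays keeps repeated values up to the number of times they occur in both inputs. The two approaches agree whenever no author is listed twice in the same column.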

2019-07-17 23:23:25
