如何使用Apache Spark Dataframe编写IN和NOT IN

我在SQL中有以下2个SQL查询示例：

a) update DBTABLE1

set col1 = 'Yes'

where ID IN ( '100' ) and City = any(select City from DBTable2 where Country = 'USA');

b) update DBTABLE1

set col2 = 'No'

where ID NOT IN ( '100' ) and City = any(select City from DBTable2 where Country = 'USA');

How to write above 2 SQLs using Apache Spark Dataframes (Not Select subquery etc). A dataframe is already having these 2 columns - col1 and col2, I am changing their values using WITHCOLUMN and WHEN clause.
CitiDF包含多个城市的数据集。

I tried below but giving compile errors:

c) This is for (a) above:

withcolumn(col("col1"),when(col("id") === lit("100")
and col("city").isin(CitiDF("city")), lit("yes")))

d) This is for (b) above:

withcolumn(col("col2"),when(col("id") === lit("100")
and ! (col("city").isin(CitiDF("city"))), lit("yes")))

为了使事情更具体，让我们考虑一些小型数据库。我们有一个名为DataFrame df，如下所示：

id	city	col1	col2
100	Frankfurt	filler	filler
200	Berlin	filler	filler
100	Vienna	filler	filler
500	Victoria	filler	filler
600	Shanghai	filler	filler
100	Cologne	filler	filler

和另一个名字cities，看起来像这样：

cityName
Frankfurt
Vienna

我们可以这样做你的查询：

val cityList = cities.collect.map(x => x(0))
val df1 = df.withColumn("col1", when($"id" === "100" and $"city".isin(cityList: _*), "yes"))
我们得到的结果是：

id	city	col1	col2
100	Frankfurt	yes	filler
200	Berlin	null	filler
100	Vienna	yes	filler
500	Victoria	null	filler
600	Shanghai	null	filler
100	Cologne	null	filler

对于第二个查询，我们使用相同的cityList：

val df2 = df.withColumn("col2", when($"id" === "100" and !$"city".isin(cityList: _*), "yes"))
给我们

id	city	col1	col2
100	Frankfurt	filler	null
200	Berlin	filler	null
100	Vienna	filler	null
500	Victoria	filler	null
600	Shanghai	filler	null
100	Cologne	filler	yes

但是，这种方法有一个很大的警告。如果城市数量很大，您可以通过收集所有名称来耗尽内存。相反，我会考虑使用替代方法，如外连接：

df.join(cities, df("city") === cities("cityName"), "outer").
withColumn("col1", when($"cityName".isNotNull and $"id" === "100", "yes")).
withColumn("col2", when($"cityName".isNull and $"id" === "100", "yes")).
drop("cityName")
给我们

id	city	col1	col2
100	Frankfurt	yes	null
500	Victoria	null	null
200	Berlin	null	null
100	Vienna	yes	null
100	Cologne	null	yes
600	Shanghai	null	null

是的，它引入了一个额外的列，但只是暂时的，并且避免将潜在的大型城市列表拉入驱动程序的内存中。

如何使用Apache Spark Dataframe编写IN和NOT IN

相关课程

相关电子书

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

如何使用Apache Spark Dataframe编写IN和NOT IN

相关课程

相关文章

相关电子书

相关镜像