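I have a "text" column that stores arrays of tokens. How can I filter each of these arrays so that only tokens at least three letters long remain?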
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['id', 'text']
vals = [
(1, ['I', 'am', 'good']),
(2, ['You', 'are', 'ok']),
]
df = spark.createDataFrame(vals, columns)
df.show()
I tried the following, but it fails with TypeError: Column is not iterable:
df_clean = df.select('id', regexp_replace('text', [len(word) >= 3 for word
in col('text')], ''))
df_clean.show()
I expect to see:
id | text
1 | [good]
2 | [You, are]
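You can do it like this. It also lets you decide whether to exclude rows: I added an extra column holding the filtered array, then filtered out the rows where that array is empty.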
from pyspark.sql import functions as f
columns = ['id', 'text']
vals = [
(1, ['I', 'am', 'good']),
(2, ['You', 'are', 'ok']),
(3, ['ok'])
]
df = spark.createDataFrame(vals, columns)
# Keep only tokens with at least three characters (the SQL filter() function needs Spark 2.4+).
df2 = df.withColumn("text_left_over", f.expr("filter(text, x -> length(x) >= 3)"))
df2.show()
# Optionally drop rows whose filtered array is empty, along with the original column.
df3 = df2.where(f.size(f.col("text_left_over")) > 0).drop("text")
df3.show()
This yields:
id | text | text_left_over |
---|---|---|
1 | [I, am, good] | [good] |
2 | [You, are, ok] | [You, are] |
3 | [ok] | [] |

id | text_left_over |
---|---|
1 | [good] |
2 | [You, are] |
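The attempt in the question fails because col('text') is a lazy column expression, not a Python list, so a list comprehension cannot loop over it. If the SQL filter() function is not available (it requires Spark 2.4 or later), one possible fallback is a Python UDF, which does receive each row's array as an ordinary list. The following is only a sketch under that assumption: keep_long_tokens and add_filtered_column are hypothetical helper names, not part of PySpark, and a UDF is usually slower than the built-in filter().

```python
def keep_long_tokens(tokens, min_len=3):
    """Return only the tokens that have at least min_len characters."""
    if tokens is None:
        return None
    # Null tokens are dropped, matching what the SQL filter() version does.
    return [t for t in tokens if t is not None and len(t) >= min_len]


def add_filtered_column(df, src="text", dst="text_left_over"):
    """Attach the filtered array as a new column via a Python UDF (hypothetical helper)."""
    # Imported here so keep_long_tokens stays usable without Spark installed.
    from pyspark.sql import functions as f
    from pyspark.sql.types import ArrayType, StringType

    filter_udf = f.udf(keep_long_tokens, ArrayType(StringType()))
    return df.withColumn(dst, filter_udf(f.col(src)))


# The plain-Python part can be checked without a Spark session:
print(keep_long_tokens(['I', 'am', 'good']))   # ['good']
print(keep_long_tokens(['You', 'are', 'ok']))  # ['You', 'are']
```

With a DataFrame like the one above, add_filtered_column(df).show() should give the same text_left_over column as the filter() expression.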