I have a dataframe with a 'text' column that has many rows of English sentences.
text
It is evening
Good morning
Hello everyone
What is your name
I'll see you tomorrow
I have a List variable containing some words, such as
val removeList = List("Hello", "evening", "because", "is")
I want to remove from the text column every word that appears in removeList.
So my output should be
It
Good morning
everyone
What your name
I'll see you tomorrow
How can I do this with Spark Scala?
I wrote code like this:
val stopWordsList = List("Hello", "evening", "because", "is");
val df3 = sqlContext.sql("SELECT text FROM table");
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
def cleanText(x: String, stopWordsList: List[String]): Any = {
  for (str <- stopWordsList) {
    if (x.contains(str)) {
      x.replaceAll(str, "")
    }
  }
}
But I got these errors:
Error:(44, 12) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
Error:(44, 12) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[String])org.apache.spark.sql.Dataset[String].
Unspecified value parameter evidence$6.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
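For reference, the encoder error comes from cleanText being declared to return Any: map on a Dataset needs an implicit Encoder for the result type, and Spark only provides encoders for types such as String, primitives, and case classes. A minimal sketch of one way to make the original approach compile, assuming a SparkSession named spark and the same stopWordsList (names not taken from the question are assumptions):

import spark.implicits._  // brings Encoder[String] into scope for map

// return String (an encodable type) instead of Any, and actually use the result of replaceAll
def cleanText(x: String, stopWordsList: List[String]): String =
  stopWordsList.foldLeft(x)((acc, w) => acc.replaceAll("\\b" + w + "\\b", ""))

val df3 = spark.sql("SELECT text FROM table")
val df4 = df3.map(row => cleanText(row.mkString, stopWordsList))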
Try this DataFrame and RDD approach:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField}
import spark.implicits._  // for toDF
val df = Seq(("It is evening"), ("Good morning"), ("Hello everyone"), ("What is your name"), ("I'll see you tomorrow")).toDF("data")
val removeList = List("Hello", "evening", "because", "is")
// read the sentence, strip each stop word as a whole word, and keep both the original and the cleaned text
val rdd2 = df.rdd.map { x => val p = x.getString(0); val k = removeList.foldLeft(p)((acc, t) => acc.replaceAll("\\b" + t + "\\b", "")); Row(x(0), k) }
spark.createDataFrame(rdd2, df.schema.add(StructField("new1", StringType))).show(false)
Output:
| data | new1 |
|---|---|
| It is evening | It |
| Good morning | Good morning |
| Hello everyone | everyone |
| What is your name | What your name |
| I'll see you tomorrow | I'll see you tomorrow |
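An alternative sketch that stays in the DataFrame API with no RDD round trip, assuming the same df and removeList as above; it builds a single word-boundary regex from removeList and strips the matches with regexp_replace (words containing regex metacharacters would need escaping first):

import org.apache.spark.sql.functions.{col, regexp_replace, trim}

// match any word in removeList as a whole word, e.g. \bHello\b|\bevening\b|\bbecause\b|\bis\b
val pattern = removeList.map(w => "\\b" + w + "\\b").mkString("|")
val cleaned = df.withColumn("new1", trim(regexp_replace(col("data"), pattern, "")))
cleaned.show(false)

Note that removing interior words can leave doubled spaces; an extra regexp_replace on " +" with " " would collapse them if needed.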