DataFrame它不是Spark SQL提出的,而是早起在R、Pandas语言就已经有了的。
A Dataset is a distributed collection of data:分布式的数据集
A DataFrame is a Dataset organized into named columns.
以列(列名、列的类型、列值)的形式构成的分布式数据集,按照列赋予不同的名称
dataframe 和 rdd
DataFrame是一种以RDD为基础的分布式数据集,类似于传统数据库中的二维表格。DataFrame引入了schema。
API常用操作
package org.example import org.apache.spark.sql.SparkSession object DataFrameApp { def main(args: Array[String]) { val spark = SparkSession.builder().appName("DataFrameApp").master("local[2]").getOrCreate() // 将json文件加载成一个dataframe val peopleDF = spark.read.format("json").load("people.json") // 输出dataframe对应的schema信息 peopleDF.printSchema() // 输出数据集的前20条记录 peopleDF.show() //查询某列所有的数据: select name from table peopleDF.select("name").show() // 查询某几列所有的数据,并对列进行计算: select name, age+10 as age2 from table peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 10).as("age2")).show() //根据某一列的值进行过滤: select * from table where age>19 peopleDF.filter(peopleDF.col("age") > 19).show() //根据某一列进行分组,然后再进行聚合操作: select age,count(1) from table group by age peopleDF.groupBy("age").count().show() spark.stop() } }
root |-- age: long (nullable = true) |-- name: string (nullable = true) +----+-------+ | age| name| +----+-------+ |null|Michael| | 30| Andy| | 19| Justin| +----+-------+ +-------+ | name| +-------+ |Michael| | Andy| | Justin| +-------+ +-------+----+ | name|age2| +-------+----+ |Michael|null| | Andy| 40| | Justin| 29| +-------+----+ +---+----+ |age|name| +---+----+ | 30|Andy| +---+----+ +----+-----+ | age|count| +----+-----+ | 19| 1| |null| 1| | 30| 1| +----+-----+
DataFrame和RDD
package org.example import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType} import org.apache.spark.sql.{Row, SparkSession} object DataFrameRDDApp { def main(args: Array[String]) { val spark = SparkSession.builder().appName("DataFrameRDDApp").master("local[2]").getOrCreate() //inferReflection(spark) program(spark) spark.stop() } def program(spark: SparkSession): Unit = { // RDD ==> DataFrame val rdd = spark.sparkContext.textFile("infos.txt") val infoRDD = rdd.map(_.split(",")).map(line => Row(line(0).toInt, line(1), line(2).toInt)) val structType = StructType(Array(StructField("id", IntegerType, true), StructField("name", StringType, true), StructField("age", IntegerType, true))) val infoDF = spark.createDataFrame(infoRDD, structType) infoDF.printSchema() infoDF.show() //通过df的api进行操作 infoDF.filter(infoDF.col("age") > 70).show //通过sql的方式进行操作 infoDF.createOrReplaceTempView("infos") spark.sql("select * from infos where age > 70").show() } def inferReflection(spark: SparkSession) { // RDD ==> DataFrame val rdd = spark.sparkContext.textFile("infos.txt") //注意:需要导入隐式转换 import spark.implicits._ val infoDF = rdd.map(_.split(",")).map(line => Info(line(0).toInt, line(1), line(2).toInt)).toDF() infoDF.show() infoDF.filter(infoDF.col("age") > 70).show infoDF.createOrReplaceTempView("infos") spark.sql("select * from infos where age > 70").show() } case class Info(id: Int, name: String, age: Int) }
root |-- id: integer (nullable = true) |-- name: string (nullable = true) |-- age: integer (nullable = true) +---+-----+---+ | id| name|age| +---+-----+---+ | 1|hello| 66| | 2|world| 89| | 3|spark| 88| +---+-----+---+ +---+-----+---+ | id| name|age| +---+-----+---+ | 2|world| 89| | 3|spark| 88| +---+-----+---+ +---+-----+---+ | id| name|age| +---+-----+---+ | 2|world| 89| | 3|spark| 88| +---+-----+---+
案例
package org.example import org.apache.spark.sql.SparkSession /** * DataFrame中的操作操作 */ object DataFrameProjectApp { def main(args: Array[String]) { val spark = SparkSession.builder().appName("DataFrameProjectApp").master("local[2]").getOrCreate() // RDD ==> DataFrame val rdd = spark.sparkContext.textFile("Student.data") //注意:需要导入隐式转换 import spark.implicits._ val studentDF = rdd.map(_.split("\\|")).map(line => Student(line(0).toInt, line(1), line(2), line(3))).toDF() //show默认只显示前20条 studentDF.show studentDF.show(30) studentDF.show(30, false) studentDF.take(10) studentDF.first() studentDF.head(3) studentDF.select("email").show(30,false) studentDF.filter("name=''").show studentDF.filter("name='' OR name='NULL'").show //name以M开头的人 studentDF.filter("SUBSTR(name,0,1)='M'").show studentDF.sort(studentDF("name")).show studentDF.sort(studentDF("name").desc).show studentDF.sort("name","id").show studentDF.sort(studentDF("name").asc, studentDF("id").desc).show studentDF.select(studentDF("name").as("student_name")).show val studentDF2 = rdd.map(_.split("\\|")).map(line => Student(line(0).toInt, line(1), line(2), line(3))).toDF() studentDF.join(studentDF2, studentDF.col("id") === studentDF2.col("id")).show spark.stop() } case class Student(id: Int, name: String, phone: String, email: String) }
+---+--------+--------------+--------------------+ | id| name| phone| email| +---+--------+--------------+--------------------+ | 1| Burke|1-300-746-8446|ullamcorper.velit...| | 2| Kamal|1-668-571-5046|pede.Suspendisse@...| | 3| Olga|1-956-311-1686|Aenean.eget.metus...| | 4| Belle|1-246-894-6340|vitae.aliquet.nec...| | 5| Trevor|1-300-527-4967|dapibus.id@acturp...| | 6| Laurel|1-691-379-9921|adipiscing@consec...| | 7| Sara|1-608-140-1995|Donec.nibh@enimEt...| | 8| Kaseem|1-881-586-2689|cursus.et.magna@e...| | 9| Lev|1-916-367-5608|Vivamus.nisi@ipsu...| | 10| Maya|1-271-683-2698|accumsan.convalli...| | 11| Emi|1-467-270-1337| est@nunc.com| | 12| Caleb|1-683-212-0896|Suspendisse@Quisq...| | 13|Florence|1-603-575-2444|sit.amet.dapibus@...| | 14| Anika|1-856-828-7883|euismod@ligulaeli...| | 15| Tarik|1-398-171-2268|turpis@felisorci.com| | 16| Amena|1-878-250-3129|lorem.luctus.ut@s...| | 17| Blossom|1-154-406-9596|Nunc.commodo.auct...| | 18| Guy|1-869-521-3230|senectus.et.netus...| | 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...| | 20| Edward|1-711-710-6552|lectus@aliquetlib...| +---+--------+--------------+--------------------+ only showing top 20 rows | id| name| phone| email| +---+--------+--------------+--------------------+ | 1| Burke|1-300-746-8446|ullamcorper.velit...| | 2| Kamal|1-668-571-5046|pede.Suspendisse@...| | 3| Olga|1-956-311-1686|Aenean.eget.metus...| | 4| Belle|1-246-894-6340|vitae.aliquet.nec...| | 5| Trevor|1-300-527-4967|dapibus.id@acturp...| | 6| Laurel|1-691-379-9921|adipiscing@consec...| | 7| Sara|1-608-140-1995|Donec.nibh@enimEt...| | 8| Kaseem|1-881-586-2689|cursus.et.magna@e...| | 9| Lev|1-916-367-5608|Vivamus.nisi@ipsu...| | 10| Maya|1-271-683-2698|accumsan.convalli...| | 11| Emi|1-467-270-1337| est@nunc.com| | 12| Caleb|1-683-212-0896|Suspendisse@Quisq...| | 13|Florence|1-603-575-2444|sit.amet.dapibus@...| | 14| Anika|1-856-828-7883|euismod@ligulaeli...| | 15| Tarik|1-398-171-2268|turpis@felisorci.com| | 16| Amena|1-878-250-3129|lorem.luctus.ut@s...| | 17| Blossom|1-154-406-9596|Nunc.commodo.auct...| | 18| Guy|1-869-521-3230|senectus.et.netus...| | 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...| | 20| Edward|1-711-710-6552|lectus@aliquetlib...| | 21| |1-711-710-6552|lectus@aliquetlib...| | 22| |1-711-710-6552|lectus@aliquetlib...| | 23| NULL|1-711-710-6552|lectus@aliquetlib...| +---+--------+--------------+--------------------+ +---+--------+--------------+-----------------------------------------+ |id |name |phone |email | +---+--------+--------------+-----------------------------------------+ |1 |Burke |1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk| |2 |Kamal |1-668-571-5046|pede.Suspendisse@interdumenim.edu | |3 |Olga |1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu | |4 |Belle |1-246-894-6340|vitae.aliquet.nec@neque.co.uk | |5 |Trevor |1-300-527-4967|dapibus.id@acturpisegestas.net | |6 |Laurel |1-691-379-9921|adipiscing@consectetueripsum.edu | |7 |Sara |1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu | |8 |Kaseem |1-881-586-2689|cursus.et.magna@euismod.org | |9 |Lev |1-916-367-5608|Vivamus.nisi@ipsumdolor.com | |10 |Maya |1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu | |11 |Emi |1-467-270-1337|est@nunc.com | |12 |Caleb |1-683-212-0896|Suspendisse@Quisque.edu | |13 |Florence|1-603-575-2444|sit.amet.dapibus@lacusAliquamrutrum.ca | |14 |Anika |1-856-828-7883|euismod@ligulaelit.co.uk | |15 |Tarik |1-398-171-2268|turpis@felisorci.com | |16 |Amena |1-878-250-3129|lorem.luctus.ut@scelerisque.com | |17 |Blossom |1-154-406-9596|Nunc.commodo.auctor@eratSed.co.uk | |18 |Guy |1-869-521-3230|senectus.et.netus@lectusrutrum.com | |19 |Malachi |1-608-637-2772|Proin.mi.Aliquam@estarcu.net | |20 |Edward |1-711-710-6552|lectus@aliquetlibero.co.uk | |21 | |1-711-710-6552|lectus@aliquetlibero.co.uk | |22 | |1-711-710-6552|lectus@aliquetlibero.co.uk | |23 |NULL |1-711-710-6552|lectus@aliquetlibero.co.uk | +---+--------+--------------+-----------------------------------------+ +-----------------------------------------+ |email | +-----------------------------------------+ |ullamcorper.velit.in@ametnullaDonec.co.uk| |pede.Suspendisse@interdumenim.edu | |Aenean.eget.metus@dictumcursusNunc.edu | |vitae.aliquet.nec@neque.co.uk | |dapibus.id@acturpisegestas.net | |adipiscing@consectetueripsum.edu | |Donec.nibh@enimEtiamimperdiet.edu | |cursus.et.magna@euismod.org | |Vivamus.nisi@ipsumdolor.com | |accumsan.convallis@ornarelectusjusto.edu | |est@nunc.com | |Suspendisse@Quisque.edu | |sit.amet.dapibus@lacusAliquamrutrum.ca | |euismod@ligulaelit.co.uk | |turpis@felisorci.com | |lorem.luctus.ut@scelerisque.com | |Nunc.commodo.auctor@eratSed.co.uk | |senectus.et.netus@lectusrutrum.com | |Proin.mi.Aliquam@estarcu.net | |lectus@aliquetlibero.co.uk | |lectus@aliquetlibero.co.uk | |lectus@aliquetlibero.co.uk | |lectus@aliquetlibero.co.uk | +-----------------------------------------+ +---+----+--------------+--------------------+ | id|name| phone| email| +---+----+--------------+--------------------+ | 21| |1-711-710-6552|lectus@aliquetlib...| | 22| |1-711-710-6552|lectus@aliquetlib...| +---+----+--------------+--------------------+ +---+----+--------------+--------------------+ | id|name| phone| email| +---+----+--------------+--------------------+ | 21| |1-711-710-6552|lectus@aliquetlib...| | 22| |1-711-710-6552|lectus@aliquetlib...| | 23|NULL|1-711-710-6552|lectus@aliquetlib...| +---+----+--------------+--------------------+ +---+-------+--------------+--------------------+ | id| name| phone| email| +---+-------+--------------+--------------------+ | 10| Maya|1-271-683-2698|accumsan.convalli...| | 19|Malachi|1-608-637-2772|Proin.mi.Aliquam@...| +---+-------+--------------+--------------------+ +---+--------+--------------+--------------------+ | id| name| phone| email| +---+--------+--------------+--------------------+ | 21| |1-711-710-6552|lectus@aliquetlib...| | 22| |1-711-710-6552|lectus@aliquetlib...| | 16| Amena|1-878-250-3129|lorem.luctus.ut@s...| | 14| Anika|1-856-828-7883|euismod@ligulaeli...| | 4| Belle|1-246-894-6340|vitae.aliquet.nec...| | 17| Blossom|1-154-406-9596|Nunc.commodo.auct...| | 1| Burke|1-300-746-8446|ullamcorper.velit...| | 12| Caleb|1-683-212-0896|Suspendisse@Quisq...| | 20| Edward|1-711-710-6552|lectus@aliquetlib...| | 11| Emi|1-467-270-1337| est@nunc.com| | 13|Florence|1-603-575-2444|sit.amet.dapibus@...| | 18| Guy|1-869-521-3230|senectus.et.netus...| | 2| Kamal|1-668-571-5046|pede.Suspendisse@...| | 8| Kaseem|1-881-586-2689|cursus.et.magna@e...| | 6| Laurel|1-691-379-9921|adipiscing@consec...| | 9| Lev|1-916-367-5608|Vivamus.nisi@ipsu...| | 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...| | 10| Maya|1-271-683-2698|accumsan.convalli...| | 23| NULL|1-711-710-6552|lectus@aliquetlib...| | 3| Olga|1-956-311-1686|Aenean.eget.metus...| +---+--------+--------------+--------------------+ +---+--------+--------------+--------------------+ | id| name| phone| email| +---+--------+--------------+--------------------+ | 5| Trevor|1-300-527-4967|dapibus.id@acturp...| | 15| Tarik|1-398-171-2268|turpis@felisorci.com| | 7| Sara|1-608-140-1995|Donec.nibh@enimEt...| | 3| Olga|1-956-311-1686|Aenean.eget.metus...| | 23| NULL|1-711-710-6552|lectus@aliquetlib...| | 10| Maya|1-271-683-2698|accumsan.convalli...| | 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...| | 9| Lev|1-916-367-5608|Vivamus.nisi@ipsu...| | 6| Laurel|1-691-379-9921|adipiscing@consec...| | 8| Kaseem|1-881-586-2689|cursus.et.magna@e...| | 2| Kamal|1-668-571-5046|pede.Suspendisse@...| | 18| Guy|1-869-521-3230|senectus.et.netus...| | 13|Florence|1-603-575-2444|sit.amet.dapibus@...| | 11| Emi|1-467-270-1337| est@nunc.com| | 20| Edward|1-711-710-6552|lectus@aliquetlib...| | 12| Caleb|1-683-212-0896|Suspendisse@Quisq...| | 1| Burke|1-300-746-8446|ullamcorper.velit...| | 17| Blossom|1-154-406-9596|Nunc.commodo.auct...| | 4| Belle|1-246-894-6340|vitae.aliquet.nec...| | 14| Anika|1-856-828-7883|euismod@ligulaeli...| +---+--------+--------------+--------------------+ +---+--------+--------------+--------------------+ | id| name| phone| email| +---+--------+--------------+--------------------+ | 21| |1-711-710-6552|lectus@aliquetlib...| | 22| |1-711-710-6552|lectus@aliquetlib...| | 16| Amena|1-878-250-3129|lorem.luctus.ut@s...| | 14| Anika|1-856-828-7883|euismod@ligulaeli...| | 4| Belle|1-246-894-6340|vitae.aliquet.nec...| | 17| Blossom|1-154-406-9596|Nunc.commodo.auct...| | 1| Burke|1-300-746-8446|ullamcorper.velit...| | 12| Caleb|1-683-212-0896|Suspendisse@Quisq...| | 20| Edward|1-711-710-6552|lectus@aliquetlib...| | 11| Emi|1-467-270-1337| est@nunc.com| | 13|Florence|1-603-575-2444|sit.amet.dapibus@...| | 18| Guy|1-869-521-3230|senectus.et.netus...| | 2| Kamal|1-668-571-5046|pede.Suspendisse@...| | 8| Kaseem|1-881-586-2689|cursus.et.magna@e...| | 6| Laurel|1-691-379-9921|adipiscing@consec...| | 9| Lev|1-916-367-5608|Vivamus.nisi@ipsu...| | 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...| | 10| Maya|1-271-683-2698|accumsan.convalli...| | 23| NULL|1-711-710-6552|lectus@aliquetlib...| | 3| Olga|1-956-311-1686|Aenean.eget.metus...| +---+--------+--------------+--------------------+ +---+--------+--------------+--------------------+ | id| name| phone| email| +---+--------+--------------+--------------------+ | 22| |1-711-710-6552|lectus@aliquetlib...| | 21| |1-711-710-6552|lectus@aliquetlib...| | 16| Amena|1-878-250-3129|lorem.luctus.ut@s...| | 14| Anika|1-856-828-7883|euismod@ligulaeli...| | 4| Belle|1-246-894-6340|vitae.aliquet.nec...| | 17| Blossom|1-154-406-9596|Nunc.commodo.auct...| | 1| Burke|1-300-746-8446|ullamcorper.velit...| | 12| Caleb|1-683-212-0896|Suspendisse@Quisq...| | 20| Edward|1-711-710-6552|lectus@aliquetlib...| | 11| Emi|1-467-270-1337| est@nunc.com| | 13|Florence|1-603-575-2444|sit.amet.dapibus@...| | 18| Guy|1-869-521-3230|senectus.et.netus...| | 2| Kamal|1-668-571-5046|pede.Suspendisse@...| | 8| Kaseem|1-881-586-2689|cursus.et.magna@e...| | 6| Laurel|1-691-379-9921|adipiscing@consec...| | 9| Lev|1-916-367-5608|Vivamus.nisi@ipsu...| | 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...| | 10| Maya|1-271-683-2698|accumsan.convalli...| | 23| NULL|1-711-710-6552|lectus@aliquetlib...| | 3| Olga|1-956-311-1686|Aenean.eget.metus...| +---+--------+--------------+--------------------+ +------------+ |student_name| +------------+ | Burke| | Kamal| | Olga| | Belle| | Trevor| | Laurel| | Sara| | Kaseem| | Lev| | Maya| | Emi| | Caleb| | Florence| | Anika| | Tarik| | Amena| | Blossom| | Guy| | Malachi| | Edward| +------------+ +---+--------+--------------+--------------------+---+--------+--------------+--------------------+ | id| name| phone| email| id| name| phone| email| +---+--------+--------------+--------------------+---+--------+--------------+--------------------+ | 12| Caleb|1-683-212-0896|Suspendisse@Quisq...| 12| Caleb|1-683-212-0896|Suspendisse@Quisq...| | 22| |1-711-710-6552|lectus@aliquetlib...| 22| |1-711-710-6552|lectus@aliquetlib...| | 1| Burke|1-300-746-8446|ullamcorper.velit...| 1| Burke|1-300-746-8446|ullamcorper.velit...| | 13|Florence|1-603-575-2444|sit.amet.dapibus@...| 13|Florence|1-603-575-2444|sit.amet.dapibus@...| | 6| Laurel|1-691-379-9921|adipiscing@consec...| 6| Laurel|1-691-379-9921|adipiscing@consec...| | 16| Amena|1-878-250-3129|lorem.luctus.ut@s...| 16| Amena|1-878-250-3129|lorem.luctus.ut@s...| | 3| Olga|1-956-311-1686|Aenean.eget.metus...| 3| Olga|1-956-311-1686|Aenean.eget.metus...| | 20| Edward|1-711-710-6552|lectus@aliquetlib...| 20| Edward|1-711-710-6552|lectus@aliquetlib...| | 5| Trevor|1-300-527-4967|dapibus.id@acturp...| 5| Trevor|1-300-527-4967|dapibus.id@acturp...| | 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...| | 15| Tarik|1-398-171-2268|turpis@felisorci.com| 15| Tarik|1-398-171-2268|turpis@felisorci.com| | 9| Lev|1-916-367-5608|Vivamus.nisi@ipsu...| 9| Lev|1-916-367-5608|Vivamus.nisi@ipsu...| | 17| Blossom|1-154-406-9596|Nunc.commodo.auct...| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...| | 4| Belle|1-246-894-6340|vitae.aliquet.nec...| 4| Belle|1-246-894-6340|vitae.aliquet.nec...| | 8| Kaseem|1-881-586-2689|cursus.et.magna@e...| 8| Kaseem|1-881-586-2689|cursus.et.magna@e...| | 23| NULL|1-711-710-6552|lectus@aliquetlib...| 23| NULL|1-711-710-6552|lectus@aliquetlib...| | 7| Sara|1-608-140-1995|Donec.nibh@enimEt...| 7| Sara|1-608-140-1995|Donec.nibh@enimEt...| | 10| Maya|1-271-683-2698|accumsan.convalli...| 10| Maya|1-271-683-2698|accumsan.convalli...| | 21| |1-711-710-6552|lectus@aliquetlib...| 21| |1-711-710-6552|lectus@aliquetlib...| | 11| Emi|1-467-270-1337| est@nunc.com| 11| Emi|1-467-270-1337| est@nunc.com| +---+--------+--------------+--------------------+---+--------+--------------+--------------------+ only showing top 20 rows
DataFrame = Dataset[Row]
Dataset:强类型 typed case class
DataFrame:弱类型 Row
SQL: seletc name from person; compile ok, result no
DF:
df.select(“name”) compile no
df.select(“nname”) compile ok
DS:ds.map(line => line.itemid) compile no