一、使用Spark SQL完成任务1里面的数据筛选
先是创建dataframe数据:
import pandas as pd from pyspark.sql import SparkSession # 创建spark应用 spark = SparkSession.builder.appName('mypyspark').getOrCreate() # 用python链接spark环境 # 创建dataframe数据 test = spark.createDataFrame([('001','1',100,87,67,83,98), ('002','2',87,81,90,83,83), ('003','3',86,91,83,89,63), ('004','2',65,87,94,73,88), ('005','1',76,62,89,81,98), ('006','3',84,82,85,73,99), ('007','3',56,76,63,72,87), ('008','1',55,62,46,78,71), ('009','2',63,72,87,98,64)], ['number','class','language','math','english','physic','chemical']) test.createOrReplaceTempView('table') spark.sql('select * from table').show()
其实和sql基本语法一样,没啥讲的,基本的select
、where
、count
操作:
# 数据表的行数 spark.sql('select count(*) as row_num from table').show() # 数据的列数 spark.sql('show columns from table').count() # 7 # 用spark筛选class为1的样本 sql_seq = 'select * from table where class = 1' spark.sql(sql_seq).show() # 用spark筛选language>90或math>90的样本 sql_seq = 'select * from table where language > 90 or math >90' spark.sql(sql_seq).show()
结果为:
+-------+ |row_num| +-------+ | 9| +-------+ +------+-----+--------+----+-------+------+--------+ |number|class|language|math|english|physic|chemical| +------+-----+--------+----+-------+------+--------+ | 001| 1| 100| 87| 67| 83| 98| | 005| 1| 76| 62| 89| 81| 98| | 008| 1| 55| 62| 46| 78| 71| +------+-----+--------+----+-------+------+--------+ +------+-----+--------+----+-------+------+--------+ |number|class|language|math|english|physic|chemical| +------+-----+--------+----+-------+------+--------+ | 001| 1| 100| 87| 67| 83| 98| | 003| 3| 86| 91| 83| 89| 63| +------+-----+--------+----+-------+------+--------+
二、使用Spark SQL完成任务2里面的统计(列可以不统计)
# 分析每列的类型,取值个数 sql_seq = 'describe table' spark.sql(sql_seq).show() # 看Type2列的的取值情况 sql_seq = 'select distinct `language` from table' spark.sql(sql_seq).show() # 分析每列是否含有缺失值(个数),此处木有 sql_seq = 'select * from table where `class` is null' spark.sql(sql_seq).show()
结果如下,注意列的类型,如果是字符串则需要左右加上两个点。
+--------+---------+-------+ |col_name|data_type|comment| +--------+---------+-------+ | number| string| null| | class| string| null| |language| bigint| null| | math| bigint| null| | english| bigint| null| | physic| bigint| null| |chemical| bigint| null| +--------+---------+-------+ +--------+ |language| +--------+ | 87| | 100| | 65| | 86| | 84| | 76| | 63| | 56| | 55| +--------+ +------+-----+--------+----+-------+------+--------+ |number|class|language|math|english|physic|chemical| +------+-----+--------+----+-------+------+--------+ +------+-----+--------+----+-------+------+--------+
三、使用Spark SQL完成任务3的分组统计
这里用另一个数据集(Pokemon):
import pandas as pd from pyspark.sql import SparkSession # 创建spark应用 spark = SparkSession.builder.appName('mypyspark').getOrCreate() # 用python链接spark环境 from pyspark import SparkFiles spark.sparkContext.addFile('https://cdn.coggle.club/Pokemon.csv') #path = "file://"+SparkFiles.get("Pokemon.csv") # 将读取的进行保存,表头也需要保存 #df = spark.read.csv(path=path, header=True, inferSchema= True) df = spark.read.csv(SparkFiles.get('Pokemon.csv'), header = True, inferSchema = True) df.show(5) ###### 数据集如下 ############### +--------------------+------+------+-----+---+------+-------+-------+-------+-----+----------+---------+ | Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp. Atk|Sp. Def|Speed|Generation|Legendary| +--------------------+------+------+-----+---+------+-------+-------+-------+-----+----------+---------+ | Bulbasaur| Grass|Poison| 318| 45| 49| 49| 65| 65| 45| 1| false| | Ivysaur| Grass|Poison| 405| 60| 62| 63| 80| 80| 60| 1| false| | Venusaur| Grass|Poison| 525| 80| 82| 83| 100| 100| 80| 1| false| |VenusaurMega Venu...| Grass|Poison| 625| 80| 100| 123| 122| 120| 80| 1| false| | Charmander| Fire| null| 309| 39| 52| 43| 60| 50| 65| 1| false| +--------------------+------+------+-----+---+------+-------+-------+-------+-----+----------+---------+ only showing top 5 rows
分组聚合:
# 分组统计,使用Spark SQL完成任务3的分组统计——统计数据在Type 1分组下 HP的均值 sql_seq = 'select `Type 2`, avg(HP) from table2 group by `Type 2`' spark.sql(sql_seq).show() #### 结果如下 ######## +--------+------------------+ | Type 2| avg(HP)| +--------+------------------+ | Water|62.714285714285715| | Poison| 58.76470588235294| | Steel| 64.63636363636364| | Rock| 68.07142857142857| | Ice| 90.0| | null| 67.76683937823834| | Ghost|59.142857142857146| | Fairy| 64.30434782608695| | Psychic| 72.21212121212122| | Dragon| 82.16666666666667| | Flying| 71.3917525773196| | Bug|53.333333333333336| |Electric| 88.16666666666667| | Fire| 71.25| | Ground| 77.22857142857143| | Dark| 75.55| |Fighting| 79.46153846153847| | Grass| 62.64| | Normal| 63.5| +--------+------------------+