All examples below assume that pandas has already been imported:
import pandas
We use a CSV file, food_info.csv, to show how pandas reads data.
Read the CSV file:
food_info = pandas.read_csv("food_info.csv")
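For readers who do not have food_info.csv at hand, the following is a minimal sketch that reads the same kind of data from an in-memory string instead of a file. The variable names `csv_text` and `sample` are illustrative assumptions. The three rows reuse values that appear in the outputs further down, and only four of the 36 columns are included.

```python
import io
import pandas

# Small stand-in for food_info.csv (assumption: only 4 of the 36 columns).
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1001,BUTTER WITH SALT,15.87,717
1003,BUTTER OIL ANHYDROUS,0.24,876
1004,CHEESE BLUE,42.41,353
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pandas.read_csv(io.StringIO(csv_text))

print(type(sample))
print(sample.shape)
```

The first row of the text is taken as the header, so the result has three data rows and four columns.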
1. Check the data structure: read_csv returns a DataFrame
print(type(food_info))
OUT:
<class 'pandas.core.frame.DataFrame'>
2. Check the data type of each column. Common pandas dtypes include int, float, object, datetime, and bool; object usually means the column holds string values
print(food_info.dtypes)
OUT:
NDB_No               int64
Shrt_Desc           object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
…………………………
Cholestrl_(mg)     float64
dtype: object
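As a small sketch of how dtypes can be used, the example below builds a tiny DataFrame by hand (the name `sample` and its contents are assumptions for illustration), checks the dtype of individual columns, and keeps only the numeric columns with `select_dtypes`. Depending on the pandas version, text columns may be reported as object or as a dedicated string dtype, so the example does not rely on that.

```python
import pandas

# Hand-built stand-in for a few food_info columns (illustrative values).
sample = pandas.DataFrame({
    "NDB_No": [1001, 1003],
    "Shrt_Desc": ["BUTTER WITH SALT", "BUTTER OIL ANHYDROUS"],
    "Water_(g)": [15.87, 0.24],
})

# dtypes is a Series indexed by column name.
print(sample.dtypes)

# The dtype of a single column.
print(sample["Water_(g)"].dtype)

# Keep only the numeric columns (int and float here).
numeric = sample.select_dtypes(include="number")
print(numeric.columns.tolist())
```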
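3. View the official help documentation
print(help(pandas.read_csv))
OUT:
Help on function read_csv in module pandas.io.parsers:
…………………………
Returns
-------
result : DataFrame or TextParser
None
The final None is the return value of help(), which print() echoes; calling help(pandas.read_csv) on its own shows the same documentation without it.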
4. (1) View the first 5 rows of the DataFrame; if an argument n is given, the first n rows are shown
food_info.head()
OUT:
(2) View the last 5 rows of the DataFrame; if an argument n is given, the last n rows are shown
food_info.tail()
OUT:
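The effect of the argument n can be sketched on a small hand-built DataFrame (the name `sample` and its ten rows are assumptions for illustration):

```python
import pandas

# Ten rows, so the default of 5 and an explicit n are easy to tell apart.
sample = pandas.DataFrame({"NDB_No": list(range(1001, 1011))})

first_five = sample.head()    # no argument: first 5 rows
first_three = sample.head(3)  # first 3 rows
last_two = sample.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```

Both methods return a new DataFrame and keep the original index labels, so the last two rows still carry the labels 8 and 9.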
5. Return the column names
print(food_info.columns)
OUT:
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)', 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)'], dtype='object')
6. Check the shape of the DataFrame as (number of rows, number of columns)
print(food_info.shape)
OUT:
(8618, 36)
7. View the row with a given index label
print(food_info.loc[0])
OUT:
NDB_No                         1001
Shrt_Desc          BUTTER WITH SALT
Water_(g)                     15.87
Energ_Kcal                      717
Protein_(g)                    0.85
Lipid_Tot_(g)                 81.11
Ash_(g)                        2.11
Carbohydrt_(g)                 0.06
Fiber_TD_(g)                      0
Sugar_Tot_(g)                  0.06
Calcium_(mg)                     24
Iron_(mg)                      0.02
Magnesium_(mg)                    2
Phosphorus_(mg)                  24
Potassium_(mg)                   24
Sodium_(mg)                     643
Zinc_(mg)                      0.09
Copper_(mg)                       0
Manganese_(mg)                    0
Selenium_(mcg)                    1
Vit_C_(mg)                        0
Thiamin_(mg)                  0.005
Riboflavin_(mg)               0.034
Niacin_(mg)                   0.042
Vit_B6_(mg)                   0.003
Vit_B12_(mcg)                  0.17
Vit_A_IU                       2499
Vit_A_RAE                       684
Vit_E_(mg)                     2.32
Vit_D_mcg                       1.5
Vit_D_IU                         60
Vit_K_(mcg)                       7
FA_Sat_(g)                   51.368
FA_Mono_(g)                  21.021
FA_Poly_(g)                   3.043
Cholestrl_(mg)                  215
Name: 0, dtype: object
8. Return a slice of rows by index label (with loc, both endpoints of the slice are included)
print(food_info.loc[3:5])
OUT:
   NDB_No     Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  \
3    1004   CHEESE BLUE      42.41         353        21.40          28.74
4    1005  CHEESE BRICK      41.11         371        23.24          29.68
5    1006   CHEESE BRIE      48.42         334        20.75          27.68

   Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  ...  \
3     5.11            2.34           0.0           0.50  ...
4     3.18            2.79           0.0           0.51  ...
5     2.70            0.45           0.0           0.45  ...

   Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  Vit_K_(mcg)  \
3     721.0      198.0        0.25        0.5      21.0          2.4
4    1080.0      292.0        0.26        0.5      22.0          2.5
5     592.0      174.0        0.24        0.5      20.0          2.3

   FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)
3      18.669        7.778        0.800            75.0
4      18.764        8.598        0.784            94.0
5      17.410        8.013        0.826           100.0

[3 rows x 36 columns]
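The output for loc[3:5] contains rows 3, 4, and 5, because loc slices by label and includes both endpoints. As a sketch of the difference from position-based slicing, the example below compares it with iloc on a small hand-built DataFrame (the name `sample` is an assumption for illustration):

```python
import pandas

# Six rows with the default integer index 0..5.
sample = pandas.DataFrame({"NDB_No": [1001, 1002, 1003, 1004, 1005, 1006]})

by_label = sample.loc[3:5]      # labels 3, 4 and 5 (end label included)
by_position = sample.iloc[3:5]  # positions 3 and 4 (end position excluded)

print(by_label)
print(by_position)
```

With the default integer index the labels and positions coincide, which makes the one-row difference easy to overlook.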
9. Return several rows selected by a list of index labels
two_five_ten = [2, 5, 10]
print(food_info.loc[two_five_ten])
OUT:
    NDB_No             Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
2     1003  BUTTER OIL ANHYDROUS       0.24         876         0.28
5     1006           CHEESE BRIE      48.42         334        20.75
10    1011          CHEESE COLBY      38.20         394        23.76

    Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \
2           99.48     0.00            0.00           0.0           0.00
5           27.68     2.70            0.45           0.0           0.45
10          32.11     3.36            2.57           0.0           0.52

    ...  Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  \
2   ...    3069.0      840.0        2.80        1.8      73.0
5   ...     592.0      174.0        0.24        0.5      20.0
10  ...     994.0      264.0        0.28        0.6      24.0

    Vit_K_(mcg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)
2           8.6      61.924       28.732        3.694           256.0
5           2.3      17.410        8.013        0.826           100.0
10          2.7      20.218        9.280        0.953            95.0

[3 rows x 36 columns]
10. (1) View a single column of the DataFrame
food_info["NDB_No"]
OUT:
0        1001
1        1002
2        1003
3        1004
4        1005
5        1006
6        1007
7        1008
8        1009
9        1010
10       1011
11       1012
12       1013
13       1014
14       1015
15       1016
16       1017
17       1018
18       1019
19       1020
20       1021
21       1022
22       1023
23       1024
24       1025
25       1026
26       1027
27       1028
28       1029
29       1030
...
8588    43544
8589    43546
8590    43550
8591    43566
8592    43570
8593    43572
8594    43585
8595    43589
8596    43595
8597    43597
8598    43598
8599    44005
8600    44018
8601    44048
8602    44055
8603    44061
8604    44074
8605    44110
8606    44158
8607    44203
8608    44258
8609    44259
8610    44260
8611    48052
8612    80200
8613    83110
8614    90240
8615    90480
8616    90560
8617    93600
Name: NDB_No, Length: 8618, dtype: int64
(2) View multiple columns by passing a list of column names
col = ["Ash_(g)", "Fiber_TD_(g)"]
food_info[col]
OUT:
      Ash_(g)  Fiber_TD_(g)
0        2.11           0.0
1        2.11           0.0
2        0.00           0.0
3        5.11           0.0
4        3.18           0.0
5        2.70           0.0
6        3.68           0.0
7        3.28           0.0
8        3.71           0.0
9        3.60           0.0
10       3.36           0.0
11       1.41           0.0
12       1.20           0.2
13       1.71           0.0
14       1.27           0.0
15       1.39           0.0
16       1.32           0.0
17       4.22           0.0
18       5.20           0.0
19       3.79           0.0
20       4.75           0.0
21       3.94           0.0
22       4.30           0.0
23       3.79           0.0
24       3.55           0.0
25       3.28           0.0
26       2.91           0.0
27       3.27           0.0
28       3.80           0.0
29       3.66           0.0
...       ...           ...
8588     2.00           2.6
8589     0.76           1.6
8590     0.29           1.0
8591     1.85           5.7
8592     1.22           4.2
8593     1.71          14.2
8594     0.52           2.0
8595     3.50           0.0
8596     0.80           2.1
8597     2.40           0.0
8598     0.40           0.0
8599     0.00           0.0
8600     0.00           0.1
8601     4.74           0.0
8602    13.90          27.8
8603     9.90           6.1
8604     0.22           0.1
8605     0.08           0.8
8606     0.35           2.6
8607     0.07           0.0
8608     5.70          10.1
8609     1.86           0.9
8610     6.80           0.8
8611     1.00           0.6
8612     1.40           0.0
8613    13.40           0.0
8614     2.97           0.0
8615     0.86           0.0
8616     1.30           0.0
8617     1.20           0.0

8618 rows × 2 columns
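The two outputs above differ in kind: a single column name yields a Series, while a list of names yields a DataFrame. A minimal sketch on a hand-built DataFrame (the name `sample` is an assumption; the values follow the first three rows shown above):

```python
import pandas

sample = pandas.DataFrame({
    "NDB_No": [1001, 1002, 1003],
    "Ash_(g)": [2.11, 2.11, 0.00],
    "Fiber_TD_(g)": [0.0, 0.0, 0.0],
})

one_column = sample["Ash_(g)"]                     # Series
two_columns = sample[["Ash_(g)", "Fiber_TD_(g)"]]  # DataFrame
one_as_frame = sample[["Ash_(g)"]]                 # DataFrame with one column

print(type(one_column))
print(type(two_columns))
print(one_as_frame.shape)
```

Passing a one-element list is a convenient way to keep the result two-dimensional when later code expects a DataFrame.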
11. Select the columns of food_info whose unit is grams (g)
col_name = food_info.columns.tolist()
print(col_name)
print("__________")
gram_columns = []
for c in col_name:
    # str.endswith() returns True if the string ends with the given suffix, otherwise False
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head())
OUT:
['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)', 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)']
__________
   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0      15.87         0.85          81.11     2.11            0.06
1      15.87         0.85          81.11     2.11            0.06
2       0.24         0.28          99.48     0.00            0.00
3      42.41        21.40          28.74     5.11            2.34
4      41.11        23.24          29.68     3.18            2.79

   Fiber_TD_(g)  Sugar_Tot_(g)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)
0           0.0           0.06      51.368       21.021        3.043
1           0.0           0.06      50.489       23.426        3.012
2           0.0           0.00      61.924       28.732        3.694
3           0.0           0.50      18.669        7.778        0.800
4           0.0           0.51      18.764        8.598        0.784
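The loop above can also be written more compactly. The sketch below shows two alternatives on an empty hand-built DataFrame that only carries column names (the name `sample` and the choice of five columns are assumptions for illustration): a list comprehension, and a boolean mask built with the vectorized `columns.str.endswith`. Note that the suffix "(g)" does not match "(mg)" or "(mcg)", because the opening parenthesis is part of the suffix.

```python
import pandas

# Only the column names matter here, so the DataFrame has no rows.
sample = pandas.DataFrame(
    columns=["NDB_No", "Water_(g)", "Calcium_(mg)", "Selenium_(mcg)", "FA_Sat_(g)"]
)

# Alternative 1: list comprehension instead of the explicit loop.
gram_columns = [c for c in sample.columns if c.endswith("(g)")]
gram_df = sample[gram_columns]

# Alternative 2: boolean mask over the column index.
mask = sample.columns.str.endswith("(g)")
gram_df_masked = sample.loc[:, mask]

print(gram_columns)
print(gram_df_masked.columns.tolist())
```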
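12. Sort in ascending order
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=True)
print(food_info["Sodium_(mg)"])
Parameters:
inplace: if True, food_info itself is sorted and nothing is returned; if False (the default), a new sorted DataFrame is returned and food_info is left unchanged.
ascending: if True (the default), sort from smallest to largest; if False, sort from largest to smallest.
Rows whose value is NaN are placed at the end by default, as the output shows.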
OUT:
760     0.0
8607    0.0
629     0.0
631     0.0
6470    0.0
654     0.0
8599    0.0
657     0.0
633     0.0
635     0.0
637     0.0
638     0.0
639     0.0
646     0.0
653     0.0
632     0.0
606     0.0
6463    0.0
634     0.0
666     0.0
8387    0.0
611     0.0
434     0.0
655     0.0
661     0.0
3663    0.0
3664    0.0
3665    0.0
656     0.0
3697    0.0
...
8153    NaN
8155    NaN
8156    NaN
8157    NaN
8158    NaN
8159    NaN
8160    NaN
8161    NaN
8163    NaN
8164    NaN
8165    NaN
8167    NaN
8169    NaN
8170    NaN
8172    NaN
8173    NaN
8174    NaN
8175    NaN
8176    NaN
8177    NaN
8178    NaN
8179    NaN
8180    NaN
8181    NaN
8183    NaN
8184    NaN
8185    NaN
8195    NaN
8251    NaN
8267    NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
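As a sketch of the other ways to call sort_values, the example below sorts a small hand-built DataFrame without inplace, in descending order, and with NaN values moved to the front. The name `sample` and its four values are assumptions for illustration.

```python
import pandas

sample = pandas.DataFrame({
    "NDB_No": [1001, 1002, 1003, 1004],
    "Sodium_(mg)": [643.0, float("nan"), 0.0, 2.0],
})

# Without inplace=True a new DataFrame is returned; sample is unchanged.
ascending_df = sample.sort_values("Sodium_(mg)")

# Descending order; NaN rows still go last by default.
descending_df = sample.sort_values("Sodium_(mg)", ascending=False)

# na_position="first" moves the NaN rows to the top instead.
nan_first_df = sample.sort_values("Sodium_(mg)", na_position="first")

# The original index labels travel with the rows; reset_index renumbers them.
renumbered = ascending_df.reset_index(drop=True)

print(ascending_df)
print(descending_df)
print(nan_first_df)
```

Returning a new DataFrame keeps the original row order available, which helps when later steps still rely on it.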