Pandas 2.2 中文官方教程和指南(五)(2)https://developer.aliyun.com/article/1510584
选择列
SAS 提供了在DATA
步骤中选择、删除和重命名列的关键字。
data tips; set tips; keep sex total_bill tip; run; data tips; set tips; drop sex; run; data tips; set tips; rename total_bill=total_bill_2; run;
下面以 pandas 表达了相同的操作。
保留特定列
In [1]: tips[["sex", "total_bill", "tip"]] Out[1]: sex total_bill tip 0 Female 14.99 1.01 1 Male 8.34 1.66 2 Male 19.01 3.50 3 Male 21.68 3.31 4 Female 22.59 3.61 .. ... ... ... 239 Male 27.03 5.92 240 Female 25.18 2.00 241 Male 20.67 2.00 242 Male 15.82 1.75 243 Female 16.78 3.00 [244 rows x 3 columns]
删除一列
In [2]: tips.drop("sex", axis=1) Out[2]: total_bill tip smoker day time size 0 14.99 1.01 No Sun Dinner 2 1 8.34 1.66 No Sun Dinner 3 2 19.01 3.50 No Sun Dinner 3 3 21.68 3.31 No Sun Dinner 2 4 22.59 3.61 No Sun Dinner 4 .. ... ... ... ... ... ... 239 27.03 5.92 No Sat Dinner 3 240 25.18 2.00 Yes Sat Dinner 2 241 20.67 2.00 Yes Sat Dinner 2 242 15.82 1.75 No Sat Dinner 2 243 16.78 3.00 No Thur Dinner 2 [244 rows x 6 columns]
重命名一列
In [3]: tips.rename(columns={"total_bill": "total_bill_2"}) Out[3]: total_bill_2 tip sex smoker day time size 0 14.99 1.01 Female No Sun Dinner 2 1 8.34 1.66 Male No Sun Dinner 3 2 19.01 3.50 Male No Sun Dinner 3 3 21.68 3.31 Male No Sun Dinner 2 4 22.59 3.61 Female No Sun Dinner 4 .. ... ... ... ... ... ... ... 239 27.03 5.92 Male No Sat Dinner 3 240 25.18 2.00 Female Yes Sat Dinner 2 241 20.67 2.00 Male Yes Sat Dinner 2 242 15.82 1.75 Male No Sat Dinner 2 243 16.78 3.00 Female No Thur Dinner 2 [244 rows x 7 columns]
按值排序
在 SAS 中,通过PROC SORT
来实现排序
proc sort data=tips; by sex total_bill; run;
pandas 有一个DataFrame.sort_values()
方法,可以按列排序。
In [1]: tips = tips.sort_values(["sex", "total_bill"]) In [2]: tips Out[2]: total_bill tip sex smoker day time size 67 1.07 1.00 Female Yes Sat Dinner 1 92 3.75 1.00 Female Yes Fri Dinner 2 111 5.25 1.00 Female No Sat Dinner 1 145 6.35 1.50 Female No Thur Lunch 2 135 6.51 1.25 Female No Thur Lunch 2 .. ... ... ... ... ... ... ... 182 43.35 3.50 Male Yes Sun Dinner 3 156 46.17 5.00 Male No Sun Dinner 6 59 46.27 6.73 Male No Sat Dinner 4 212 46.33 9.00 Male No Sat Dinner 4 170 48.81 10.00 Male Yes Sat Dinner 3 [244 rows x 7 columns]
列上的操作
在DATA
步骤中,可以对新列或现有列使用任意数学表达式。
data tips; set tips; total_bill = total_bill - 2; new_bill = total_bill / 2; run;
pandas 通过在DataFrame
中指定单独的Series
来提供矢量化操作。新列可以以相同的方式分配。DataFrame.drop()
方法从DataFrame
中删除一列。
In [1]: tips["total_bill"] = tips["total_bill"] - 2 In [2]: tips["new_bill"] = tips["total_bill"] / 2 In [3]: tips Out[3]: total_bill tip sex smoker day time size new_bill 0 14.99 1.01 Female No Sun Dinner 2 7.495 1 8.34 1.66 Male No Sun Dinner 3 4.170 2 19.01 3.50 Male No Sun Dinner 3 9.505 3 21.68 3.31 Male No Sun Dinner 2 10.840 4 22.59 3.61 Female No Sun Dinner 4 11.295 .. ... ... ... ... ... ... ... ... 239 27.03 5.92 Male No Sat Dinner 3 13.515 240 25.18 2.00 Female Yes Sat Dinner 2 12.590 241 20.67 2.00 Male Yes Sat Dinner 2 10.335 242 15.82 1.75 Male No Sat Dinner 2 7.910 243 16.78 3.00 Female No Thur Dinner 2 8.390 [244 rows x 8 columns] In [4]: tips = tips.drop("new_bill", axis=1)
过滤
在 SAS 中,使用if
或where
语句在一个或多个列上进行过滤。
data tips; set tips; if total_bill > 10; run; data tips; set tips; where total_bill > 10; /* equivalent in this case - where happens before the DATA step begins and can also be used in PROC statements */ run;
DataFrames 可以以多种方式进行过滤;其中最直观的是使用布尔索引。
In [1]: tips[tips["total_bill"] > 10] Out[1]: total_bill tip sex smoker day time size 0 14.99 1.01 Female No Sun Dinner 2 2 19.01 3.50 Male No Sun Dinner 3 3 21.68 3.31 Male No Sun Dinner 2 4 22.59 3.61 Female No Sun Dinner 4 5 23.29 4.71 Male No Sun Dinner 4 .. ... ... ... ... ... ... ... 239 27.03 5.92 Male No Sat Dinner 3 240 25.18 2.00 Female Yes Sat Dinner 2 241 20.67 2.00 Male Yes Sat Dinner 2 242 15.82 1.75 Male No Sat Dinner 2 243 16.78 3.00 Female No Thur Dinner 2 [204 rows x 7 columns]
上述语句只是将True
/False
对象的Series
传递给 DataFrame,返回所有具有True
的行。
In [2]: is_dinner = tips["time"] == "Dinner" In [3]: is_dinner Out[3]: 0 True 1 True 2 True 3 True 4 True ... 239 True 240 True 241 True 242 True 243 True Name: time, Length: 244, dtype: bool In [4]: is_dinner.value_counts() Out[4]: time True 176 False 68 Name: count, dtype: int64 In [5]: tips[is_dinner] Out[5]: total_bill tip sex smoker day time size 0 14.99 1.01 Female No Sun Dinner 2 1 8.34 1.66 Male No Sun Dinner 3 2 19.01 3.50 Male No Sun Dinner 3 3 21.68 3.31 Male No Sun Dinner 2 4 22.59 3.61 Female No Sun Dinner 4 .. ... ... ... ... ... ... ... 239 27.03 5.92 Male No Sat Dinner 3 240 25.18 2.00 Female Yes Sat Dinner 2 241 20.67 2.00 Male Yes Sat Dinner 2 242 15.82 1.75 Male No Sat Dinner 2 243 16.78 3.00 Female No Thur Dinner 2 [176 rows x 7 columns]
如果/那么逻辑
在 SAS 中,可以使用如果/那么逻辑来创建新列。
data tips; set tips; format bucket $4.; if total_bill < 10 then bucket = 'low'; else bucket = 'high'; run;
在 pandas 中,可以使用numpy
的where
方法来实现相同的操作。
In [1]: tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high") In [2]: tips Out[2]: total_bill tip sex smoker day time size bucket 0 14.99 1.01 Female No Sun Dinner 2 high 1 8.34 1.66 Male No Sun Dinner 3 low 2 19.01 3.50 Male No Sun Dinner 3 high 3 21.68 3.31 Male No Sun Dinner 2 high 4 22.59 3.61 Female No Sun Dinner 4 high .. ... ... ... ... ... ... ... ... 239 27.03 5.92 Male No Sat Dinner 3 high 240 25.18 2.00 Female Yes Sat Dinner 2 high 241 20.67 2.00 Male Yes Sat Dinner 2 high 242 15.82 1.75 Male No Sat Dinner 2 high 243 16.78 3.00 Female No Thur Dinner 2 high [244 rows x 8 columns]
日期功能
SAS 提供了各种函数来对日期/时间列进行操作。
data tips; set tips; format date1 date2 date1_plusmonth mmddyy10.; date1 = mdy(1, 15, 2013); date2 = mdy(2, 15, 2015); date1_year = year(date1); date2_month = month(date2); * shift date to beginning of next interval; date1_next = intnx('MONTH', date1, 1); * count intervals between dates; months_between = intck('MONTH', date1, date2); run;
下面显示了等效的 pandas 操作。除了这些功能外,pandas 还支持 Base SAS 中不可用的其他时间序列功能(例如重新采样和自定义偏移)-请参阅 timeseries 文档了解更多详情。
In [1]: tips["date1"] = pd.Timestamp("2013-01-15") In [2]: tips["date2"] = pd.Timestamp("2015-02-15") In [3]: tips["date1_year"] = tips["date1"].dt.year In [4]: tips["date2_month"] = tips["date2"].dt.month In [5]: tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin() In [6]: tips["months_between"] = tips["date2"].dt.to_period("M") - tips[ ...: "date1" ...: ].dt.to_period("M") ...: In [7]: tips[ ...: ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"] ...: ] ...: Out[7]: date1 date2 date1_year date2_month date1_next months_between 0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> 1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> 2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> 3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> 4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> .. ... ... ... ... ... ... 239 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> 240 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> 241 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> 242 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> 243 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds> [244 rows x 6 columns]
选择列
SAS 提供了在DATA
步骤中选择、删除和重命名列的关键字。
data tips; set tips; keep sex total_bill tip; run; data tips; set tips; drop sex; run; data tips; set tips; rename total_bill=total_bill_2; run;
下面以 pandas 表达了相同的操作。
保留特定列
In [1]: tips[["sex", "total_bill", "tip"]] Out[1]: sex total_bill tip 0 Female 14.99 1.01 1 Male 8.34 1.66 2 Male 19.01 3.50 3 Male 21.68 3.31 4 Female 22.59 3.61 .. ... ... ... 239 Male 27.03 5.92 240 Female 25.18 2.00 241 Male 20.67 2.00 242 Male 15.82 1.75 243 Female 16.78 3.00 [244 rows x 3 columns]
删除一列
In [2]: tips.drop("sex", axis=1) Out[2]: total_bill tip smoker day time size 0 14.99 1.01 No Sun Dinner 2 1 8.34 1.66 No Sun Dinner 3 2 19.01 3.50 No Sun Dinner 3 3 21.68 3.31 No Sun Dinner 2 4 22.59 3.61 No Sun Dinner 4 .. ... ... ... ... ... ... 239 27.03 5.92 No Sat Dinner 3 240 25.18 2.00 Yes Sat Dinner 2 241 20.67 2.00 Yes Sat Dinner 2 242 15.82 1.75 No Sat Dinner 2 243 16.78 3.00 No Thur Dinner 2 [244 rows x 6 columns]
重命名一列
In [3]: tips.rename(columns={"total_bill": "total_bill_2"}) Out[3]: total_bill_2 tip sex smoker day time size 0 14.99 1.01 Female No Sun Dinner 2 1 8.34 1.66 Male No Sun Dinner 3 2 19.01 3.50 Male No Sun Dinner 3 3 21.68 3.31 Male No Sun Dinner 2 4 22.59 3.61 Female No Sun Dinner 4 .. ... ... ... ... ... ... ... 239 27.03 5.92 Male No Sat Dinner 3 240 25.18 2.00 Female Yes Sat Dinner 2 241 20.67 2.00 Male Yes Sat Dinner 2 242 15.82 1.75 Male No Sat Dinner 2 243 16.78 3.00 Female No Thur Dinner 2 [244 rows x 7 columns]
保留特定列
In [1]: tips[["sex", "total_bill", "tip"]] Out[1]: sex total_bill tip 0 Female 14.99 1.01 1 Male 8.34 1.66 2 Male 19.01 3.50 3 Male 21.68 3.31 4 Female 22.59 3.61 .. ... ... ... 239 Male 27.03 5.92 240 Female 25.18 2.00 241 Male 20.67 2.00 242 Male 15.82 1.75 243 Female 16.78 3.00 [244 rows x 3 columns]
删除一列
In [2]: tips.drop("sex", axis=1) Out[2]: total_bill tip smoker day time size 0 14.99 1.01 No Sun Dinner 2 1 8.34 1.66 No Sun Dinner 3 2 19.01 3.50 No Sun Dinner 3 3 21.68 3.31 No Sun Dinner 2 4 22.59 3.61 No Sun Dinner 4 .. ... ... ... ... ... ... 239 27.03 5.92 No Sat Dinner 3 240 25.18 2.00 Yes Sat Dinner 2 241 20.67 2.00 Yes Sat Dinner 2 242 15.82 1.75 No Sat Dinner 2 243 16.78 3.00 No Thur Dinner 2 [244 rows x 6 columns]
重命名一列
In [3]: tips.rename(columns={"total_bill": "total_bill_2"}) Out[3]: total_bill_2 tip sex smoker day time size 0 14.99 1.01 Female No Sun Dinner 2 1 8.34 1.66 Male No Sun Dinner 3 2 19.01 3.50 Male No Sun Dinner 3 3 21.68 3.31 Male No Sun Dinner 2 4 22.59 3.61 Female No Sun Dinner 4 .. ... ... ... ... ... ... ... 239 27.03 5.92 Male No Sat Dinner 3 240 25.18 2.00 Female Yes Sat Dinner 2 241 20.67 2.00 Male Yes Sat Dinner 2 242 15.82 1.75 Male No Sat Dinner 2 243 16.78 3.00 Female No Thur Dinner 2 [244 rows x 7 columns]
按值排序
在 SAS 中,通过PROC SORT
来实现排序
proc sort data=tips; by sex total_bill; run;
pandas 有一个DataFrame.sort_values()
方法,可以按列排序。
In [1]: tips = tips.sort_values(["sex", "total_bill"]) In [2]: tips Out[2]: total_bill tip sex smoker day time size 67 1.07 1.00 Female Yes Sat Dinner 1 92 3.75 1.00 Female Yes Fri Dinner 2 111 5.25 1.00 Female No Sat Dinner 1 145 6.35 1.50 Female No Thur Lunch 2 135 6.51 1.25 Female No Thur Lunch 2 .. ... ... ... ... ... ... ... 182 43.35 3.50 Male Yes Sun Dinner 3 156 46.17 5.00 Male No Sun Dinner 6 59 46.27 6.73 Male No Sat Dinner 4 212 46.33 9.00 Male No Sat Dinner 4 170 48.81 10.00 Male Yes Sat Dinner 3 [244 rows x 7 columns]
字符串处理
查找字符串的长度
SAS 使用 LENGTHN 和 LENGTHC 函数确定字符字符串的长度。LENGTHN
排除尾随空格,LENGTHC
包括尾随空格。
data _null_; set tips; put(LENGTHN(time)); put(LENGTHC(time)); run;
你可以使用 Series.str.len()
找到字符串的长度。在 Python 3 中,所有字符串都是 Unicode 字符串。len
包括尾随空格。使用 len
和 rstrip
排除尾随空格。
In [1]: tips["time"].str.len() Out[1]: 67 6 92 6 111 6 145 5 135 5 .. 182 6 156 6 59 6 212 6 170 6 Name: time, Length: 244, dtype: int64 In [2]: tips["time"].str.rstrip().str.len() Out[2]: 67 6 92 6 111 6 145 5 135 5 .. 182 6 156 6 59 6 212 6 170 6 Name: time, Length: 244, dtype: int64
查找子串的位置
SAS 使用 FINDW 函数确定字符串中字符的位置。FINDW
使用第一个参数定义的字符串,并搜索你提供的第二个参数作为子串的第一个位置。
data _null_; set tips; put(FINDW(sex,'ale')); run;
你可以使用 Series.str.find()
方法在字符串列中找到字符的位置。find
搜索子串的第一个位置。如果找到子串,则该方法返回其位置。如果未找到,则返回-1
。请记住,Python 的索引是从零开始的。
In [1]: tips["sex"].str.find("ale") Out[1]: 67 3 92 3 111 3 145 3 135 3 .. 182 1 156 1 59 1 212 1 170 1 Name: sex, Length: 244, dtype: int64
根据位置提取子串
SAS 使用 SUBSTR 函数根据位置从字符串中提取子串。
data _null_; set tips; put(substr(sex,1,1)); run;
使用 pandas,你可以使用[]
符号通过位置来提取字符串中的子串。请记住,Python 的索引是从零开始的。
In [1]: tips["sex"].str[0:1] Out[1]: 67 F 92 F 111 F 145 F 135 F .. 182 M 156 M 59 M 212 M 170 M Name: sex, Length: 244, dtype: object
提取第 n 个单词
SAS 的 SCAN 函数从字符串中返回第 n 个单词。第一个参数是要解析的字符串,第二个参数指定要提取的单词。
data firstlast; input String $60.; First_Name = scan(string, 1); Last_Name = scan(string, -1); datalines2; John Smith; Jane Cook; ;;; run;
在 pandas 中提取单词的最简单方法是通过空格将字符串分割,然后按索引引用单词。请注意,如果需要,还有更强大的方法。
In [1]: firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]}) In [2]: firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0] In [3]: firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[1] In [4]: firstlast Out[4]: String First_Name Last_Name 0 John Smith John Smith 1 Jane Cook Jane Cook
更改大小写
SAS 的 UPCASE、LOWCASE 和 PROPCASE 函数更改参数的大小写。
data firstlast; input String $60.; string_up = UPCASE(string); string_low = LOWCASE(string); string_prop = PROPCASE(string); datalines2; John Smith; Jane Cook; ;;; run;
等效的 pandas 方法是Series.str.upper()
、Series.str.lower()
和Series.str.title()
。
In [1]: firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]}) In [2]: firstlast["upper"] = firstlast["string"].str.upper() In [3]: firstlast["lower"] = firstlast["string"].str.lower() In [4]: firstlast["title"] = firstlast["string"].str.title() In [5]: firstlast Out[5]: string upper lower title 0 John Smith JOHN SMITH john smith John Smith 1 Jane Cook JANE COOK jane cook Jane Cook
Pandas 2.2 中文官方教程和指南(五)(4)https://developer.aliyun.com/article/1510586