Operations on columns

In Stata, arbitrary math expressions can be used with the generate and replace commands on new or existing columns. The drop command drops the column from the data set.

replace total_bill = total_bill - 2
generate new_bill = total_bill / 2
drop new_bill

pandas provides vectorized operations by specifying the individual Series in the DataFrame. New columns can be assigned in the same way. The DataFrame.drop() method drops a column from the DataFrame.

In [9]: tips["total_bill"] = tips["total_bill"] - 2

In [10]: tips["new_bill"] = tips["total_bill"] / 2

In [11]: tips
Out[11]: 
     total_bill   tip     sex smoker   day    time  size  new_bill
0         14.99  1.01  Female     No   Sun  Dinner     2     7.495
1          8.34  1.66    Male     No   Sun  Dinner     3     4.170
2         19.01  3.50    Male     No   Sun  Dinner     3     9.505
3         21.68  3.31    Male     No   Sun  Dinner     2    10.840
4         22.59  3.61  Female     No   Sun  Dinner     4    11.295
..          ...   ...     ...    ...   ...     ...   ...       ...
239       27.03  5.92    Male     No   Sat  Dinner     3    13.515
240       25.18  2.00  Female    Yes   Sat  Dinner     2    12.590
241       20.67  2.00    Male    Yes   Sat  Dinner     2    10.335
242       15.82  1.75    Male     No   Sat  Dinner     2     7.910
243       16.78  3.00  Female     No  Thur  Dinner     2     8.390

[244 rows x 8 columns]

In [12]: tips = tips.drop("new_bill", axis=1)
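The column operations above can also be chained without touching the original frame. The sketch below is only an illustration: it uses a three-row stand-in for the tips data (the values are made up for the example) and DataFrame.assign(), which returns a new DataFrame rather than modifying the existing one.

```python
import pandas as pd

# Hypothetical three-row stand-in for the tips data set.
tips = pd.DataFrame(
    {"total_bill": [16.99, 10.34, 21.01], "tip": [1.01, 1.66, 3.50]}
)

# Equivalent of `replace total_bill = total_bill - 2`: overwrite the column.
tips["total_bill"] = tips["total_bill"] - 2

# Equivalent of `generate new_bill = total_bill / 2`, written with assign(),
# which returns a new DataFrame and leaves `tips` itself unchanged.
with_new = tips.assign(new_bill=tips["total_bill"] / 2)

# Equivalent of `drop new_bill`; drop(columns=...) is the same as axis=1.
without_new = with_new.drop(columns="new_bill")

print(with_new)
print(without_new.columns.tolist())
```

Whether to assign in place or use assign() is a matter of style; assign() tends to be convenient when several steps are chained.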
Filtering

In Stata, filtering is done with an if clause on one or more columns.

list if total_bill > 10

DataFrames can be filtered in multiple ways; the most intuitive is boolean indexing.

In [13]: tips[tips["total_bill"] > 10]
Out[13]: 
     total_bill   tip     sex smoker   day    time  size
0         14.99  1.01  Female     No   Sun  Dinner     2
2         19.01  3.50    Male     No   Sun  Dinner     3
3         21.68  3.31    Male     No   Sun  Dinner     2
4         22.59  3.61  Female     No   Sun  Dinner     4
5         23.29  4.71    Male     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       27.03  5.92    Male     No   Sat  Dinner     3
240       25.18  2.00  Female    Yes   Sat  Dinner     2
241       20.67  2.00    Male    Yes   Sat  Dinner     2
242       15.82  1.75    Male     No   Sat  Dinner     2
243       16.78  3.00  Female     No  Thur  Dinner     2

[204 rows x 7 columns]

The statement above simply passes a Series of True/False values to the DataFrame, which returns every row where the value is True.

In [14]: is_dinner = tips["time"] == "Dinner"

In [15]: is_dinner
Out[15]: 
0      True
1      True
2      True
3      True
4      True
       ... 
239    True
240    True
241    True
242    True
243    True
Name: time, Length: 244, dtype: bool

In [16]: is_dinner.value_counts()
Out[16]: 
time
True     176
False     68
Name: count, dtype: int64

In [17]: tips[is_dinner]
Out[17]: 
     total_bill   tip     sex smoker   day    time  size
0         14.99  1.01  Female     No   Sun  Dinner     2
1          8.34  1.66    Male     No   Sun  Dinner     3
2         19.01  3.50    Male     No   Sun  Dinner     3
3         21.68  3.31    Male     No   Sun  Dinner     2
4         22.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       27.03  5.92    Male     No   Sat  Dinner     3
240       25.18  2.00  Female    Yes   Sat  Dinner     2
241       20.67  2.00    Male    Yes   Sat  Dinner     2
242       15.82  1.75    Male     No   Sat  Dinner     2
243       16.78  3.00  Female     No  Thur  Dinner     2

[176 rows x 7 columns]
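Boolean masks can be combined, which roughly corresponds to joining conditions in a Stata if clause. The following is a minimal sketch on a made-up four-row frame (not the real tips data); it combines masks with the element-wise operators & and ~, and shows DataFrame.query() as one alternative spelling.

```python
import pandas as pd

# Hypothetical four-row stand-in for the tips data set.
tips = pd.DataFrame(
    {
        "total_bill": [16.99, 8.34, 21.01, 9.50],
        "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
    }
)

is_dinner = tips["time"] == "Dinner"
is_large = tips["total_bill"] > 10

# Element-wise AND of two masks. Use & and |, not `and` / `or`.
large_dinners = tips[is_dinner & is_large]

# Element-wise NOT of a mask.
not_dinner = tips[~is_dinner]

# The same AND filter written as a query string.
same_rows = tips.query("time == 'Dinner' and total_bill > 10")

print(large_dinners)
print(not_dinner)
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, for example tips[(tips["time"] == "Dinner") & (tips["total_bill"] > 10)], because & binds more tightly than == and >.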
If/then logic

In Stata, an if clause can also be used to create new columns.

generate bucket = "low" if total_bill < 10
replace bucket = "high" if total_bill >= 10

The same operation in pandas can be accomplished using the where function from numpy.

In [18]: tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")

In [19]: tips
Out[19]: 
     total_bill   tip     sex smoker   day    time  size bucket
0         14.99  1.01  Female     No   Sun  Dinner     2   high
1          8.34  1.66    Male     No   Sun  Dinner     3    low
2         19.01  3.50    Male     No   Sun  Dinner     3   high
3         21.68  3.31    Male     No   Sun  Dinner     2   high
4         22.59  3.61  Female     No   Sun  Dinner     4   high
..          ...   ...     ...    ...   ...     ...   ...    ...
239       27.03  5.92    Male     No   Sat  Dinner     3   high
240       25.18  2.00  Female    Yes   Sat  Dinner     2   high
241       20.67  2.00    Male    Yes   Sat  Dinner     2   high
242       15.82  1.75    Male     No   Sat  Dinner     2   high
243       16.78  3.00  Female     No  Thur  Dinner     2   high

[244 rows x 8 columns]
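np.where covers the two-way case. For more than two buckets, one option is numpy's select function, which picks the first matching condition for each row. The sketch below uses a made-up three-row frame, and the "mid" bucket and its threshold of 20 are assumptions added purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the tips data set.
tips = pd.DataFrame({"total_bill": [16.99, 8.34, 35.00]})

# Two-way split, as in the text: "low" below 10, otherwise "high".
tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")

# Three-way split (the "mid" bucket and the threshold 20 are illustrative).
# Conditions are checked in order; the first one that is True wins.
conditions = [
    (tips["total_bill"] < 10).to_numpy(),
    (tips["total_bill"] < 20).to_numpy(),
]
choices = ["low", "mid"]
tips["bucket3"] = np.select(conditions, choices, default="high")

print(tips)
```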
Date functionality

Stata provides a variety of functions for operating on date/time columns.

generate date1 = mdy(1, 15, 2013)
generate date2 = date("Feb152015", "MDY")

generate date1_year = year(date1)
generate date2_month = month(date2)

* shift date to beginning of next month
generate date1_next = mdy(month(date1) + 1, 1, year(date1)) if month(date1) != 12
replace date1_next = mdy(1, 1, year(date1) + 1) if month(date1) == 12
generate months_between = mofd(date2) - mofd(date1)

list date1 date2 date1_year date2_month date1_next months_between

The equivalent pandas operations are shown below. In addition to these functions, pandas supports other time series features that are not available in Stata, such as time zone handling and custom offsets. See the time series documentation for more details.

In [20]: tips["date1"] = pd.Timestamp("2013-01-15")

In [21]: tips["date2"] = pd.Timestamp("2015-02-15")

In [22]: tips["date1_year"] = tips["date1"].dt.year

In [23]: tips["date2_month"] = tips["date2"].dt.month

In [24]: tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()

In [25]: tips["months_between"] = tips["date2"].dt.to_period("M") - tips[
   ....:     "date1"
   ....: ].dt.to_period("M")
   ....: 

In [26]: tips[
   ....:     ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
   ....: ]
   ....: 
Out[26]: 
         date1      date2  date1_year  date2_month date1_next    months_between
0   2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
1   2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
2   2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
3   2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
4   2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
..         ...        ...         ...          ...        ...               ...
239 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
240 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
241 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
242 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
243 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>

[244 rows x 6 columns]
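In the output above, months_between holds offset objects (25 * MonthEnds) rather than plain numbers. If an integer month count is wanted, as Stata's mofd() difference would give, one possible approach is to compute it from the year and month components. The sketch below assumes a small made-up frame with two date columns; the second row is chosen to show the December rollover that the Stata code handles with a separate replace.

```python
import pandas as pd

# Hypothetical two-row frame; the dates are made up for illustration.
df = pd.DataFrame(
    {
        "date1": pd.to_datetime(["2013-01-15", "2013-12-01"]),
        "date2": pd.to_datetime(["2015-02-15", "2014-01-31"]),
    }
)

# Shift to the first day of the next month. MonthBegin handles the
# December-to-January rollover, and a date that already falls on the
# first of a month still moves forward to the following month.
df["date1_next"] = df["date1"] + pd.offsets.MonthBegin()

# Whole calendar months between the two dates, as plain integers.
df["months_between"] = (df["date2"].dt.year - df["date1"].dt.year) * 12 + (
    df["date2"].dt.month - df["date1"].dt.month
)

print(df)
```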
Selection of columns

Stata provides keywords to select, drop, and rename columns.

keep sex total_bill tip

drop sex

rename total_bill total_bill_2

The same operations are expressed below in pandas.

Keep certain columns

In [27]: tips[["sex", "total_bill", "tip"]]
Out[27]: 
        sex  total_bill   tip
0    Female       14.99  1.01
1      Male        8.34  1.66
2      Male       19.01  3.50
3      Male       21.68  3.31
4    Female       22.59  3.61
..      ...         ...   ...
239    Male       27.03  5.92
240  Female       25.18  2.00
241    Male       20.67  2.00
242    Male       15.82  1.75
243  Female       16.78  3.00

[244 rows x 3 columns]

Drop a column

In [28]: tips.drop("sex", axis=1)
Out[28]: 
     total_bill   tip smoker   day    time  size
0         14.99  1.01     No   Sun  Dinner     2
1          8.34  1.66     No   Sun  Dinner     3
2         19.01  3.50     No   Sun  Dinner     3
3         21.68  3.31     No   Sun  Dinner     2
4         22.59  3.61     No   Sun  Dinner     4
..          ...   ...    ...   ...     ...   ...
239       27.03  5.92     No   Sat  Dinner     3
240       25.18  2.00    Yes   Sat  Dinner     2
241       20.67  2.00    Yes   Sat  Dinner     2
242       15.82  1.75     No   Sat  Dinner     2
243       16.78  3.00     No  Thur  Dinner     2

[244 rows x 6 columns]

Rename a column

In [29]: tips.rename(columns={"total_bill": "total_bill_2"})
Out[29]: 
     total_bill_2   tip     sex smoker   day    time  size
0           14.99  1.01  Female     No   Sun  Dinner     2
1            8.34  1.66    Male     No   Sun  Dinner     3
2           19.01  3.50    Male     No   Sun  Dinner     3
3           21.68  3.31    Male     No   Sun  Dinner     2
4           22.59  3.61  Female     No   Sun  Dinner     4
..            ...   ...     ...    ...   ...     ...   ...
239         27.03  5.92    Male     No   Sat  Dinner     3
240         25.18  2.00  Female    Yes   Sat  Dinner     2
241         20.67  2.00    Male    Yes   Sat  Dinner     2
242         15.82  1.75    Male     No   Sat  Dinner     2
243         16.78  3.00  Female     No  Thur  Dinner     2

[244 rows x 7 columns]
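One difference from Stata is worth spelling out: keep, drop, and rename change the data set in memory, whereas the pandas calls shown here return a new DataFrame and leave tips untouched unless the result is assigned back. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Hypothetical two-row stand-in for the tips data set.
tips = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34],
        "tip": [1.01, 1.66],
        "sex": ["Female", "Male"],
    }
)

# Each call returns a new DataFrame; `tips` keeps all three columns.
kept = tips[["sex", "total_bill"]]
dropped = tips.drop(columns="sex")  # same as tips.drop("sex", axis=1)
renamed = tips.rename(columns={"total_bill": "total_bill_2"})

print(tips.columns.tolist())
print(kept.columns.tolist())
print(dropped.columns.tolist())
print(renamed.columns.tolist())
```

To mirror Stata's behavior of changing the working data set, assign the result back, for example tips = tips.drop(columns="sex").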
Sorting by values

Sorting in Stata is accomplished with sort.

sort sex total_bill

pandas has a DataFrame.sort_values() method, which takes a list of columns to sort by.

In [30]: tips = tips.sort_values(["sex", "total_bill"])

In [31]: tips
Out[31]: 
     total_bill    tip     sex smoker   day    time  size
67         1.07   1.00  Female    Yes   Sat  Dinner     1
92         3.75   1.00  Female    Yes   Fri  Dinner     2
111        5.25   1.00  Female     No   Sat  Dinner     1
145        6.35   1.50  Female     No  Thur   Lunch     2
135        6.51   1.25  Female     No  Thur   Lunch     2
..          ...    ...     ...    ...   ...     ...   ...
182       43.35   3.50    Male    Yes   Sun  Dinner     3
156       46.17   5.00    Male     No   Sun  Dinner     6
59        46.27   6.73    Male     No   Sat  Dinner     4
212       46.33   9.00    Male     No   Sat  Dinner     4
170       48.81  10.00    Male    Yes   Sat  Dinner     3

[244 rows x 7 columns]
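sort_values() also accepts a per-column sort direction, and it keeps the original row labels unless told otherwise (hence the index 67, 92, 111 in the output above). The sketch below uses a made-up four-row frame; the mixed-direction sort is roughly what Stata's gsort sex -total_bill would do.

```python
import pandas as pd

# Hypothetical four-row stand-in for the tips data set.
tips = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Male", "Female"],
        "total_bill": [20.0, 5.0, 30.0, 15.0],
    }
)

# Ascending by sex, descending by total_bill within each sex.
mixed = tips.sort_values(["sex", "total_bill"], ascending=[True, False])

# ignore_index=True relabels the sorted rows 0..n-1 instead of
# carrying the original index along.
renumbered = tips.sort_values(["sex", "total_bill"], ignore_index=True)

print(mixed)
print(renumbered)
```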
String processing

Finding length of string

Stata determines the length of a character string with the strlen() and ustrlen() functions for ASCII and Unicode strings, respectively.

generate strlen_time = strlen(time)
generate ustrlen_time = ustrlen(time)

You can find the length of a character string with Series.str.len(). In Python 3, all strings are Unicode strings. len includes trailing blanks. Use len and rstrip to exclude trailing blanks.

In [32]: tips["time"].str.len()
Out[32]: 
67     6
92     6
111    6
145    5
135    5
      ..
182    6
156    6
59     6
212    6
170    6
Name: time, Length: 244, dtype: int64

In [33]: tips["time"].str.rstrip().str.len()
Out[33]: 
67     6
92     6
111    6
145    5
135    5
      ..
182    6
156    6
59     6
212    6
170    6
Name: time, Length: 244, dtype: int64
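In the tips data the two outputs are identical because the time values contain no trailing blanks or non-ASCII characters. The sketch below uses made-up strings where both cases occur. The byte count in the last step is included as a rough analogue of what Stata's strlen() reports for UTF-8 text; that correspondence is an assumption of this example, not something stated in the guide.

```python
import pandas as pd

# Made-up values: one plain, one with two trailing blanks, one non-ASCII.
s = pd.Series(["Dinner", "Lunch  ", "caf\u00e9"])

# Number of characters, trailing blanks included.
chars = s.str.len()

# Number of characters after stripping trailing blanks.
chars_stripped = s.str.rstrip().str.len()

# Number of bytes in the UTF-8 encoding; differs from the character
# count only for the non-ASCII string.
utf8_bytes = s.map(lambda value: len(value.encode("utf-8")))

print(chars.tolist())
print(chars_stripped.tolist())
print(utf8_bytes.tolist())
```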
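Finding position of substring

Stata determines the position of a character in a string with the strpos() function. It takes the string given as the first argument and searches it for the first position of the substring supplied as the second argument.

generate str_position = strpos(sex, "ale")

You can find the position of a character in a column of strings with the Series.str.find() method. find searches for the first position of the substring. If the substring is found, the method returns its position. If it is not found, the method returns -1. Keep in mind that Python indexes are zero-based.

In [34]: tips["sex"].str.find("ale")
Out[34]: 
67     3
92     3
111    3
145    3
135    3
      ..
182    1
156    1
59     1
212    1
170    1
Name: sex, Length: 244, dtype: int64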
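Because find is zero-based and returns -1 on a miss, while Stata's strpos() is one-based and returns 0 on a miss, adding 1 to the pandas result reproduces the Stata numbers in both cases. A minimal sketch on made-up values ("Other" is included only to show the not-found case):

```python
import pandas as pd

# Made-up values; "Other" does not contain the substring "ale".
sex = pd.Series(["Female", "Male", "Other"])

# Zero-based position, -1 when the substring is absent.
zero_based = sex.str.find("ale")

# One-based position, 0 when the substring is absent (Stata convention).
one_based = zero_based + 1

print(zero_based.tolist())
print(one_based.tolist())
```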