Pandas 2.2 中文官方教程和指南（十六）（3）-阿里云开发者社区

Pandas 2.2 中文官方教程和指南（十六）（2）https://developer.aliyun.com/article/1509817

删除缺失数据

dropna()删除带有缺失数据的行或列。

In [83]: df = pd.DataFrame([[np.nan, 1, 2], [1, 2, np.nan], [1, 2, 3]])
In [84]: df
Out[84]: 
 0  1    2
0  NaN  1  2.0
1  1.0  2  NaN
2  1.0  2  3.0
In [85]: df.dropna()
Out[85]: 
 0  1    2
2  1.0  2  3.0
In [86]: df.dropna(axis=1)
Out[86]: 
 1
0  1
1  2
2  2
In [87]: ser = pd.Series([1, pd.NA], dtype="int64[pyarrow]")
In [88]: ser.dropna()
Out[88]: 
0    1
dtype: int64[pyarrow]

填充缺失数据

按值填充

fillna()用非 NA 数据替换 NA 值。

用标量值替换 NA 值

In [89]: data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}
In [90]: df = pd.DataFrame(data)
In [91]: df
Out[91]: 
 np  arrow
0  1.0    1.0
1  NaN   <NA>
2  NaN   <NA>
3  2.0    2.0
In [92]: df.fillna(0)
Out[92]: 
 np  arrow
0  1.0    1.0
1  0.0    0.0
2  0.0    0.0
3  2.0    2.0

向前或向后填补间隙

In [93]: df.ffill()
Out[93]: 
 np  arrow
0  1.0    1.0
1  1.0    1.0
2  1.0    1.0
3  2.0    2.0
In [94]: df.bfill()
Out[94]: 
 np  arrow
0  1.0    1.0
1  2.0    2.0
2  2.0    2.0
3  2.0    2.0

限制填充的 NA 值数量

In [95]: df.ffill(limit=1)
Out[95]: 
 np  arrow
0  1.0    1.0
1  1.0    1.0
2  NaN   <NA>
3  2.0    2.0

NA 值可以用原始对象和填充对象之间的索引和列对齐的Series或DataFrame中的相应值替换。

In [96]: dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC"))
In [97]: dff.iloc[3:5, 0] = np.nan
In [98]: dff.iloc[4:6, 1] = np.nan
In [99]: dff.iloc[5:8, 2] = np.nan
In [100]: dff
Out[100]: 
 A     B     C
0   0.0   1.0   2.0
1   3.0   4.0   5.0
2   6.0   7.0   8.0
3   NaN  10.0  11.0
4   NaN   NaN  14.0
5  15.0   NaN   NaN
6  18.0  19.0   NaN
7  21.0  22.0   NaN
8  24.0  25.0  26.0
9  27.0  28.0  29.0
In [101]: dff.fillna(dff.mean())
Out[101]: 
 A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000

注意

DataFrame.where()也可用于填充 NA 值。与上述结果相同。

In [102]: dff.where(pd.notna(dff), dff.mean(), axis="columns")
Out[102]: 
 A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000 
```### 插值
`DataFrame.interpolate()`和`Series.interpolate()`使用各种插值方法填充 NA 值。
```py
In [103]: df = pd.DataFrame(
 .....:    {
 .....:        "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
 .....:        "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
 .....:    }
 .....: )
 .....: 
In [104]: df
Out[104]: 
 A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40
In [105]: df.interpolate()
Out[105]: 
 A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40
In [106]: idx = pd.date_range("2020-01-01", periods=10, freq="D")
In [107]: data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)
In [108]: ts = pd.Series(data, index=idx)
In [109]: ts.iloc[[1, 2, 5, 6, 9]] = np.nan
In [110]: ts
Out[110]: 
2020-01-01    8.0
2020-01-02    NaN
2020-01-03    NaN
2020-01-04    2.0
2020-01-05    4.0
2020-01-06    NaN
2020-01-07    NaN
2020-01-08    0.0
2020-01-09    3.0
2020-01-10    NaN
Freq: D, dtype: float64
In [111]: ts.plot()
Out[111]: <Axes: >

In [112]: ts.interpolate()
Out[112]: 
2020-01-01    8.000000
2020-01-02    6.000000
2020-01-03    4.000000
2020-01-04    2.000000
2020-01-05    4.000000
2020-01-06    2.666667
2020-01-07    1.333333
2020-01-08    0.000000
2020-01-09    3.000000
2020-01-10    3.000000
Freq: D, dtype: float64
In [113]: ts.interpolate().plot()
Out[113]: <Axes: >

相对于DatetimeIndex中的Timestamp进行插值，可通过设置method="time"来实现。

In [114]: ts2 = ts.iloc[[0, 1, 3, 7, 9]]
In [115]: ts2
Out[115]: 
2020-01-01    8.0
2020-01-02    NaN
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    NaN
dtype: float64
In [116]: ts2.interpolate()
Out[116]: 
2020-01-01    8.0
2020-01-02    5.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64
In [117]: ts2.interpolate(method="time")
Out[117]: 
2020-01-01    8.0
2020-01-02    6.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64

对于浮点索引，请使用method='values'：

In [118]: idx = [0.0, 1.0, 10.0]
In [119]: ser = pd.Series([0.0, np.nan, 10.0], idx)
In [120]: ser
Out[120]: 
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64
In [121]: ser.interpolate()
Out[121]: 
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64
In [122]: ser.interpolate(method="values")
Out[122]: 
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64

如果您安装了scipy，您可以将一个 1-d 插值例程的名称传递给method。如在 scipy 插值文档和参考指南中指定的。适当的插值方法将取决于数据类型。

提��

如果你处理的时间序列以递增速率增长，请使用method='barycentric'。

如果您有近似累积分布函数的值，请使用method='pchip'。

为了填补缺失值以实现平滑绘图的目的，请使用method='akima'。

In [123]: df = pd.DataFrame(
 .....:   {
 .....:      "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
 .....:      "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
 .....:   }
 .....: )
 .....: 
In [124]: df
Out[124]: 
 A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40
In [125]: df.interpolate(method="barycentric")
Out[125]: 
 A       B
0  1.00   0.250
1  2.10  -7.660
2  3.53  -4.515
3  4.70   4.000
4  5.60  12.200
5  6.80  14.400
In [126]: df.interpolate(method="pchip")
Out[126]: 
 A          B
0  1.00000   0.250000
1  2.10000   0.672808
2  3.43454   1.928950
3  4.70000   4.000000
4  5.60000  12.200000
5  6.80000  14.400000
In [127]: df.interpolate(method="akima")
Out[127]: 
 A          B
0  1.000000   0.250000
1  2.100000  -0.873316
2  3.406667   0.320034
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

当通过多项式或样条近似进行插值时，您还必须指定近似的次数或阶数：

In [128]: df.interpolate(method="spline", order=2)
Out[128]: 
 A          B
0  1.000000   0.250000
1  2.100000  -0.428598
2  3.404545   1.206900
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000
In [129]: df.interpolate(method="polynomial", order=2)
Out[129]: 
 A          B
0  1.000000   0.250000
1  2.100000  -2.703846
2  3.451351  -1.453846
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

比较几种方法。

In [130]: np.random.seed(2)
In [131]: ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37))
In [132]: missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
In [133]: ser.iloc[missing] = np.nan
In [134]: methods = ["linear", "quadratic", "cubic"]
In [135]: df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})
In [136]: df.plot()
Out[136]: <Axes: >

通过Series.reindex()从扩展数据中插值新观测。

In [137]: ser = pd.Series(np.sort(np.random.uniform(size=100)))
# interpolate at new_index
In [138]: new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))
In [139]: interp_s = ser.reindex(new_index).interpolate(method="pchip")
In [140]: interp_s.loc[49:51]
Out[140]: 
49.00    0.471410
49.25    0.476841
49.50    0.481780
49.75    0.485998
50.00    0.489266
50.25    0.491814
50.50    0.493995
50.75    0.495763
51.00    0.497074
dtype: float64

插值限制

interpolate() 接受一个 limit 关键字参数，以限制自上次有效观察以来填充的连续 NaN 值的数量

In [141]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
In [142]: ser
Out[142]: 
0     NaN
1     NaN
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64
In [143]: ser.interpolate()
Out[143]: 
0     NaN
1     NaN
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64
In [144]: ser.interpolate(limit=1)
Out[144]: 
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64

默认情况下，NaN 值以 forward 方向填充。使用 limit_direction 参数以 backward 或从 both 方向填充。

In [145]: ser.interpolate(limit=1, limit_direction="backward")
Out[145]: 
0     NaN
1     5.0
2     5.0
3     NaN
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64
In [146]: ser.interpolate(limit=1, limit_direction="both")
Out[146]: 
0     NaN
1     5.0
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7    13.0
8     NaN
dtype: float64
In [147]: ser.interpolate(limit_direction="both")
Out[147]: 
0     5.0
1     5.0
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

默认情况下，NaN 值会被填充，无论它们是否被现有有效值包围或在现有有效值之外。limit_area 参数限制填充到值的内部或外部。

# fill one consecutive inside value in both directions
In [148]: ser.interpolate(limit_direction="both", limit_area="inside", limit=1)
Out[148]: 
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64
# fill all consecutive outside values backward
In [149]: ser.interpolate(limit_direction="backward", limit_area="outside")
Out[149]: 
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64
# fill all consecutive outside values in both directions
In [150]: ser.interpolate(limit_direction="both", limit_area="outside")
Out[150]: 
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7    13.0
8    13.0
dtype: float64 
```### 替换值
`Series.replace()` 和 `DataFrame.replace()` 可以类似于 `Series.fillna()` 和 `DataFrame.fillna()` 用于替换或插入缺失值。
```py
In [151]: df = pd.DataFrame(np.eye(3))
In [152]: df
Out[152]: 
 0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
In [153]: df_missing = df.replace(0, np.nan)
In [154]: df_missing
Out[154]: 
 0    1    2
0  1.0  NaN  NaN
1  NaN  1.0  NaN
2  NaN  NaN  1.0
In [155]: df_filled = df_missing.replace(np.nan, 2)
In [156]: df_filled
Out[156]: 
 0    1    2
0  1.0  2.0  2.0
1  2.0  1.0  2.0
2  2.0  2.0  1.0

通过传递列表可以替换多个值。

In [157]: df_filled.replace([1, 44], [2, 28])
Out[157]: 
 0    1    2
0  2.0  2.0  2.0
1  2.0  2.0  2.0
2  2.0  2.0  2.0

使用映射字典进行替换。

In [158]: df_filled.replace({1: 44, 2: 28})
Out[158]: 
 0     1     2
0  44.0  28.0  28.0
1  28.0  44.0  28.0
2  28.0  28.0  44.0

正则表达式替换

注意

使用以 r 字符为前缀的 Python 字符串，例如 r'hello world' 是“原始”字符串。它们在反斜杠方面具有不同的语义，与没有此前缀的字符串不同。原始字符串中的反斜杠将被解释为转义的反斜杠，例如，r'\' == '\\'。

用 NaN 替换‘.’

In [159]: d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}
In [160]: df = pd.DataFrame(d)
In [161]: df.replace(".", np.nan)
Out[161]: 
 a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

用删除周围空格的正则表达式将‘.’ 替换为 NaN

In [162]: df.replace(r"\s*\.\s*", np.nan, regex=True)
Out[162]: 
 a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

用正则表达式列表替换。

In [163]: df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True)
Out[163]: 
 a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d

用映射字典中的正则表达式替换。

In [164]: df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True)
Out[164]: 
 a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

传递使用 regex 关键字的正则表达式嵌套字典。

In [165]: df.replace({"b": {"b": r""}}, regex=True)
Out[165]: 
 a  b    c
0  0  a    a
1  1       b
2  2  .  NaN
3  3  .    d
In [166]: df.replace(regex={"b": {r"\s*\.\s*": np.nan}})
Out[166]: 
 a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d
In [167]: df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True)
Out[167]: 
 a    b    c
0  0    a    a
1  1    b    b
2  2  .ty  NaN
3  3  .ty    d

传递一组正则表达式，将匹配项替换为标量。

In [168]: df.replace([r"\s*\.\s*", r"a|b"], "placeholder", regex=True)
Out[168]: 
 a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d

所有正则表达式示例也可以作为 to_replace 参数传递给 regex 参数。在这种情况下，必须通过名称显式传递 value 参数或 regex 必须是一个嵌套字典。

In [169]: df.replace(regex=[r"\s*\.\s*", r"a|b"], value="placeholder")
Out[169]: 
 a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d

注意

也可以传递由 re.compile 创建的正则表达式对象作为有效输入。### 按值填充

fillna() 用非 NA 数据替换 NA 值。

��标量值替换 NA 值

In [89]: data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}
In [90]: df = pd.DataFrame(data)
In [91]: df
Out[91]: 
 np  arrow
0  1.0    1.0
1  NaN   <NA>
2  NaN   <NA>
3  2.0    2.0
In [92]: df.fillna(0)
Out[92]: 
 np  arrow
0  1.0    1.0
1  0.0    0.0
2  0.0    0.0
3  2.0    2.0

向前或向后填充间隙

In [93]: df.ffill()
Out[93]: 
 np  arrow
0  1.0    1.0
1  1.0    1.0
2  1.0    1.0
3  2.0    2.0
In [94]: df.bfill()
Out[94]: 
 np  arrow
0  1.0    1.0
1  2.0    2.0
2  2.0    2.0
3  2.0    2.0

限制填充的 NA 值数量

In [95]: df.ffill(limit=1)
Out[95]: 
 np  arrow
0  1.0    1.0
1  1.0    1.0
2  NaN   <NA>
3  2.0    2.0

可以用 Series 或 DataFrame 中对应值替换 NA 值，其中原始对象和填充对象之间的索引和列对齐。

In [96]: dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC"))
In [97]: dff.iloc[3:5, 0] = np.nan
In [98]: dff.iloc[4:6, 1] = np.nan
In [99]: dff.iloc[5:8, 2] = np.nan
In [100]: dff
Out[100]: 
 A     B     C
0   0.0   1.0   2.0
1   3.0   4.0   5.0
2   6.0   7.0   8.0
3   NaN  10.0  11.0
4   NaN   NaN  14.0
5  15.0   NaN   NaN
6  18.0  19.0   NaN
7  21.0  22.0   NaN
8  24.0  25.0  26.0
9  27.0  28.0  29.0
In [101]: dff.fillna(dff.mean())
Out[101]: 
 A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000

注意

DataFrame.where() 也可用于填充 NA 值。结果与上述相同。

In [102]: dff.where(pd.notna(dff), dff.mean(), axis="columns")
Out[102]: 
 A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000

Pandas 2.2 中文官方教程和指南（十六）（4）https://developer.aliyun.com/article/1509819

Pandas 2.2 中文官方教程和指南（十六）（3）

删除缺失数据

填充缺失数据

按值填充

插值限制

正则表达式替换

热门文章

最新文章

相关课程

相关电子书

相关实验场景

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Pandas 2.2 中文官方教程和指南（十六）（3）

删除缺失数据

填充缺失数据

按值填充

插值限制

正则表达式替换

热门文章

最新文章

相关课程

相关电子书

相关实验场景