Pandas 2.2 Official Tutorials and Guides in Chinese (Part 23) (2) - Alibaba Cloud Developer Community

Pandas 2.2 Official Tutorials and Guides in Chinese (Part 23) (1): https://developer.aliyun.com/article/1508852

`pandas.eval()` engines

There are two different expression engines.

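The 'numexpr' engine is the more performant engine and can yield performance improvements over standard Python syntax for large DataFrame objects. This engine requires the optional dependency numexpr to be installed.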

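The 'python' engine is generally not useful except for testing other evaluation engines against it. You will achieve no performance benefit from using eval() with engine='python', and it may incur a performance hit.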

In [40]: %timeit df1 + df2 + df3 + df4
7.42 ms +- 81.8 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

In [41]: %timeit pd.eval("df1 + df2 + df3 + df4", engine="python")
8.11 ms +- 161 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
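
As a quick sanity check on the point above, the following sketch confirms that the 'python' engine returns the same values as ordinary arithmetic, so it changes only how the expression is evaluated. The frame sizes, seed, and variable names are illustrative assumptions, and `local_dict` is passed explicitly only so that the snippet does not depend on where it is run.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
df1, df2 = (pd.DataFrame(rng.standard_normal((5, 3))) for _ in range(2))

# Ordinary pandas arithmetic.
plain = df1 + df2

# The same expression routed through pandas.eval() with the 'python' engine.
via_python_engine = pd.eval(
    "df1 + df2",
    engine="python",
    local_dict={"df1": df1, "df2": df2},
)

same_values = np.allclose(plain.to_numpy(), via_python_engine.to_numpy())
print(same_values)
```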

The `DataFrame.eval()` method

In addition to the top-level pandas.eval() function, you can also evaluate an expression in the "context" of a DataFrame.

In [42]: df = pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"])
In [43]: df.eval("a + b")
Out[43]: 
0   -0.161099
1    0.805452
2    0.747447
3    1.189042
4   -2.057490
dtype: float64

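Any expression that is a valid pandas.eval() expression is also a valid DataFrame.eval() expression, with the added benefit that you do not have to prefix the columns you are evaluating with the name of the DataFrame.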
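
To make the difference concrete, here is a minimal sketch that evaluates the same sum both ways. The small integer frame and the names `prefixed` and `unprefixed` are illustrative, and `local_dict` is passed to pandas.eval() only to make the name lookup explicit.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Top-level pandas.eval(): columns are reached through the frame's name.
prefixed = pd.eval("df.a + df.b", local_dict={"df": df})

# DataFrame.eval(): column names are resolved against the frame itself.
unprefixed = df.eval("a + b")

print(prefixed.tolist())
print(unprefixed.tolist())
```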

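In addition, you can perform assignment of columns within an expression. This allows for formulaic evaluation. The assignment target can be a new column name or an existing column name, and it must be a valid Python identifier.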

In [44]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
In [45]: df = df.eval("c = a + b")
In [46]: df = df.eval("d = a + b + c")
In [47]: df = df.eval("a = 1")
In [48]: df
Out[48]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

A copy of the DataFrame with the new or modified columns is returned, and the original frame is unchanged.

In [49]: df
Out[49]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26
In [50]: df.eval("e = a - c")
Out[50]: 
   a  b   c   d   e
0  1  5   5  10  -4
1  1  6   7  14  -6
2  1  7   9  18  -8
3  1  8  11  22 -10
4  1  9  13  26 -12
In [51]: df
Out[51]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26
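
The copy-versus-original behaviour can be checked directly. The sketch below also uses the `inplace` keyword of DataFrame.eval() (default False), which the session above does not use; it is shown here only as a contrast, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5, 10)})

# Default behaviour: a new frame comes back and `df` keeps its two columns.
with_c = df.eval("c = a + b")
columns_before = list(df.columns)
print(columns_before, list(with_c.columns))

# With inplace=True the frame itself is modified and the call returns None.
returned = df.eval("c = a + b", inplace=True)
print(returned, list(df.columns))
```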

Multiple column assignments can be performed by using a multi-line string.

In [52]: df.eval(
 ....: """
 ....: c = a + b
 ....: d = a + b + c
 ....: a = 1""",
 ....: )
 ....: 
Out[52]: 
   a  b   c   d
0  1  5   6  12
1  1  6   7  14
2  1  7   8  16
3  1  8   9  18
4  1  9  10  20

The equivalent in standard Python would be

In [53]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
In [54]: df["c"] = df["a"] + df["b"]
In [55]: df["d"] = df["a"] + df["b"] + df["c"]
In [56]: df["a"] = 1
In [57]: df
Out[57]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26
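
To check that equivalence directly, the sketch below starts from a fresh frame, runs the multi-line expression, and compares the result with step-by-step assignment. Starting fresh matters: `In [52]` above ran on a frame whose `a` column had already been set to 1, which is why its `c` and `d` values differ from `Out[57]`. The names `base`, `multi`, and `step` are illustrative.

```python
import pandas as pd

base = pd.DataFrame({"a": range(5), "b": range(5, 10)})

# One multi-line expression; each line sees the columns assigned before it.
multi = base.eval(
    """
c = a + b
d = a + b + c
a = 1
"""
)

# The same three assignments written as ordinary pandas statements.
step = base.copy()
step["c"] = step["a"] + step["b"]
step["d"] = step["a"] + step["b"] + step["c"]
step["a"] = 1

print(multi.to_numpy().tolist() == step.to_numpy().tolist())
```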

`eval()` performance comparison

pandas.eval() works well with expressions containing large arrays.

In [58]: nrows, ncols = 20000, 100
In [59]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]

DataFrame arithmetic:

In [60]: %timeit df1 + df2 + df3 + df4
7.34 ms +- 117 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

In [61]: %timeit pd.eval("df1 + df2 + df3 + df4")
2.85 ms +- 58.8 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

DataFrame comparison:

In [62]: %timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)
5.98 ms +- 37 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

In [63]: %timeit pd.eval("(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)")
9.38 ms +- 36.7 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

DataFrame arithmetic with unaligned axes:

In [64]: s = pd.Series(np.random.randn(50))
In [65]: %timeit df1 + df2 + df3 + df4 + s
12.6 ms +- 105 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

In [66]: %timeit pd.eval("df1 + df2 + df3 + df4 + s")
3.69 ms +- 62 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
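
Outside IPython, a comparable measurement can be sketched with the standard-library `timeit` module instead of `%timeit`. The frame size below is deliberately smaller than the tutorial's so the script runs quickly, so the timings will not match the figures above; the helper names `plain` and `with_eval` are hypothetical.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
nrows, ncols = 5_000, 50  # smaller than the tutorial's frames to keep the run short
frames = {
    f"df{i}": pd.DataFrame(rng.standard_normal((nrows, ncols))) for i in range(1, 5)
}
df1, df2, df3, df4 = frames.values()


def plain():
    return df1 + df2 + df3 + df4


def with_eval():
    # local_dict makes the name lookup explicit instead of relying on the caller's frame.
    return pd.eval("df1 + df2 + df3 + df4", local_dict=frames)


# Best of a few repeats, in seconds per call; absolute numbers depend on the machine.
timings = {
    name: min(timeit.repeat(func, number=5, repeat=3)) / 5
    for name, func in [("plain", plain), ("pd.eval", with_eval)]
}
print(timings)
```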

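Note

Operations such as

1 and 2  # would parse to 1 & 2, but should evaluate to 2
3 or 4  # would parse to 3 | 4, but should evaluate to 3
~1  # this is okay, but slower when using eval

should be performed in Python. An exception will be raised if you try to perform any boolean/bitwise operations with scalar operands that are not of type bool or np.bool_.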
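
The reason these rewrites matter is easy to see in plain Python: `and` and `or` return one of their operands, while `&` and `|` combine the bits of both, so the two forms give different scalar results.

```python
# Python's `and` / `or` return one of their operands ...
logical_and = 1 and 2
logical_or = 3 or 4
print(logical_and, logical_or)  # 2 3

# ... whereas the bitwise operators combine the bits of both operands.
bitwise_and = 1 & 2
bitwise_or = 3 | 4
bitwise_not = ~1
print(bitwise_and, bitwise_or, bitwise_not)  # 0 7 -2
```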

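The original documentation includes a plot here showing the running time of pandas.eval() as a function of the size of the frame involved in the computation. Its two lines represent the two engines.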

You will only see the performance benefits of using the numexpr engine with pandas.eval() if your DataFrame has more than approximately 100,000 rows.

This plot was created using a DataFrame with 3 columns, each containing floating point values generated using numpy.random.randn().
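
A rough way to explore that threshold on your own machine is sketched below. The expression being timed, the three frame sizes, and the helper `engine_available` are assumptions made for illustration; the text does not say which expression produced the plot, so the numbers here are not expected to reproduce it.

```python
import timeit

import numpy as np
import pandas as pd


def engine_available(engine):
    # pandas raises ImportError when engine="numexpr" is requested without a usable numexpr.
    try:
        pd.eval("s + 1", engine=engine, local_dict={"s": pd.Series([1.0])})
    except ImportError:
        return False
    return True


engines = [e for e in ("python", "numexpr") if engine_available(e)]

rng = np.random.default_rng(0)
results = {}
for nrows in (1_000, 100_000, 500_000):  # sizes on both sides of ~100,000 rows
    df = pd.DataFrame(rng.standard_normal((nrows, 3)), columns=["a", "b", "c"])
    for engine in engines:
        total = timeit.timeit(
            lambda: pd.eval("df.a + df.b + df.c", engine=engine, local_dict={"df": df}),
            number=3,
        )
        results[(nrows, engine)] = total / 3  # mean seconds per call

for key, seconds in results.items():
    print(key, f"{seconds * 1e3:.3f} ms")
```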

Expression evaluation limitations with `numexpr`

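Expressions that would result in an object dtype, or that involve datetime operations (because of NaT), must be evaluated in Python space, but part of an expression can still be evaluated with numexpr. For example: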

In [67]: df = pd.DataFrame(
 ....:    {"strings": np.repeat(list("cba"), 3), "nums": np.repeat(range(3), 3)}
 ....: )
 ....: 
In [68]: df
Out[68]: 
  strings  nums
0       c     0
1       c     0
2       c     0
3       b     1
4       b     1
5       b     1
6       a     2
7       a     2
8       a     2
In [69]: df.query("strings == 'a' and nums == 1")
Out[69]: 
Empty DataFrame
Columns: [strings, nums]
Index: []
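
The empty result above can be cross-checked against an ordinary boolean mask. This is a minimal sketch; `via_query` and `via_mask` are illustrative names, and the second query (`nums == 2`) is an added example rather than part of the original session.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"strings": np.repeat(list("cba"), 3), "nums": np.repeat(range(3), 3)}
)

# The query from the session: no row has both strings == 'a' and nums == 1.
via_query = df.query("strings == 'a' and nums == 1")

# Equivalent boolean mask written out in plain pandas.
via_mask = df[(df["strings"] == "a") & (df["nums"] == 1)]
print(via_query.empty, via_mask.empty)

# A combination that does occur in the frame selects rows 6, 7 and 8.
matching = df.query("strings == 'a' and nums == 2")
print(matching.index.tolist())
```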

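The numeric part of the comparison (nums == 1) will be evaluated by numexpr, and the object part of the comparison ("strings == 'a'") will be evaluated by Python.

Cython (writing C extensions for pandas)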

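For many use cases, writing pandas in pure Python and NumPy is sufficient. In some computationally heavy applications, however, it can be possible to achieve sizable speed-ups by offloading work to cython.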

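This tutorial assumes you have refactored as much as possible in Python, for example by trying to remove for loops and making use of NumPy vectorization. It is always worth optimising in Python first.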
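
As one illustration of that kind of Python-level refactoring, the sketch below replaces an explicit loop with a single NumPy expression. The sum-of-squares computation and both function names are hypothetical and chosen only because the loop and vectorized forms are easy to compare.

```python
import numpy as np


def sum_of_squares_loop(values):
    # Explicit Python loop: one interpreter round-trip per element.
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    # Same computation as a single NumPy expression over the whole array.
    arr = np.asarray(values, dtype=float)
    return float(np.dot(arr, arr))


values = np.linspace(0.0, 1.0, 10_001)
loop_result = sum_of_squares_loop(values)
vectorized_result = sum_of_squares_vectorized(values)
print(loop_result, vectorized_result)
```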

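This tutorial walks through a "typical" process of cythonizing a slow computation. We use an example from the Cython documentation, but in the context of pandas. Our final cythonized solution is around 100 times faster than the pure Python solution.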

Pure Python

We have a DataFrame to which we want to apply a function row-wise.

In [1]: df = pd.DataFrame(
 ...:    {
 ...:        "a": np.random.randn(1000),
 ...:        "b": np.random.randn(1000),
 ...:        "N": np.random.randint(100, 1000, (1000)),
 ...:        "x": "x",
 ...:    }
 ...: )
 ...: 
In [2]: df
Out[2]: 
            a         b    N  x
0    0.469112 -0.218470  585  x
1   -0.282863 -0.061645  841  x
2   -1.509059 -0.723780  251  x
3   -1.135632  0.551225  972  x
4    1.212112 -0.497767  181  x
..        ...       ...  ... ..
995 -1.512743  0.874737  374  x
996  0.933753  1.120790  246  x
997 -0.308013  0.198768  157  x
998 -0.079915  1.757555  977  x
999 -1.010589 -1.115680  770  x
[1000 rows x 4 columns]

Here's the function in pure Python:

In [3]: def f(x):
 ...:    return x * (x - 1)
 ...: 
In [4]: def integrate_f(a, b, N):
 ...:    s = 0
 ...:    dx = (b - a) / N
 ...:    for i in range(N):
 ...:        s += f(a + i * dx)
 ...:    return s * dx
 ...:
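
For orientation, `integrate_f` is a left Riemann sum: it approximates the integral of `f(x) = x * (x - 1)` over `[a, b]` with `N` equal-width slices. The sketch below compares it with the closed-form value; the helper `exact_integral` and the chosen interval are additions for illustration only.

```python
def f(x):
    return x * (x - 1)


def integrate_f(a, b, N):
    # Left Riemann sum of f over [a, b] using N equal-width slices.
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def exact_integral(a, b):
    # Antiderivative of x * (x - 1) = x**2 - x is x**3 / 3 - x**2 / 2.
    def antiderivative(x):
        return x**3 / 3 - x**2 / 2

    return antiderivative(b) - antiderivative(a)


approx = integrate_f(0.0, 3.0, 100_000)
exact = exact_integral(0.0, 3.0)
print(approx, exact, abs(approx - exact))
```

The gap between the two values shrinks as `N` grows, which is why the per-row cost in the tutorial scales with the `N` column.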

We achieve our result by using DataFrame.apply() (row-wise):

In [5]: %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
74.9 ms +- 728 us per loop (mean +- std. dev. of 7 runs, 10 loops each)

Let's take a look and see where the time is spent during this operation, using the prun IPython magic function:

# most time consuming 4 calls
In [6]: %prun -l 4 df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)  # noqa E999
 605956 function calls (605938 primitive calls) in 0.167 seconds
 Ordered by: internal time
 List reduced from 163 to 4 due to restriction <4>
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1000    0.097    0.000    0.148    0.000 <ipython-input-4-c2a74e076cf0>:1(integrate_f)
 552423    0.051    0.000    0.051    0.000 <ipython-input-3-c138bdd570e3>:1(f)
 3000    0.003    0.000    0.012    0.000 series.py:1095(__getitem__)
 3000    0.002    0.000    0.005    0.000 series.py:1220(_get_value)
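
Outside IPython, a similar report can be produced with the standard-library `cProfile` and `pstats` modules. This is a sketch rather than an exact replacement for `%prun`: the frame is smaller than the tutorial's, the seed is arbitrary, and the report is captured in a string only so it can be inspected afterwards.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def f(x):
    return x * (x - 1)


def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "a": rng.standard_normal(200),
        "b": rng.standard_normal(200),
        "N": rng.integers(100, 1000, 200),
        # As in the tutorial's frame, a string column keeps each row at object
        # dtype, so N is not upcast to float when rows are built.
        "x": "x",
    }
)

profiler = cProfile.Profile()
profiler.enable()
df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
profiler.disable()

# Sort by internal time and keep the four most expensive entries, like `%prun -l 4`.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("tottime").print_stats(4)
report = buffer.getvalue()
print(report)
```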

By far the majority of time is spent inside either integrate_f or f, hence we'll concentrate our efforts on cythonizing these two functions.

Plain Cython

First we need to import the Cython magic function into IPython:

In [7]: %load_ext Cython

Now, let's simply copy our functions over to Cython:

In [8]: %%cython
 ...: def f_plain(x):
 ...:    return x * (x - 1)
 ...: def integrate_f_plain(a, b, N):
 ...:    s = 0
 ...:    dx = (b - a) / N
 ...:    for i in range(N):
 ...:        s += f_plain(a + i * dx)
 ...:    return s * dx
 ...:

In [9]: %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
46.6 ms +- 466 us per loop (mean +- std. dev. of 7 runs, 10 loops each)

Compared to the pure Python approach, this cuts the run time by a little more than one-third (74.9 ms down to 46.6 ms).

Declaring C types

We can annotate the function variables and return types, as well as use cdef and cpdef, to improve performance:

In [10]: %%cython
 ....: cdef double f_typed(double x) except? -2:
 ....:    return x * (x - 1)
 ....: cpdef double integrate_f_typed(double a, double b, int N):
 ....:    cdef int i
 ....:    cdef double s, dx
 ....:    s = 0
 ....:    dx = (b - a) / N
 ....:    for i in range(N):
 ....:        s += f_typed(a + i * dx)
 ....:    return s * dx
 ....:

In [11]: %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
7.76 ms +- 83.8 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

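Annotating the functions with C types yields close to a tenfold performance improvement over the original Python implementation (74.9 ms down to 7.76 ms).

Using ndarray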

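When re-profiling, time is spent creating a Series from each row and calling __getitem__ on both the index and the series (three times for each row). These Python function calls are expensive and can be improved by passing an np.ndarray.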

In [12]: %prun -l 4 df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
 52533 function calls (52515 primitive calls) in 0.019 seconds
 Ordered by: internal time
 List reduced from 161 to 4 due to restriction <4>
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 3000    0.003    0.000    0.012    0.000 series.py:1095(__getitem__)
 3000    0.002    0.000    0.005    0.000 series.py:1220(_get_value)
 3000    0.002    0.000    0.002    0.000 base.py:3777(get_loc)
 3000    0.002    0.000    0.002    0.000 indexing.py:2765(check_dict_or_set_indexers)
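
The idea of working on column arrays instead of per-row Series can be tried in pure Python before any Cython is involved. The sketch below is only an illustration of that data-access pattern, not the tutorial's Cython function; it reuses the pure Python `integrate_f` on a smaller frame and checks that both routes give the same numbers.

```python
import numpy as np
import pandas as pd


def f(x):
    return x * (x - 1)


def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "a": rng.standard_normal(200),
        "b": rng.standard_normal(200),
        "N": rng.integers(100, 1000, 200),
        "x": "x",  # keeps rows at object dtype, as in the tutorial's frame
    }
)

# Row-wise apply: pandas builds a Series per row and each x[...] is a Python-level lookup.
via_apply = df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)

# Column arrays: extract each column once, then iterate over plain NumPy values.
col_a, col_b, col_N = (df[name].to_numpy() for name in ("a", "b", "N"))
via_arrays = np.array([integrate_f(a, b, N) for a, b, N in zip(col_a, col_b, col_N)])

results_match = np.allclose(via_apply.to_numpy(dtype=float), via_arrays)
print(results_match)
```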

In [13]: %%cython
 ....: cimport numpy as np
 ....: import numpy as np
 ....: cdef double f_typed(double x) except? -2:
 ....:    return x * (x - 1)
 ....: cpdef double integrate_f_typed(double a, double b, int N):
 ....:    cdef int i
 ....:    cdef double s, dx
 ....:    s = 0
 ....:    dx = (b - a) / N
 ....:    for i in range(N):
 ....:        s += f_typed(a + i * dx)
 ....:    return s * dx
 ....: cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
 ....:                                           np.ndarray col_N):
 ....:    assert (col_a.dtype == np.float64
 ....:            and col_b.dtype == np.float64 and col_N.dtype == np.dtype(int))
 ....:    cdef Py_ssize_t i, n = len(col_N)
 ....:    assert (len(col_a) == len(col_b) == n)
 ....:    cdef np.ndarray[double] res = np.empty(n)
 ....:    for i in range(len(col_a)):
 ....:        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
 ....:    return res
 ....: 
Content of stderr:
In file included from /home/runner/micromamba/envs/test/lib/python3.10/site-packages/numpy/core/include/numpy/ndarraytypes.h:1929,
 from /home/runner/micromamba/envs/test/lib/python3.10/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
 from /home/runner/micromamba/envs/test/lib/python3.10/site-packages/numpy/core/include/numpy/arrayobject.h:5,
 from /home/runner/.cache/ipython/cython/_cython_magic_96d1519457caba8fa4f96b759be00659f51c6b18.c:1215:
/home/runner/micromamba/envs/test/lib/python3.10/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 17 | #warning "Using deprecated NumPy API, disable it with " \
 |  ^~~~~~~

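This implementation allocates a result array with np.empty (uninitialized rather than zero-filled) and fills it with the result of integrate_f_typed applied to each row. Looping over an ndarray is faster in Cython than looping over a Series object.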

Since apply_integrate_f is typed to accept an np.ndarray, Series.to_numpy() calls are needed to utilize this function.

In [14]: %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
834 us +- 4.04 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
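
The need for `to_numpy()` can be seen without compiling anything: a Series is not an ndarray, and the dtypes of the extracted arrays are what the `assert` in `apply_integrate_f` inspects. The small frame below is illustrative and uses `rng.integers`, which returns 64-bit integers by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "a": rng.standard_normal(4),
        "b": rng.standard_normal(4),
        "N": rng.integers(100, 1000, 4),
    }
)

col_a = df["a"].to_numpy()
col_N = df["N"].to_numpy()

# A Series is not an ndarray, but to_numpy() hands back a NumPy array.
print(isinstance(df["a"], np.ndarray), isinstance(col_a, np.ndarray))

# The float columns must be float64 and the integer column must match np.dtype(int).
print(col_a.dtype, col_N.dtype)
```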

Performance has improved over the prior implementation by almost ten times.

Disabling compiler directives

The majority of the time is now spent in apply_integrate_f. Disabling Cython's boundscheck and wraparound checks can yield more performance.

In [15]: %prun -l 4 apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
 78 function calls in 0.001 seconds
 Ordered by: internal time
 List reduced from 21 to 4 due to restriction <4>
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1    0.001    0.001    0.001    0.001 <string>:1(<module>)
 1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
 3    0.000    0.000    0.000    0.000 frame.py:4062(__getitem__)
 3    0.000    0.000    0.000    0.000 base.py:541(to_numpy)

In [16]: %%cython
 ....: cimport cython
 ....: cimport numpy as np
 ....: import numpy as np
 ....: cdef np.float64_t f_typed(np.float64_t x) except? -2:
 ....:    return x * (x - 1)
 ....: cpdef np.float64_t integrate_f_typed(np.float64_t a, np.float64_t b, np.int64_t N):
 ....:    cdef np.int64_t i
 ....:    cdef np.float64_t s = 0.0, dx
 ....:    dx = (b - a) / N
 ....:    for i in range(N):
 ....:        s += f_typed(a + i * dx)
 ....:    return s * dx
 ....: @cython.boundscheck(False)
 ....: @cython.wraparound(False)
 ....: cpdef np.ndarray[np.float64_t] apply_integrate_f_wrap(
 ....:    np.ndarray[np.float64_t] col_a,
 ....:    np.ndarray[np.float64_t] col_b,
 ....:    np.ndarray[np.int64_t] col_N
 ....: ):
 ....:    cdef np.int64_t i, n = len(col_N)
 ....:    assert len(col_a) == len(col_b) == n
 ....:    cdef np.ndarray[np.float64_t] res = np.empty(n, dtype=np.float64)
 ....:    for i in range(n):
 ....:        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
 ....:    return res
 ....: 
Content of stderr:
In file included from /home/runner/micromamba/envs/test/lib/python3.10/site-packages/numpy/core/include/numpy/ndarraytypes.h:1929,
 from /home/runner/micromamba/envs/test/lib/python3.10/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
 from /home/runner/micromamba/envs/test/lib/python3.10/site-packages/numpy/core/include/numpy/arrayobject.h:5,
 from /home/runner/.cache/ipython/cython/_cython_magic_3bb7bde31cdaf5ab952bfe5a612c6edef03550d0.c:1216:
/home/runner/micromamba/envs/test/lib/python3.10/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 17 | #warning "Using deprecated NumPy API, disable it with " \
 |  ^~~~~~~

In [17]: %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
620 us +- 2.65 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
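
For context on what the two disabled checks normally provide, the sketch below shows the corresponding behaviour of ordinary NumPy indexing from Python. It does not run any Cython; it only illustrates the guards that `boundscheck(False)` and `wraparound(False)` remove from typed array indexing inside the compiled function.

```python
import numpy as np

arr = np.arange(5)

# Wraparound: a negative index counts from the end of the array.
last = arr[-1]
print(last)

# Bounds checking: an index past the end raises instead of reading stray memory.
out_of_bounds_error = None
try:
    arr[5]
except IndexError as exc:
    out_of_bounds_error = exc
print(type(out_of_bounds_error).__name__)
```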

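However, a loop indexer i accessing an invalid location in an array would cause a segfault because memory access isn't checked. For more about boundscheck and wraparound, see the Cython docs on compiler directives.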

Pandas 2.2 Official Tutorials and Guides in Chinese (Part 23) (3): https://developer.aliyun.com/article/1508855
