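### pandas.eval() engines
There are two different expression engines.
The 'numexpr' engine is the more performant of the two and can yield performance improvements over standard Python syntax for large DataFrame objects. It requires the optional dependency numexpr to be installed.
The 'python' engine is generally not useful except for testing other evaluation engines against it. Using eval() with engine='python' yields no performance benefit and may incur a performance hit.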

    In [40]: %timeit df1 + df2 + df3 + df4
    7.42 ms +- 81.8 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

    In [41]: %timeit pd.eval("df1 + df2 + df3 + df4", engine="python")
    8.11 ms +- 161 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

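Whichever engine is chosen, the result should match plain pandas arithmetic; only the run time differs. The following is a minimal sketch of that check. The frame sizes and the seeded generator are assumptions made for illustration, not part of the original benchmark.

```python
import numpy as np
import pandas as pd

# Small illustrative frames (sizes are an assumption of this sketch).
rng = np.random.default_rng(0)
df1, df2 = (pd.DataFrame(rng.standard_normal((1000, 10))) for _ in range(2))

# Plain pandas arithmetic is the reference result.
expected = df1 + df2

# engine="python" evaluates the expression with ordinary pandas operations.
via_python = pd.eval("df1 + df2", engine="python")

# With the default engine=None, pandas uses numexpr when it is installed
# and otherwise falls back to the Python engine.
via_default = pd.eval("df1 + df2")

same_python = np.allclose(via_python.to_numpy(), expected.to_numpy())
same_default = np.allclose(via_default.to_numpy(), expected.to_numpy())
print(same_python, same_default)
```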
### The DataFrame.eval() method
In addition to the top-level pandas.eval() function, you can also evaluate an expression in the "context" of a DataFrame.

    In [42]: df = pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"])

    In [43]: df.eval("a + b")
    Out[43]:
    0   -0.161099
    1    0.805452
    2    0.747447
    3    1.189042
    4   -2.057490
    dtype: float64

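Any expression that is a valid pandas.eval() expression is also a valid DataFrame.eval() expression, with the added benefit that you do not have to prefix the columns you want to evaluate with the name of the DataFrame.
In addition, you can assign columns within an expression, which allows for formulaic evaluation. The assignment target can be a new or an existing column name, and it must be a valid Python identifier.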
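As a small sketch of the difference, the two calls below compute the same column sum, once through the top-level function and once through the method. The frame contents are made up for illustration.

```python
import pandas as pd

# Illustrative data (an assumption of this sketch).
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Top-level function: columns are reached through the frame's name.
top_level = pd.eval("df.a + df.b")

# Method form: column names resolve against df itself.
method = df.eval("a + b")

print(top_level.equals(method))
```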

    In [44]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))

    In [45]: df = df.eval("c = a + b")

    In [46]: df = df.eval("d = a + b + c")

    In [47]: df = df.eval("a = 1")

    In [48]: df
    Out[48]:
       a  b   c   d
    0  1  5   5  10
    1  1  6   7  14
    2  1  7   9  18
    3  1  8  11  22
    4  1  9  13  26

A copy of the DataFrame with the new or modified columns is returned, and the original frame is left unchanged.

    In [49]: df
    Out[49]:
       a  b   c   d
    0  1  5   5  10
    1  1  6   7  14
    2  1  7   9  18
    3  1  8  11  22
    4  1  9  13  26

    In [50]: df.eval("e = a - c")
    Out[50]:
       a  b   c   d   e
    0  1  5   5  10  -4
    1  1  6   7  14  -6
    2  1  7   9  18  -8
    3  1  8  11  22 -10
    4  1  9  13  26 -12

    In [51]: df
    Out[51]:
       a  b   c   d
    0  1  5   5  10
    1  1  6   7  14
    2  1  7   9  18
    3  1  8  11  22
    4  1  9  13  26

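The copy-on-assignment behaviour can be checked directly. The sketch below also uses the inplace=True keyword of DataFrame.eval(), which the text above does not cover; it is shown here only for contrast with the default.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5, 10)})

# Default behaviour: a new frame comes back and df keeps its two columns.
with_c = df.eval("c = a + b")
columns_before = list(df.columns)
columns_of_copy = list(with_c.columns)

# inplace=True modifies df itself and returns None.
result = df.eval("c = a + b", inplace=True)
columns_after = list(df.columns)

print(columns_before, columns_of_copy, result, columns_after)
```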
Multiple column assignments can be performed by using a multi-line string.

    In [52]: df.eval(
       ....:     """
       ....: c = a + b
       ....: d = a + b + c
       ....: a = 1""",
       ....: )
       ....:
    Out[52]:
       a  b   c   d
    0  1  5   6  12
    1  1  6   7  14
    2  1  7   8  16
    3  1  8   9  18
    4  1  9  10  20

The equivalent in standard Python would be:

    In [53]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))

    In [54]: df["c"] = df["a"] + df["b"]

    In [55]: df["d"] = df["a"] + df["b"] + df["c"]

    In [56]: df["a"] = 1

    In [57]: df
    Out[57]:
       a  b   c   d
    0  1  5   5  10
    1  1  6   7  14
    2  1  7   9  18
    3  1  8  11  22
    4  1  9  13  26

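To confirm that the multi-line form and the step-by-step assignments agree, both can be run on the same starting frame. This is a minimal sketch; the names base, via_eval and via_python are chosen here for illustration.

```python
import pandas as pd

base = pd.DataFrame({"a": range(5), "b": range(5, 10)})

# Multi-line eval: each line is an assignment, applied top to bottom.
via_eval = base.eval(
    """
c = a + b
d = a + b + c
a = 1"""
)

# The same steps with ordinary column assignment on a copy.
via_python = base.copy()
via_python["c"] = via_python["a"] + via_python["b"]
via_python["d"] = via_python["a"] + via_python["b"] + via_python["c"]
via_python["a"] = 1

print(via_eval.to_dict("list") == via_python.to_dict("list"))
```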
### eval() performance comparison
pandas.eval() works well with expressions containing large arrays.

    In [58]: nrows, ncols = 20000, 100

    In [59]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]

DataFrame arithmetic:

    In [60]: %timeit df1 + df2 + df3 + df4
    7.34 ms +- 117 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

    In [61]: %timeit pd.eval("df1 + df2 + df3 + df4")
    2.85 ms +- 58.8 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

DataFrame comparison:

    In [62]: %timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)
    5.98 ms +- 37 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

    In [63]: %timeit pd.eval("(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)")
    9.38 ms +- 36.7 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

DataFrame arithmetic with unaligned axes:

    In [64]: s = pd.Series(np.random.randn(50))

    In [65]: %timeit df1 + df2 + df3 + df4 + s
    12.6 ms +- 105 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

    In [66]: %timeit pd.eval("df1 + df2 + df3 + df4 + s")
    3.69 ms +- 62 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

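"Unaligned" here means the Series index covers only some of the frame's column labels, so the operands must be aligned before the arithmetic runs. A small sketch of what that alignment produces, using tiny made-up shapes rather than the benchmark's 20000 x 100 frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 4)))   # columns 0..3
s = pd.Series([10.0, 20.0])          # index 0..1

# The Series index is aligned with the frame's columns; labels missing
# from the Series (columns 2 and 3) become NaN in the result.
plain = df + s
evaluated = pd.eval("df + s")

same = np.allclose(plain.to_numpy(), evaluated.to_numpy(), equal_nan=True)
print(plain)
print(same)
```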
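Note
Operations such as

    1 and 2  # would parse to 1 & 2, but should evaluate to 2
    3 or 4  # would parse to 3 | 4, but should evaluate to 3
    ~1  # this is okay, but slower when using eval

should be performed in Python. An exception will be raised if you try to perform any boolean/bitwise operation with scalar operands that are not of type bool or np.bool_.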
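The reason these scalar operations belong in Python is that the logical and bitwise operators give different answers for integers. The plain-Python sketch below shows the mismatch; the variable names are illustrative.

```python
# Python's `and` / `or` return one of their operands.
logical_and = 1 and 2
logical_or = 3 or 4

# The bitwise operators combine the bits of both operands instead.
bitwise_and = 1 & 2
bitwise_or = 3 | 4

print(logical_and, bitwise_and)
print(logical_or, bitwise_or)
```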
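Here is a plot showing the running time of pandas.eval() as a function of the size of the frame involved in the computation. The two lines are the two different engines.
You will only see the performance benefits of using the numexpr engine with pandas.eval() if your DataFrame has more than approximately 100,000 rows.
This plot was created using a DataFrame with 3 columns, each containing floating point values generated using numpy.random.randn().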
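A comparison of this kind can be reproduced with the standard timeit module. The helper below is a hypothetical sketch, not the script that produced the plot: time_eval, its defaults and the 10,000-row call are all assumptions. The frames are passed through local_dict because the lambda being timed does not otherwise capture them.

```python
import timeit

import numpy as np
import pandas as pd


def time_eval(nrows, repeats=3):
    """Return best-of-`repeats` seconds for plain pandas vs. pd.eval."""
    rng = np.random.default_rng(0)
    frames = {
        name: pd.DataFrame(rng.standard_normal((nrows, 3)))
        for name in ("df1", "df2", "df3")
    }
    df1, df2, df3 = frames["df1"], frames["df2"], frames["df3"]

    plain = min(
        timeit.repeat(lambda: df1 + df2 + df3, number=5, repeat=repeats)
    )
    evaluated = min(
        timeit.repeat(
            lambda: pd.eval("df1 + df2 + df3", local_dict=frames),
            number=5,
            repeat=repeats,
        )
    )
    return {"plain": plain, "eval": evaluated}


timings = time_eval(10_000)
print(timings)
```

Calling time_eval over a range of row counts gives the raw numbers for a plot like the one described; the actual crossover point depends on the machine and on whether numexpr is installed.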
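### Expression evaluation limitations with numexpr
Expressions that would result in an object dtype, or that involve datetime operations (because of NaT), must be evaluated in Python space, but part of an expression can still be evaluated with numexpr. For example: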

    In [67]: df = pd.DataFrame(
       ....:     {"strings": np.repeat(list("cba"), 3), "nums": np.repeat(range(3), 3)}
       ....: )
       ....:

    In [68]: df
    Out[68]:
      strings  nums
    0       c     0
    1       c     0
    2       c     0
    3       b     1
    4       b     1
    5       b     1
    6       a     2
    7       a     2
    8       a     2

    In [69]: df.query("strings == 'a' and nums == 1")
    Out[69]:
    Empty DataFrame
    Columns: [strings, nums]
    Index: []

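The query above returns no rows because every 'a' row has nums equal to 2. For a non-empty illustration of the same kind of mixed string and numeric expression, the sketch below filters on 'b' instead and compares the result with an ordinary boolean mask; the choice of 'b' is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"strings": np.repeat(list("cba"), 3), "nums": np.repeat(range(3), 3)}
)

# Rows where both conditions hold: the 'b' rows all have nums == 1.
via_query = df.query("strings == 'b' and nums == 1")

# Equivalent boolean-mask selection in plain pandas.
via_mask = df[(df["strings"] == "b") & (df["nums"] == 1)]

print(via_query.index.tolist())
```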
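The numeric part of the comparison (nums == 1) will be evaluated by numexpr, and the object part of the comparison (strings == 'a') will be evaluated by Python.
## Cython (writing C extensions for pandas)
For many use cases, writing pandas code in pure Python and NumPy is sufficient. In some computationally heavy applications, however, sizable speed-ups can be achieved by offloading work to Cython.
This tutorial assumes you have already refactored as much as possible in Python, for example by removing for-loops and making use of NumPy vectorization. It is always worth optimising in Python first.
This tutorial walks through a "typical" process of cythonizing a slow computation. We use an example from the Cython documentation, but in the context of pandas. Our final cythonized solution is around 100 times faster than the pure Python solution.
### Pure Python
We have a DataFrame to which we want to apply a function row-wise.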

    In [1]: df = pd.DataFrame(
       ...:     {
       ...:         "a": np.random.randn(1000),
       ...:         "b": np.random.randn(1000),
       ...:         "N": np.random.randint(100, 1000, (1000)),
       ...:         "x": "x",
       ...:     }
       ...: )
       ...:

    In [2]: df
    Out[2]:
                a         b    N  x
    0    0.469112 -0.218470  585  x
    1   -0.282863 -0.061645  841  x
    2   -1.509059 -0.723780  251  x
    3   -1.135632  0.551225  972  x
    4    1.212112 -0.497767  181  x
    ..        ...       ...  ... ..
    995 -1.512743  0.874737  374  x
    996  0.933753  1.120790  246  x
    997 -0.308013  0.198768  157  x
    998 -0.079915  1.757555  977  x
    999 -1.010589 -1.115680  770  x

    [1000 rows x 4 columns]

Here is the function in pure Python:

    In [3]: def f(x):
       ...:     return x * (x - 1)
       ...:

    In [4]: def integrate_f(a, b, N):
       ...:     s = 0
       ...:     dx = (b - a) / N
       ...:     for i in range(N):
       ...:         s += f(a + i * dx)
       ...:     return s * dx
       ...:

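integrate_f computes a left Riemann sum of f over [a, b] with N steps. Since the tutorial assumes Python-level optimisation has been tried first, here is a sketch of what vectorizing the inner loop with NumPy could look like. integrate_f_numpy is a hypothetical helper introduced for illustration and is not part of the tutorial.

```python
import numpy as np


def f(x):
    return x * (x - 1)


def integrate_f(a, b, N):
    # Pure-Python loop from the tutorial (left Riemann sum).
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def integrate_f_numpy(a, b, N):
    # Hypothetical vectorized variant: evaluate f on all N sample points at once.
    dx = (b - a) / N
    x = a + np.arange(N) * dx
    return f(x).sum() * dx


loop_result = integrate_f(0.0, 2.0, 500)
numpy_result = integrate_f_numpy(0.0, 2.0, 500)
print(abs(loop_result - numpy_result) < 1e-9)
```

This removes the Python-level loop inside a single call, but the row-wise apply over the DataFrame still runs in Python, which is the part the remainder of the tutorial addresses.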
We achieve our result by using DataFrame.apply() (row-wise):

    In [5]: %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
    74.9 ms +- 728 us per loop (mean +- std. dev. of 7 runs, 10 loops each)

Let's see where the time is spent during this operation, using the prun IPython magic function:

    # most time consuming 4 calls
    In [6]: %prun -l 4 df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)  # noqa E999
             605956 function calls (605938 primitive calls) in 0.167 seconds

       Ordered by: internal time
       List reduced from 163 to 4 due to restriction <4>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1000    0.097    0.000    0.148    0.000 <ipython-input-4-c2a74e076cf0>:1(integrate_f)
       552423    0.051    0.000    0.051    0.000 <ipython-input-3-c138bdd570e3>:1(f)
         3000    0.003    0.000    0.012    0.000 series.py:1095(__getitem__)
         3000    0.002    0.000    0.005    0.000 series.py:1220(_get_value)

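Outside IPython, a comparable report can be collected with the standard-library cProfile and pstats modules. The sketch below profiles the two pure-Python functions directly rather than through DataFrame.apply(); the call count and arguments are arbitrary choices for illustration.

```python
import cProfile
import io
import pstats


def f(x):
    return x * (x - 1)


def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    integrate_f(0.0, 1.0, 500)
profiler.disable()

# Sort by internal time and keep the top 4 entries, like `%prun -l 4`.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(4)
report = buffer.getvalue()
print(report)
```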
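By far the majority of the time is spent inside either integrate_f or f, so we will concentrate our efforts on cythonizing these two functions.
### Plain Cython
First, we need to import the Cython magic function into IPython: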

    In [7]: %load_ext Cython

Now, let's simply copy our functions over to Cython:

    In [8]: %%cython
       ...: def f_plain(x):
       ...:     return x * (x - 1)
       ...: def integrate_f_plain(a, b, N):
       ...:     s = 0
       ...:     dx = (b - a) / N
       ...:     for i in range(N):
       ...:         s += f_plain(a + i * dx)
       ...:     return s * dx
       ...:

    In [9]: %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
    46.6 ms +- 466 us per loop (mean +- std. dev. of 7 runs, 10 loops each)

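Compared to the pure Python approach, this cuts the run time by a little over one-third (from 74.9 ms to 46.6 ms).
### Declaring C types
We can annotate the function variables and return types, as well as use cdef and cpdef, to improve performance: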

    In [10]: %%cython
       ....: cdef double f_typed(double x) except? -2:
       ....:     return x * (x - 1)
       ....: cpdef double integrate_f_typed(double a, double b, int N):
       ....:     cdef int i
       ....:     cdef double s, dx
       ....:     s = 0
       ....:     dx = (b - a) / N
       ....:     for i in range(N):
       ....:         s += f_typed(a + i * dx)
       ....:     return s * dx
       ....:

    In [11]: %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
    7.76 ms +- 83.8 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

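Annotating the functions with C types yields roughly a tenfold performance improvement over the original Python implementation (from 74.9 ms to 7.76 ms).
### Using ndarray
When re-profiling, the time is now spent creating a Series from each row and calling __getitem__ on both the index and the series (three times for each row). These Python function calls are expensive and can be avoided by passing an np.ndarray instead.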

    In [12]: %prun -l 4 df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
             52533 function calls (52515 primitive calls) in 0.019 seconds

       Ordered by: internal time
       List reduced from 161 to 4 due to restriction <4>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         3000    0.003    0.000    0.012    0.000 series.py:1095(__getitem__)
         3000    0.002    0.000    0.005    0.000 series.py:1220(_get_value)
         3000    0.002    0.000    0.002    0.000 base.py:3777(get_loc)
         3000    0.002    0.000    0.002    0.000 indexing.py:2765(check_dict_or_set_indexers)


    In [13]: %%cython
       ....: cimport numpy as np
       ....: import numpy as np
       ....: cdef double f_typed(double x) except? -2:
       ....:     return x * (x - 1)
       ....: cpdef double integrate_f_typed(double a, double b, int N):
       ....:     cdef int i
       ....:     cdef double s, dx
       ....:     s = 0
       ....:     dx = (b - a) / N
       ....:     for i in range(N):
       ....:         s += f_typed(a + i * dx)
       ....:     return s * dx
       ....: cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
       ....:                                            np.ndarray col_N):
       ....:     assert (col_a.dtype == np.float64
       ....:             and col_b.dtype == np.float64 and col_N.dtype == np.dtype(int))
       ....:     cdef Py_ssize_t i, n = len(col_N)
       ....:     assert (len(col_a) == len(col_b) == n)
       ....:     cdef np.ndarray[double] res = np.empty(n)
       ....:     for i in range(len(col_a)):
       ....:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
       ....:     return res
       ....:

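This implementation allocates an uninitialized result array with np.empty and fills it with the result of integrate_f_typed applied to each row. Looping over an ndarray is faster in Cython than looping over a Series object.
Since apply_integrate_f is typed to accept an np.ndarray, Series.to_numpy() calls are needed to use this function.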
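Before calling an ndarray-typed function, it can help to confirm what Series.to_numpy() actually returns. The sketch below builds a small stand-in frame and extracts the three columns. Requesting dtype=np.int64 for N is an assumption of this sketch that sidesteps platform-dependent default integer widths; it is not something the tutorial itself does.

```python
import numpy as np
import pandas as pd

# Small stand-in for the tutorial's frame (5 rows instead of 1000).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "a": rng.standard_normal(5),
        "b": rng.standard_normal(5),
        "N": rng.integers(100, 1000, 5),
        "x": "x",
    }
)

col_a = df["a"].to_numpy()
col_b = df["b"].to_numpy()
# Explicit int64 is an assumption of this sketch (see the note above).
col_N = df["N"].to_numpy(dtype=np.int64)

print(type(col_a).__name__, col_a.dtype, col_b.dtype, col_N.dtype)
```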

    In [14]: %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
    834 us +- 4.04 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)

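Performance has improved over the prior implementation by almost ten times.
### Disabling compiler directives
The majority of the time is now spent in apply_integrate_f. Disabling Cython's boundscheck and wraparound checks can yield more performance.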

    In [15]: %prun -l 4 apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
             78 function calls in 0.001 seconds

       Ordered by: internal time
       List reduced from 21 to 4 due to restriction <4>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.001    0.001    0.001    0.001 <string>:1(<module>)
            1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
            3    0.000    0.000    0.000    0.000 frame.py:4062(__getitem__)
            3    0.000    0.000    0.000    0.000 base.py:541(to_numpy)


    In [16]: %%cython
       ....: cimport cython
       ....: cimport numpy as np
       ....: import numpy as np
       ....: cdef np.float64_t f_typed(np.float64_t x) except? -2:
       ....:     return x * (x - 1)
       ....: cpdef np.float64_t integrate_f_typed(np.float64_t a, np.float64_t b, np.int64_t N):
       ....:     cdef np.int64_t i
       ....:     cdef np.float64_t s = 0.0, dx
       ....:     dx = (b - a) / N
       ....:     for i in range(N):
       ....:         s += f_typed(a + i * dx)
       ....:     return s * dx
       ....: @cython.boundscheck(False)
       ....: @cython.wraparound(False)
       ....: cpdef np.ndarray[np.float64_t] apply_integrate_f_wrap(
       ....:     np.ndarray[np.float64_t] col_a,
       ....:     np.ndarray[np.float64_t] col_b,
       ....:     np.ndarray[np.int64_t] col_N
       ....: ):
       ....:     cdef np.int64_t i, n = len(col_N)
       ....:     assert len(col_a) == len(col_b) == n
       ....:     cdef np.ndarray[np.float64_t] res = np.empty(n, dtype=np.float64)
       ....:     for i in range(n):
       ....:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
       ....:     return res
       ....:


    In [17]: %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
    620 us +- 2.65 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)

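The boundscheck and wraparound directives switch off two behaviours that ordinary NumPy indexing provides by default. The plain-NumPy sketch below shows what each check normally does, to make clear what is being given up; it illustrates the semantics only and does not run any Cython code.

```python
import numpy as np

arr = np.array([10.0, 20.0, 30.0])

# Wraparound: a negative index counts from the end of the array.
last = arr[-1]

# Bounds checking: an index past the end raises instead of reading memory.
try:
    arr[3]
    out_of_bounds_raised = False
except IndexError:
    out_of_bounds_raised = True

print(last, out_of_bounds_raised)
```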
However, a loop indexer i accessing an invalid location in an array would cause a segfault, because memory access is no longer checked. For more about boundscheck and wraparound, see the Cython docs on compiler directives.