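Python for Finance, 2nd Edition (Part 2) (5) - Alibaba Cloud Developer Community

Python for Finance, 2nd Edition (Part 2) (4): https://developer.aliyun.com/article/1559405

Concatenation, Joining, and Merging

This section walks through different approaches to combining two simple data sets in the form of DataFrame objects. The two simple data sets are: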

In [84]: df1 = pd.DataFrame(['100', '200', '300', '400'],
                             index=['a', 'b', 'c', 'd'],
                             columns=['A',])
In [85]: df1
Out[85]:      A
         a  100
         b  200
         c  300
         d  400
In [86]: df2 = pd.DataFrame(['200', '150', '50'],
                             index=['f', 'b', 'd'],
                             columns=['B',])
In [87]: df2
Out[87]:      B
         f  200
         b  150
         d   50

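Concatenation

Concatenation or appending basically means that rows are added from one DataFrame object to another one. This can be accomplished via the append() method (deprecated in pandas 1.4 and removed in pandas 2.0) or via the pd.concat() function. A major consideration is how the index values are handled.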

In [88]: df1.append(df2)  # (1)
Out[88]:      A    B
         a  100  NaN
         b  200  NaN
         c  300  NaN
         d  400  NaN
         f  NaN  200
         b  NaN  150
         d  NaN   50
In [89]: df1.append(df2, ignore_index=True)  # (2)
Out[89]:      A    B
         0  100  NaN
         1  200  NaN
         2  300  NaN
         3  400  NaN
         4  NaN  200
         5  NaN  150
         6  NaN   50
In [90]: pd.concat((df1, df2))  # (3)
Out[90]:      A    B
         a  100  NaN
         b  200  NaN
         c  300  NaN
         d  400  NaN
         f  NaN  200
         b  NaN  150
         d  NaN   50
In [91]: pd.concat((df1, df2), ignore_index=True)  # (4)
Out[91]:      A    B
         0  100  NaN
         1  200  NaN
         2  300  NaN
         3  400  NaN
         4  NaN  200
         5  NaN  150
         6  NaN   50

(1) Appends the data from df2 as new rows to df1.

(2) Does the same but ignores the index values.

(3) Has the same effect as the first append operation.

(4) Has the same effect as the second append operation.
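The index handling described above can be checked directly. The following is a minimal sketch that rebuilds df1 and df2 from this section and relies only on pd.concat(); the names stacked, renumbered, and has_duplicate_labels are illustrative choices and do not appear in the original example.

```python
import pandas as pd

# Rebuild the two sample data sets from this section.
df1 = pd.DataFrame(['100', '200', '300', '400'],
                   index=['a', 'b', 'c', 'd'], columns=['A'])
df2 = pd.DataFrame(['200', '150', '50'],
                   index=['f', 'b', 'd'], columns=['B'])

# Keeping the original index: labels 'b' and 'd' now occur twice.
stacked = pd.concat((df1, df2))
has_duplicate_labels = bool(stacked.index.duplicated().any())

# Ignoring the index: the rows are renumbered from 0 to 6.
renumbered = pd.concat((df1, df2), ignore_index=True)

print(has_duplicate_labels)
print(list(renumbered.index))
```

Whether duplicate labels are acceptable depends on how the combined object is used afterwards; label-based selection such as stacked.loc['b'] returns two rows here.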

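Joining

When joining the two data sets, the sequence of the DataFrame objects also matters, but in a different way. Only the index values from the first DataFrame object are used. This default behavior is called a left join.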

In [92]: df1.join(df2)  # (1)
Out[92]:      A    B
         a  100  NaN
         b  200  150
         c  300  NaN
         d  400   50
In [93]: df2.join(df1)  # (2)
Out[93]:      B    A
         f  200  NaN
         b  150  200
         d   50  400

(1) The index values of df1 are relevant.

(2) The index values of df2 are relevant.

In total, four different join methods are available, each of which handles the index values and the corresponding data rows differently.

In [94]: df1.join(df2, how='left')  # (1)
Out[94]:      A    B
         a  100  NaN
         b  200  150
         c  300  NaN
         d  400   50
In [95]: df1.join(df2, how='right')  # (2)
Out[95]:      A    B
         f  NaN  200
         b  200  150
         d  400   50
In [96]: df1.join(df2, how='inner')  # (3)
Out[96]:      A    B
         b  200  150
         d  400   50
In [97]: df1.join(df2, how='outer')  # (4)
Out[97]:      A    B
         a  100  NaN
         b  200  150
         c  300  NaN
         d  400   50
         f  NaN  200

(1) A left join is the default operation.

(2) A right join is the same as reversing the sequence of the DataFrame objects.

(3) An inner join only preserves those index values found in both indices.

(4) An outer join preserves all index values from both indices.
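To see the four join types side by side, the sketch below collects the index labels that each one keeps. It rebuilds df1 and df2 from this section; the dictionary name index_by_how is an illustrative choice.

```python
import pandas as pd

# Rebuild the two sample data sets from this section.
df1 = pd.DataFrame(['100', '200', '300', '400'],
                   index=['a', 'b', 'c', 'd'], columns=['A'])
df2 = pd.DataFrame(['200', '150', '50'],
                   index=['f', 'b', 'd'], columns=['B'])

# Collect the index labels that survive each join type.
index_by_how = {how: list(df1.join(df2, how=how).index)
                for how in ('left', 'right', 'inner', 'outer')}

for how, labels in index_by_how.items():
    print(f'{how:5s} -> {labels}')
```

Comparing the label sets makes the rule of thumb explicit: left and right keep one side's index, inner keeps the intersection, and outer keeps the union.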

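A join can also happen based on an empty DataFrame object. In this case, the columns are created sequentially, which leads to behavior similar to a left join.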

In [98]: df = pd.DataFrame()
In [99]: df['A'] = df1['A']  # (1)
In [100]: df
Out[100]:      A
          a  100
          b  200
          c  300
          d  400
In [101]: df['B'] = df2['B']  # (2)
In [102]: df
Out[102]:      A    B
          a  100  NaN
          b  200  150
          c  300  NaN
          d  400   50

(1) Column A of df1 becomes the first column A.

(2) Column B of df2 becomes the second column B.

Using a dictionary to combine the data sets yields a result similar to an outer join, since the columns are created simultaneously.

In [103]: df = pd.DataFrame({'A': df1['A'], 'B': df2['B']})  # (1)
In [104]: df
Out[104]:      A    B
          a  100  NaN
          b  200  150
          c  300  NaN
          d  400   50
          f  NaN  200

(1) The columns of the DataFrame objects are used as values in the dict object.
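The difference between sequential and simultaneous column creation can be made explicit by comparing the resulting indices. This is a minimal sketch that assigns the Series objects df1['A'] and df2['B']; the names sequential and simultaneous are illustrative.

```python
import pandas as pd

# Rebuild the two sample data sets from this section.
df1 = pd.DataFrame(['100', '200', '300', '400'],
                   index=['a', 'b', 'c', 'd'], columns=['A'])
df2 = pd.DataFrame(['200', '150', '50'],
                   index=['f', 'b', 'd'], columns=['B'])

# Sequential creation: the first assignment fixes the index (a to d),
# so label 'f' from df2 is dropped, as in a left join.
sequential = pd.DataFrame()
sequential['A'] = df1['A']
sequential['B'] = df2['B']

# Simultaneous creation from a dict: both indices are combined,
# so label 'f' is kept, as in an outer join.
simultaneous = pd.DataFrame({'A': df1['A'], 'B': df2['B']})

print(list(sequential.index))
print(sorted(simultaneous.index))
```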

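Merging

While a join operation takes place based on the indices of the DataFrame objects to be joined, a merge operation typically takes place on a column shared between the two data sets. To this end, a new column C is added to both original DataFrame objects: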

In [105]: c = pd.Series([250, 150, 50], index=['b', 'd', 'c'])
          df1['C'] = c
          df2['C'] = c
In [106]: df1
Out[106]:      A      C
          a  100    NaN
          b  200  250.0
          c  300   50.0
          d  400  150.0
In [107]: df2
Out[107]:      B      C
          f  200    NaN
          b  150  250.0
          d   50  150.0

By default, the merge operation in this case takes place based on the single shared column C. Other options are available, however.

In [108]: pd.merge(df1, df2)  # (1)
Out[108]:      A      C    B
          0  100    NaN  200
          1  200  250.0  150
          2  400  150.0   50
In [109]: pd.merge(df1, df2, on='C')  # (1)
Out[109]:      A      C    B
          0  100    NaN  200
          1  200  250.0  150
          2  400  150.0   50
In [110]: pd.merge(df1, df2, how='outer')  # (2)
Out[110]:      A      C    B
          0  100    NaN  200
          1  200  250.0  150
          2  300   50.0  NaN
          3  400  150.0   50

(1) The default merge on column C.

(2) An outer merge is also possible, preserving all data rows.
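The first row of the merged output pairs the row of df1 whose C value is NaN with the row of df2 whose C value is NaN, because pandas matches missing keys with each other. The sketch below reproduces this and shows one possible way to avoid it, namely dropping rows with a missing key before merging; the names merged and merged_no_nan are illustrative.

```python
import pandas as pd

# Rebuild the two sample data sets, including the shared column C.
df1 = pd.DataFrame(['100', '200', '300', '400'],
                   index=['a', 'b', 'c', 'd'], columns=['A'])
df2 = pd.DataFrame(['200', '150', '50'],
                   index=['f', 'b', 'd'], columns=['B'])
c = pd.Series([250, 150, 50], index=['b', 'd', 'c'])
df1['C'] = c
df2['C'] = c

# Default inner merge on C: the two NaN keys match each other.
merged = pd.merge(df1, df2)

# Dropping rows with a missing key first removes that match.
merged_no_nan = pd.merge(df1.dropna(subset=['C']),
                         df2.dropna(subset=['C']))

print(len(merged), len(merged_no_nan))
```

Whether the NaN match is wanted depends on the data; if a missing key means "unknown", dropping those rows first is usually the safer choice.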

Many more types of merge operations are available, a few of which are illustrated in the following code:

In [111]: pd.merge(df1, df2, left_on='A', right_on='B')
Out[111]:      A    C_x    B  C_y
          0  200  250.0  200  NaN
In [112]: pd.merge(df1, df2, left_on='A', right_on='B', how='outer')
Out[112]:      A    C_x    B    C_y
          0  100    NaN  NaN    NaN
          1  200  250.0  200    NaN
          2  300   50.0  NaN    NaN
          3  400  150.0  NaN    NaN
          4  NaN    NaN  150  250.0
          5  NaN    NaN   50  150.0
In [113]: pd.merge(df1, df2, left_index=True, right_index=True)
Out[113]:      A    C_x    B    C_y
          b  200  250.0  150  250.0
          d  400  150.0   50  150.0
In [114]: pd.merge(df1, df2, on='C', left_index=True)
Out[114]:      A      C    B
          f  100    NaN  200
          b  200  250.0  150
          d  400  150.0   50
In [115]: pd.merge(df1, df2, on='C', right_index=True)
Out[115]:      A      C    B
          a  100    NaN  200
          b  200  250.0  150
          d  400  150.0   50
In [116]: pd.merge(df1, df2, on='C', left_index=True, right_index=True)
Out[116]:      A      C    B
          b  200  250.0  150
          d  400  150.0   50
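The column names C_x and C_y in some of the results above come from the default suffixes that pd.merge() attaches to overlapping column names. As a small sketch, the suffixes parameter can replace them with more descriptive names; the suffix strings '_df1' and '_df2' are arbitrary choices for illustration.

```python
import pandas as pd

# Rebuild the two sample data sets, including the shared column C.
df1 = pd.DataFrame(['100', '200', '300', '400'],
                   index=['a', 'b', 'c', 'd'], columns=['A'])
df2 = pd.DataFrame(['200', '150', '50'],
                   index=['f', 'b', 'd'], columns=['B'])
c = pd.Series([250, 150, 50], index=['b', 'd', 'c'])
df1['C'] = c
df2['C'] = c

# Merge on the indices; both objects carry a column C, so the
# overlapping names are disambiguated by the given suffixes.
merged = pd.merge(df1, df2, left_index=True, right_index=True,
                  suffixes=('_df1', '_df2'))

print(list(merged.columns))
```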

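Performance Aspects

Many examples in this chapter illustrate that there are often multiple options for achieving the same goal with pandas. This section compares such options for adding up two columns element-wise. First, the data set, generated with NumPy.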

In [117]: data = np.random.standard_normal((1000000, 2))  # (1)
In [118]: data.nbytes  # (1)
Out[118]: 16000000
In [119]: df = pd.DataFrame(data, columns=['x', 'y'])  # (2)
In [120]: df.info()  # (2)
          <class 'pandas.core.frame.DataFrame'>
          RangeIndex: 1000000 entries, 0 to 999999
          Data columns (total 2 columns):
          x    1000000 non-null float64
          y    1000000 non-null float64
          dtypes: float64(2)
          memory usage: 15.3 MB

(1) The ndarray object with random numbers.

(2) The DataFrame object with the random numbers.

Second, some options for accomplishing the task, together with their performance values.

In [121]: %time res = df['x'] + df['y']  # (1)
          CPU times: user 5.68 ms, sys: 14.5 ms, total: 20.1 ms
          Wall time: 4.06 ms
In [122]: res[:3]
Out[122]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64
In [123]: %time res = df.sum(axis=1)  # (2)
          CPU times: user 44 ms, sys: 14.9 ms, total: 58.9 ms
          Wall time: 57.6 ms
In [124]: res[:3]
Out[124]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64
In [125]: %time res = df.values.sum(axis=1)  # (3)
          CPU times: user 16.1 ms, sys: 1.74 ms, total: 17.8 ms
          Wall time: 16.6 ms
In [126]: res[:3]
Out[126]: array([ 0.3872424 , -0.96934273, -0.86315944])
In [127]: %time res = np.sum(df, axis=1)  # (4)
          CPU times: user 39.7 ms, sys: 8.91 ms, total: 48.7 ms
          Wall time: 47.7 ms
In [128]: res[:3]
Out[128]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64
In [129]: %time res = np.sum(df.values, axis=1)  # (5)
          CPU times: user 16.1 ms, sys: 1.78 ms, total: 17.9 ms
          Wall time: 16.6 ms
In [130]: res[:3]
Out[130]: array([ 0.3872424 , -0.96934273, -0.86315944])

(1) Working with the columns (Series objects) directly is the fastest approach.

(2) This calculates the sums by calling the sum() method on the DataFrame object.

(3) This calculates the sums by calling the sum() method on the ndarray object.

(4) This calculates the sums by calling the function np.sum() on the DataFrame object.

(5) This calculates the sums by calling the function np.sum() on the ndarray object.

Finally, two more options based on the eval() and apply() methods.

In [131]: %time res = df.eval('x + y')  # (1)
          CPU times: user 13.3 ms, sys: 15.6 ms, total: 28.9 ms
          Wall time: 18.5 ms
In [132]: res[:3]
Out[132]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64
In [133]: %time res = df.apply(lambda row: row['x'] + row['y'], axis=1)  # (2)
          CPU times: user 22 s, sys: 71 ms, total: 22.1 s
          Wall time: 22.1 s
In [134]: res[:3]
Out[134]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64

(1) eval() is a method dedicated to the evaluation of (complex) numerical expressions; columns can be addressed directly.

(2) The slowest option is to use the apply() method row by row; this is like looping over all rows on the Python level.

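Note

pandas often provides multiple options for accomplishing the same goal. If unsure, compare several options to verify that the best possible performance is achieved when time is critical. In this simple example, execution times differ by orders of magnitude.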
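Outside of IPython, where the %time magic is not available, such a comparison could be scripted with the standard library's timeit module. The following is a rough sketch under a few assumptions: the data set is much smaller than the one above (20,000 rows) so that apply() finishes quickly, the seed 100 is arbitrary, and the names candidates, timings, and results_agree are illustrative. Absolute timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1,000,000 rows used above so that apply() stays quick.
rng = np.random.default_rng(100)
df = pd.DataFrame(rng.standard_normal((20_000, 2)), columns=['x', 'y'])

# The options compared in this section, wrapped as callables.
candidates = {
    'columns': lambda: df['x'] + df['y'],
    'df.sum': lambda: df.sum(axis=1),
    'values.sum': lambda: df.values.sum(axis=1),
    'eval': lambda: df.eval('x + y'),
    'apply': lambda: df.apply(lambda row: row['x'] + row['y'], axis=1),
}

reference = (df['x'] + df['y']).values
results_agree = {}
timings = {}
for name, func in candidates.items():
    # Check that every option yields the same numbers before timing it.
    results_agree[name] = bool(np.allclose(np.asarray(func()), reference))
    # Best of three single runs, in seconds.
    timings[name] = min(timeit.repeat(func, number=1, repeat=3))

# Print the options from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{name:12s} {seconds * 1e3:10.3f} ms')
```

Taking the minimum of several runs reduces the influence of one-off effects such as caching, which a single %time call does not account for.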

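Conclusions

pandas is a powerful tool for data analysis and has become the central package in the so-called PyData stack. Its DataFrame class is particularly suited to working with tabular data of any kind. Most operations on such objects are vectorized, which leads not only to concise code, as with NumPy, but generally also to high performance. In addition, pandas makes working with incomplete data sets convenient, which is not as easy with NumPy, for example. pandas and the DataFrame class are central to many later chapters of this book, where further features are used and illustrated as needed.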
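The remark about incomplete data sets can be illustrated with a tiny example that is not part of the original text: a plain NumPy sum propagates a missing value, whereas the pandas sum() method skips it by default. The list values is made up for illustration.

```python
import numpy as np
import pandas as pd

# A small data set with one missing value.
values = [1.0, np.nan, 3.0]

# Plain NumPy sum: the NaN propagates into the result.
numpy_sum = np.array(values).sum()

# NumPy needs the dedicated NaN-aware function to skip it.
numpy_nansum = np.nansum(values)

# pandas skips NaN by default (skipna=True).
pandas_sum = pd.Series(values).sum()

print(numpy_sum, numpy_nansum, pandas_sum)
```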

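Further Reading

pandas is a well-documented open source project, with both online documentation and a PDF version available for download.¹ The following page provides all resources:

http://pandas.pydata.org/

As with NumPy, recommended references in book form are:

McKinney, Wes (2017): Python for Data Analysis. 2nd ed., O'Reilly, Beijing et al.
VanderPlas, Jake (2016): Python Data Science Handbook. O'Reilly, Beijing et al.

¹ At the time of this writing, the PDF version has a total of 2,207 pages (version 0.21.1).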
