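Python for Finance, 2nd Edition (GPT Retranslation) (II) (3) - Alibaba Cloud Developer Community

Previous part: Python for Finance, 2nd Edition (GPT Retranslation) (II) (2): https://developer.aliyun.com/article/1559305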

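Arrays and DataFrames

In general, you can generate a DataFrame object from an ndarray object. Conversely, you can just as easily generate an ndarray object from a DataFrame, either through the values attribute of the DataFrame class or with NumPy's np.array() function.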
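The round trip described above can be sketched as follows. This is a minimal illustration on a small made-up array (the name df_demo and the column labels are assumptions for this example, not the chapter's data). It also shows to_numpy(), a pandas method not mentioned in the text that serves the same purpose as the values attribute.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 array standing in for the chapter's data.
a = np.arange(6, dtype=float).reshape(3, 2)

# ndarray -> DataFrame: attach column labels (names chosen for illustration).
df_demo = pd.DataFrame(a, columns=['No1', 'No2'])

# DataFrame -> ndarray: three routes that agree for homogeneous numeric data.
v1 = df_demo.values        # attribute named in the text
v2 = np.array(df_demo)     # NumPy constructor named in the text
v3 = df_demo.to_numpy()    # additional pandas method (not from the text)

print(type(v1).__name__, v1.shape)  # ndarray (3, 2)
print(np.array_equal(v1, a), np.array_equal(v2, a), np.array_equal(v3, a))
```

All three results hold the same numbers as the original array; the index and column labels are dropped in the conversion.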

Basic Analytics

Like NumPy ndarray objects, the pandas DataFrame class has many convenience methods built in. As a starting point, consider the info() and describe() methods.

In [40]: df.info()  # (1)
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 9 entries, 2019-01-31 to 2019-09-30
         Freq: M
         Data columns (total 4 columns):
         No1    9 non-null float64
         No2    9 non-null float64
         No3    9 non-null float64
         No4    9 non-null float64
         dtypes: float64(4)
         memory usage: 360.0 bytes
In [41]: df.describe()  # (2)
Out[41]:             No1       No2       No3       No4
         count  9.000000  9.000000  9.000000  9.000000
         mean  -0.150212  0.701033  0.289193 -0.387788
         std    0.988306  0.457685  0.579920  0.877532
         min   -1.749765  0.055676 -0.458027 -1.443217
         25%   -0.583595  0.342680 -0.251879 -1.070043
         50%   -0.326238  0.816454  0.222400 -0.455947
         75%    0.184519  0.937082  0.731000 -0.104411
         max    1.618982  1.541605  1.153036  1.361556

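(1) Provides metainformation about the data, the columns, and the index.

(2) Provides helpful summary statistics per column (for numerical data).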
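To make the "for numerical data" qualifier concrete, here is a small sketch on a made-up frame (df_demo and its column names are assumptions for this example). With default arguments, describe() on a frame that mixes numeric and text columns summarizes only the numeric ones.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
df_demo = pd.DataFrame({'num': [1.0, 2.0, 3.0, 4.0],
                        'txt': ['a', 'b', 'c', 'd']})

summary = df_demo.describe()

print(list(summary.columns))      # ['num']
print(summary.loc['50%', 'num'])  # 2.5 (the median of 1, 2, 3, 4)
print(list(summary.index))        # count, mean, std, min, quartiles, max
```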

In addition, you can easily compute column-wise or row-wise sums, means, and cumulative sums, as shown here:

In [42]: df.sum()  # (1)
Out[42]: No1   -1.351906
         No2    6.309298
         No3    2.602739
         No4   -3.490089
         dtype: float64
In [43]: df.mean()  # (2)
Out[43]: No1   -0.150212
         No2    0.701033
         No3    0.289193
         No4   -0.387788
         dtype: float64
In [44]: df.mean(axis=0)  # (2)
Out[44]: No1   -0.150212
         No2    0.701033
         No3    0.289193
         No4   -0.387788
         dtype: float64
In [45]: df.mean(axis=1)  # (3)
Out[45]: 2019-01-31   -0.126621
         2019-02-28    0.161669
         2019-03-31    0.010661
         2019-04-30    0.200390
         2019-05-31   -0.264500
         2019-06-30    0.516568
         2019-07-31    0.803539
         2019-08-31   -0.372845
         2019-09-30    0.088650
         Freq: M, dtype: float64
In [46]: df.cumsum()  # (4)
Out[46]:                  No1       No2       No3       No4
         2019-01-31 -1.749765  0.342680  1.153036 -0.252436
         2019-02-28 -0.768445  0.856899  1.374215 -1.322479
         2019-03-31 -0.957941  1.111901  0.916188 -0.887316
         2019-04-30 -1.541536  1.928748  1.588909 -0.991727
         2019-05-31 -2.072816  2.958480  1.150774 -2.110045
         2019-06-30 -0.453834  4.500086  0.898895 -2.952481
         2019-07-31 -0.269316  5.437168  1.629895 -1.590925
         2019-08-31 -0.595554  5.492844  1.852294 -3.034142
         2019-09-30 -1.351906  6.309298  2.602739 -3.490089

(1) Column-wise sum.

(2) Column-wise mean.

(3) Row-wise mean.

(4) Column-wise cumulative sum (starting at the first index position).
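The axis convention is easier to check on numbers small enough to verify by hand. The following sketch uses a made-up two-column frame (df_demo is an assumption for this example): axis=0, the default, collapses the rows and yields one value per column, while axis=1 yields one value per row.

```python
import pandas as pd

# Hypothetical data chosen so the results can be checked by hand.
df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                        'B': [10.0, 20.0, 30.0]})

col_sums = df_demo.sum()          # axis=0 by default: one value per column
row_means = df_demo.mean(axis=1)  # one value per row
running = df_demo.cumsum()        # running total down each column

print(col_sums.to_dict())         # {'A': 6.0, 'B': 60.0}
print(row_means.tolist())         # [5.5, 11.0, 16.5]
print(running['A'].tolist())      # [1.0, 3.0, 6.0]
```

The last row of the cumulative sum equals the column sums, which is also visible in the transcript above when comparing the final row of Out[46] with Out[42].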

DataFrame objects also understand NumPy universal functions, as expected:

In [47]: np.mean(df)  # (1)
Out[47]: No1   -0.150212
         No2    0.701033
         No3    0.289193
         No4   -0.387788
         dtype: float64
In [48]: np.log(df)  # (2)
         /Users/yves/miniconda3/envs/base/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in log
           """Entry point for launching an IPython kernel.
Out[48]:                  No1       No2       No3       No4
 2019-01-31       NaN -1.070957  0.142398       NaN
 2019-02-28 -0.018856 -0.665106 -1.508780       NaN
 2019-03-31       NaN -1.366486       NaN -0.832033
 2019-04-30       NaN -0.202303 -0.396425       NaN
 2019-05-31       NaN  0.029299       NaN       NaN
 2019-06-30  0.481797  0.432824       NaN       NaN
 2019-07-31 -1.690005 -0.064984 -0.313341  0.308628
 2019-08-31       NaN -2.888206 -1.503279       NaN
 2019-09-30       NaN -0.202785 -0.287089       NaN
In [49]: np.sqrt(abs(df))  # (3)
Out[49]:                  No1       No2       No3       No4
 2019-01-31  1.322787  0.585389  1.073795  0.502430
 2019-02-28  0.990616  0.717091  0.470297  1.034429
 2019-03-31  0.435311  0.504977  0.676777  0.659669
 2019-04-30  0.763934  0.903796  0.820196  0.323127
 2019-05-31  0.728890  1.014757  0.661918  1.057506
 2019-06-30  1.272392  1.241614  0.501876  0.917843
 2019-07-31  0.429556  0.968030  0.854986  1.166857
 2019-08-31  0.571173  0.235958  0.471593  1.201340
 2019-09-30  0.869685  0.903578  0.866282  0.675238
In [50]: np.sqrt(abs(df)).sum()  # (4)
Out[50]: No1    7.384345
 No2    7.075190
 No3    6.397719
 No4    7.538440
 dtype: float64
In [51]: 100 * df + 100  # (5)
Out[51]:                    No1         No2         No3         No4
 2019-01-31  -74.976547  134.268040  215.303580   74.756396
 2019-02-28  198.132079  151.421884  122.117967   -7.004333
 2019-03-31   81.050417  125.500144   54.197301  143.516349
 2019-04-30   41.640495  181.684707  167.272081   89.558886
 2019-05-31   46.871962  202.973269   56.186438  -11.831825
 2019-06-30  261.898166  254.160517   74.812086   15.756426
 2019-07-31  118.451869  193.708220  173.100034  236.155613
 2019-08-31   67.376194  105.567601  122.239961  -44.321700
 2019-09-30   24.364769  181.645401  175.044476   54.405307

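(1) Column-wise mean.

(2) Element-wise natural logarithm; a warning is raised, but the calculation runs through, resulting in multiple NaN values.

(3) Element-wise square root of the absolute values …

(4) … and the column-wise sum of the results.

(5) A linear transformation of the numerical data.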

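NumPy Universal Functions

In general, you can apply NumPy universal functions to pandas DataFrame objects whenever they could be applied to an ndarray object containing the same type of data.

pandas is quite error tolerant: it captures errors and simply places a NaN value where the respective mathematical operation fails. Moreover, as briefly shown before, in many cases you can work with such incomplete data sets as if they were complete. This is convenient, because real-world data sets are incomplete more often than one might wish.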
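A short sketch of what "working with incomplete data as if it were complete" can look like in practice. The frame df_demo is made up for this example, and np.errstate is used only to silence the warning that the transcript above shows. By default, pandas reductions such as mean() skip NaN values; passing skipna=False turns that off.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the logarithm of -1.0 is undefined and becomes NaN.
df_demo = pd.DataFrame({'x': [1.0, -1.0, np.e]})

with np.errstate(invalid='ignore'):   # suppress the RuntimeWarning
    logs = np.log(df_demo)

print(logs['x'].isna().tolist())      # [False, True, False]
print(logs['x'].count())              # 2 (only non-NaN entries are counted)
print(logs['x'].mean())               # about 0.5: mean of log(1) and log(e)
print(logs['x'].mean(skipna=False))   # nan
```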

Basic Visualization

In general, once the data is stored in a DataFrame object, plotting it takes a single line of code (see Figure 5-1):

In [52]: from pylab import plt, mpl  # (1)
         plt.style.use('seaborn')  # (1); renamed 'seaborn-v0_8' as of Matplotlib 3.6
         mpl.rcParams['font.family'] = 'serif'  # (1)
         %matplotlib inline
In [53]: df.cumsum().plot(lw=2.0, figsize=(10, 6));  # (2)
         # plt.savefig('../../images/ch05/pd_plot_01.png')

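(1) Customizes the plotting style.

(2) Plots the cumulative sums of the four columns as a line plot.

Figure 5-1. Line plot of a `DataFrame` object

In essence, pandas provides a wrapper around matplotlib (see Chapter 7) that is specifically designed for DataFrame objects. Table 5-4 lists the parameters that the plot() method takes.

Table 5-4. Parameters of the plot() method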

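Parameter	Format	Description
`x`	Label/position, default `None`	Only used when column values are x-ticks
`y`	Label/position, default `None`	Only used when column values are y-ticks
`subplots`	Boolean, default `False`	Plot columns in subplots
`sharex`	Boolean, default `True`	Sharing of the x-axis
`sharey`	Boolean, default `False`	Sharing of the y-axis
`use_index`	Boolean, default `True`	Use `DataFrame.index` as x-ticks
`stacked`	Boolean, default `False`	Stack (only for bar plots)
`sort_columns`	Boolean, default `False`	Sort columns alphabetically before plotting
`title`	String, default `None`	Title for the plot
`grid`	Boolean, default `False`	Horizontal and vertical grid lines
`legend`	Boolean, default `True`	Legend of labels
`ax`	`matplotlib` axis object	`matplotlib` axis object to use for plotting
`style`	String or list/dictionary	Line plotting style (for each column)
`kind`	`"line"`/`"bar"`/`"barh"`/`"kde"`/`"density"`	Type of plot
`logx`	Boolean, default `False`	Logarithmic scaling of the x-axis
`logy`	Boolean, default `False`	Logarithmic scaling of the y-axis
`xticks`	Sequence, default `Index`	X-ticks for the plot
`yticks`	Sequence, default `Values`	Y-ticks for the plot
`xlim`	2-tuple, list	Boundaries for the x-axis
`ylim`	2-tuple, list	Boundaries for the y-axis
`rot`	Integer, default `None`	Rotation of the x-ticks
`secondary_y`	Boolean/sequence, default `False`	Plot on a secondary y-axis
`mark_right`	Boolean, default `True`	Automatic labeling of the secondary axis
`colormap`	String/colormap object, default `None`	Color map to use for plotting
`kwds`	Keywords	Options to pass to `matplotlib`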
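A few of the parameters from Table 5-4 can be combined as in the following sketch. It uses randomly generated stand-in data (df_demo, the seed, and the daily index are assumptions for this example, not the chapter's data) and selects the non-interactive Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no window is opened
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)  # arbitrary seed, for reproducibility only
df_demo = pd.DataFrame(rng.standard_normal((9, 4)),
                       columns=['No1', 'No2', 'No3', 'No4'],
                       index=pd.date_range('2019-01-01', periods=9, freq='D'))

# One subplot per column, shared x-axis, grid lines, and a common title.
axes = df_demo.cumsum().plot(subplots=True, sharex=True, grid=True,
                             legend=True, title='Cumulative sums',
                             figsize=(10, 6))
print(len(axes))  # 4, one Axes object per column

# Stacked bar plot of two columns with rotated x-tick labels.
ax = df_demo[['No1', 'No2']].plot(kind='bar', stacked=True, rot=45,
                                  figsize=(10, 6))
```

With subplots=True the call returns an array of Axes objects rather than a single one, which matters when the plot is customized further with matplotlib.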

As another example, consider a bar plot of the same data (see Figure 5-2).

In [54]: df.plot(kind='bar', figsize=(10, 6));  # (1)
         # plt.savefig('../../images/ch05/pd_plot_02.png')

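(1) Uses the kind parameter to change the plot type.

Figure 5-2. Bar plot of a DataFrame object

The Series Class

So far, we have worked mainly with the pandas DataFrame class. Series is another important class that comes with pandas. It is characterized by having only a single column of data. In that sense, it is a specialization of the DataFrame class that shares many, but not all, of its characteristics and capabilities. A Series object is typically obtained when a single column is selected from a multicolumn DataFrame object: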

In [55]: type(df)
Out[55]: pandas.core.frame.DataFrame
In [56]: s = df['No1']
In [57]: s
Out[57]: 2019-01-31   -1.749765
         2019-02-28    0.981321
         2019-03-31   -0.189496
         2019-04-30   -0.583595
         2019-05-31   -0.531280
         2019-06-30    1.618982
         2019-07-31    0.184519
         2019-08-31   -0.326238
         2019-09-30   -0.756352
         Freq: M, Name: No1, dtype: float64
In [58]: type(s)
Out[58]: pandas.core.series.Series
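Whether a selection yields a Series or a DataFrame depends on how the column is requested. The following sketch, on a made-up frame (df_demo is an assumption for this example), contrasts a single label with a one-element list of labels.

```python
import pandas as pd

# Hypothetical two-column frame.
df_demo = pd.DataFrame({'No1': [1.0, 2.0, 3.0],
                        'No2': [4.0, 5.0, 6.0]})

s = df_demo['No1']      # single label: a Series
sub = df_demo[['No1']]  # list of labels: a one-column DataFrame

print(type(s).__name__, s.shape)      # Series (3,)
print(type(sub).__name__, sub.shape)  # DataFrame (3, 1)
print(s.name)                         # No1 (the column label is kept as name)

back = s.to_frame()                   # Series -> one-column DataFrame
print(back.shape)                     # (3, 1)
```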

The main DataFrame methods are available for Series objects as well. As an illustration, consider the mean() and plot() methods (see Figure 5-3):

In [59]: s.mean()
Out[59]: -0.15021177307319458
In [60]: s.plot(lw=2.0, figsize=(10, 6));
         # plt.savefig('../../images/ch05/pd_plot_03.png')

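Figure 5-3. Line plot of a Series object

GroupBy Operations

pandas has powerful and flexible grouping capabilities. They work similarly to grouping in SQL as well as to pivot tables in Microsoft Excel. To have something to group by, we add a column indicating the quarter to which the respective data belongs: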

In [61]: df['Quarter'] = ['Q1', 'Q1', 'Q1', 'Q2', 'Q2',
                          'Q2', 'Q3', 'Q3', 'Q3']
         df
Out[61]:                  No1       No2       No3       No4 Quarter
         2019-01-31 -1.749765  0.342680  1.153036 -0.252436      Q1
         2019-02-28  0.981321  0.514219  0.221180 -1.070043      Q1
         2019-03-31 -0.189496  0.255001 -0.458027  0.435163      Q1
         2019-04-30 -0.583595  0.816847  0.672721 -0.104411      Q2
         2019-05-31 -0.531280  1.029733 -0.438136 -1.118318      Q2
         2019-06-30  1.618982  1.541605 -0.251879 -0.842436      Q2
         2019-07-31  0.184519  0.937082  0.731000  1.361556      Q3
         2019-08-31 -0.326238  0.055676  0.222400 -1.443217      Q3
         2019-09-30 -0.756352  0.816454  0.750445 -0.455947      Q3

Now we can group by the Quarter column and output statistics for the individual groups:

In [62]: groups = df.groupby('Quarter')  # (1)
In [63]: groups.size()  # (2)
Out[63]: Quarter
         Q1    3
         Q2    3
         Q3    3
         dtype: int64
In [64]: groups.mean()  # (3)
Out[64]:               No1       No2       No3       No4
         Quarter
         Q1      -0.319314  0.370634  0.305396 -0.295772
         Q2       0.168035  1.129395 -0.005765 -0.688388
         Q3      -0.299357  0.603071  0.567948 -0.179203
In [65]: groups.max()  # (4)
Out[65]:               No1       No2       No3       No4
         Quarter
         Q1       0.981321  0.514219  1.153036  0.435163
         Q2       1.618982  1.541605  0.672721 -0.104411
         Q3       0.184519  0.937082  0.750445  1.361556
In [66]: groups.aggregate([min, max]).round(2)  # (5)
Out[66]:           No1         No2         No3         No4
                   min   max   min   max   min   max   min   max
         Quarter
         Q1      -1.75  0.98  0.26  0.51 -0.46  1.15 -1.07  0.44
         Q2      -0.58  1.62  0.82  1.54 -0.44  0.67 -1.12 -0.10
         Q3      -0.76  0.18  0.06  0.94  0.22  0.75 -1.44  1.36

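(1) Groups according to the Quarter column.

(2) Gives the number of rows in each group.

(3) Gives the mean per column.

(4) Gives the maximum value per column.

(5) Gives both the minimum and maximum values per column.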
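The transcript passes the Python built-ins min and max to aggregate(). As a hedged alternative, the same aggregations can be requested by name as strings; recent pandas releases may emit a deprecation warning for the built-in callables. The sketch below uses made-up data (df_demo and the result labels lo and hi are assumptions for this example) and also shows named aggregation, which gives the result columns explicit labels.

```python
import pandas as pd

# Hypothetical data with two groups of two rows each.
df_demo = pd.DataFrame({'No1': [1.0, 3.0, 5.0, 7.0],
                        'Quarter': ['Q1', 'Q1', 'Q2', 'Q2']})
groups = df_demo.groupby('Quarter')

# Aggregations requested by name instead of as callables.
stats = groups['No1'].agg(['min', 'max', 'mean'])
print(stats.loc['Q1'].tolist())  # [1.0, 3.0, 2.0]

# Named aggregation: result label = (source column, aggregation).
named = groups.agg(lo=('No1', 'min'), hi=('No1', 'max'))
print(named.loc['Q2', 'hi'])     # 7.0
```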

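Grouping can also be done with multiple columns. To this end, another column is introduced that indicates whether the month of the index date is odd or even: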

In [67]: df['Odd_Even'] = ['Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
                           'Odd', 'Even', 'Odd']
In [68]: groups = df.groupby(['Quarter', 'Odd_Even'])
In [69]: groups.size()
Out[69]: Quarter  Odd_Even
         Q1       Even        1
                  Odd         2
         Q2       Even        2
                  Odd         1
         Q3       Even        1
                  Odd         2
         dtype: int64
In [70]: groups[['No1', 'No4']].aggregate([sum, np.mean])
Out[70]:                        No1                 No4
                                sum      mean       sum      mean
         Quarter Odd_Even
         Q1      Even      0.981321  0.981321 -1.070043 -1.070043
                 Odd      -1.939261 -0.969631  0.182727  0.091364
         Q2      Even      1.035387  0.517693 -0.946847 -0.473423
                 Odd      -0.531280 -0.531280 -1.118318 -1.118318
         Q3      Even     -0.326238 -0.326238 -1.443217 -1.443217
                 Odd      -0.571834 -0.285917  0.905609  0.452805
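In the transcript the Quarter and Odd_Even labels are typed in by hand. As a sketch of an alternative, both can be derived from the DatetimeIndex itself. The frame df_demo and its No1 values are made up for this example; only the nine month-end dates mirror the chapter's index.

```python
import numpy as np
import pandas as pd

# Month-end dates as in the chapter's index; values are hypothetical.
idx = pd.to_datetime(['2019-01-31', '2019-02-28', '2019-03-31',
                      '2019-04-30', '2019-05-31', '2019-06-30',
                      '2019-07-31', '2019-08-31', '2019-09-30'])
df_demo = pd.DataFrame({'No1': np.arange(1.0, 10.0)}, index=idx)

# Derive the grouping keys from the index instead of typing them in.
df_demo['Quarter'] = [f'Q{q}' for q in df_demo.index.quarter]
df_demo['Odd_Even'] = np.where(df_demo.index.month % 2 == 1, 'Odd', 'Even')

groups = df_demo.groupby(['Quarter', 'Odd_Even'])
sizes = groups.size()
print(sizes.loc[('Q1', 'Odd')])    # 2 (January and March)
print(sizes.loc[('Q2', 'Even')])   # 2 (April and June)

agg = groups['No1'].agg(['sum', 'mean'])
print(agg.loc[('Q1', 'Odd')].tolist())  # [4.0, 2.0]
```

The group sizes match Out[69] above, which suggests the derived labels agree with the hand-typed ones for this index.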

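This concludes the introduction to pandas and the use of DataFrame objects. Subsequent sections apply this tool set to real-world financial data.

Next part: Python for Finance, 2nd Edition (GPT Retranslation) (II) (4): https://developer.aliyun.com/article/1559310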
