Python for Finance, 2nd Edition (GPT Retranslation) (Part 4) (3)

Python for Finance, 2nd Edition (GPT Retranslation) (Part 4) (2): https://developer.aliyun.com/article/1559360

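Reading and Writing Text Files

Text processing can be considered a strength of Python. In fact, many corporate and scientific users rely on Python for exactly this task. Python offers multiple options for working with str objects as well as with text files in general.

Suppose a fairly large data set is to be shared as a comma-separated values (CSV) file. Although such files have a special internal structure, they are basically plain text files. The following code creates a dummy data set as an ndarray object, creates a DatetimeIndex object, combines the two, and stores the data as a CSV text file.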

In [26]: import pandas as pd
In [27]: rows = 5000  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
         a = np.random.standard_normal((rows, 5)).round(4)  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
In [28]: a  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
Out[28]: array([[-0.9627,  0.1326, -2.012 , -0.299 , -1.4554],
                [ 0.8918,  0.8904, -0.3396, -2.3485,  2.0913],
                [-0.1899, -0.9574,  1.0258,  0.6206, -2.4693],
                ...,
                [ 1.4688, -1.268 , -0.4778,  1.4315, -1.4689],
                [ 1.1162,  0.152 , -0.9363, -0.7869, -0.1147],
                [-0.699 ,  0.3206,  0.3659, -1.0282, -0.4151]])
In [29]: t = pd.date_range(start='2019/1/1', periods=rows, freq='H')  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
In [30]: t  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
Out[30]: DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
                        '2019-01-01 02:00:00', '2019-01-01 03:00:00',
                        '2019-01-01 04:00:00', '2019-01-01 05:00:00',
                        '2019-01-01 06:00:00', '2019-01-01 07:00:00',
                        '2019-01-01 08:00:00', '2019-01-01 09:00:00',
                        ...
                        '2019-07-27 22:00:00', '2019-07-27 23:00:00',
                        '2019-07-28 00:00:00', '2019-07-28 01:00:00',
                        '2019-07-28 02:00:00', '2019-07-28 03:00:00',
                        '2019-07-28 04:00:00', '2019-07-28 05:00:00',
                        '2019-07-28 06:00:00', '2019-07-28 07:00:00'],
                       dtype='datetime64[ns]', length=5000, freq='H')
In [31]: csv_file = open(path + 'data.csv', 'w')  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
In [32]: header = 'date,no1,no2,no3,no4,no5\n'  ![5](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/5.png)
In [33]: csv_file.write(header)  ![5](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/5.png)
Out[33]: 25
In [34]: for t_, (no1, no2, no3, no4, no5) in zip(t, a):  ![6](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/6.png)
             s = '{},{},{},{},{},{}\n'.format(t_, no1, no2, no3, no4, no5)  ![7](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/7.png)
             csv_file.write(s)  ![8](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/8.png)
In [35]: csv_file.close()
In [36]: ll $path*
         -rw-r--r--  1 yves  staff  284621 Jan 18 10:05 /Users/yves/Documents/Temp/data/data.csv

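Defines the number of rows for the data set.

Creates the ndarray object with the random numbers.

Creates a DatetimeIndex object of appropriate length (hourly intervals).

Opens a file for writing (w).

Defines the header row (column labels) and writes it as the first line.

Combines the data row-wise…

…into str objects…

…and writes them line by line (appending to the CSV text file).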
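The writing steps above can also be sketched with a `with` block and the standard `csv.writer`, which closes the file automatically and takes care of delimiters. This is a minimal sketch and not the author's code: the temporary directory, the small row count, and the seed are assumptions made only to keep the example self-contained.

```python
import csv
import os
import tempfile
from datetime import datetime, timedelta

import numpy as np

rows = 5  # deliberately small; the text uses 5,000 rows
rng = np.random.default_rng(100)  # assumed seed, for reproducibility only
a = rng.standard_normal((rows, 5)).round(4)

# Hourly timestamps built with the standard library instead of pandas.
start = datetime(2019, 1, 1)
t = [start + timedelta(hours=i) for i in range(rows)]

with tempfile.TemporaryDirectory() as tmp_dir:  # hypothetical location
    csv_path = os.path.join(tmp_dir, 'data.csv')

    # The with block closes the file even if an error occurs while writing.
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['date', 'no1', 'no2', 'no3', 'no4', 'no5'])
        for t_, values in zip(t, a):
            writer.writerow([t_, *values.tolist()])

    # Read the file back to check what was written.
    with open(csv_path, 'r', newline='') as f:
        written = f.read().splitlines()
```

Compared with manual string formatting, `csv.writer` also quotes fields that contain commas, which matters once the data is no longer purely numeric.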

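Reading works in a similar way. First, open the now-existing CSV file. Second, read its content line by line using the .readline() or .readlines() methods of the file object: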

In [37]: csv_file = open(path + 'data.csv', 'r')  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
In [38]: for i in range(5):
             print(csv_file.readline(), end='')  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
         date,no1,no2,no3,no4,no5
         2019-01-01 00:00:00,-0.9627,0.1326,-2.012,-0.299,-1.4554
         2019-01-01 01:00:00,0.8918,0.8904,-0.3396,-2.3485,2.0913
         2019-01-01 02:00:00,-0.1899,-0.9574,1.0258,0.6206,-2.4693
         2019-01-01 03:00:00,-0.0217,-0.7168,1.7875,1.6226,-0.4857
In [39]: csv_file.close()
In [40]: csv_file = open(path + 'data.csv', 'r')  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
In [41]: content = csv_file.readlines()  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
In [42]: content[:5]  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
Out[42]: ['date,no1,no2,no3,no4,no5\n',
          '2019-01-01 00:00:00,-0.9627,0.1326,-2.012,-0.299,-1.4554\n',
          '2019-01-01 01:00:00,0.8918,0.8904,-0.3396,-2.3485,2.0913\n',
          '2019-01-01 02:00:00,-0.1899,-0.9574,1.0258,0.6206,-2.4693\n',
          '2019-01-01 03:00:00,-0.0217,-0.7168,1.7875,1.6226,-0.4857\n']
In [43]: csv_file.close()

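Opens the file for reading (r).

Reads the file contents line by line and prints them.

Reads the file contents in a single step…

…the result of which is a list object with all lines as separate str objects.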
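A file object is itself an iterator over its lines, so a file can also be read lazily without loading everything via .readlines(). The following is a minimal sketch under assumed file contents (a hypothetical three-column file written to a temporary directory), not part of the original example.

```python
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:  # hypothetical location
    csv_path = os.path.join(tmp_dir, 'data.csv')

    # Write a header plus 1,000 data lines to have something to read.
    with open(csv_path, 'w') as f:
        f.write('date,no1,no2\n')
        for i in range(1000):
            f.write('2019-01-01 {:02d}:00:00,{},{}\n'.format(i % 24, i, -i))

    with open(csv_path, 'r') as f:
        # islice() pulls only the first three lines from the iterator.
        head = [line.rstrip('\n') for line in itertools.islice(f, 3)]
        # The same iterator continues where islice() stopped.
        remaining = sum(1 for _ in f)
```

Iterating this way keeps memory usage flat regardless of file size, which is the main reason to prefer it over .readlines() for large files.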

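CSV files are so important and commonplace that the Python standard library includes a csv module that simplifies working with them. Its two reader (iterator) objects yield each row as either a list object or a dict object; collecting the rows gives a list object of list objects or a list object of dict objects.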

In [44]: import csv
In [45]: with open(path + 'data.csv', 'r') as f:
             csv_reader = csv.reader(f)  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
             lines = [line for line in csv_reader]
In [46]: lines[:5]  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
Out[46]: [['date', 'no1', 'no2', 'no3', 'no4', 'no5'],
          ['2019-01-01 00:00:00', '-0.9627', '0.1326', '-2.012', '-0.299', '-1.4554'],
          ['2019-01-01 01:00:00', '0.8918', '0.8904', '-0.3396', '-2.3485', '2.0913'],
          ['2019-01-01 02:00:00', '-0.1899', '-0.9574', '1.0258', '0.6206', '-2.4693'],
          ['2019-01-01 03:00:00', '-0.0217', '-0.7168', '1.7875', '1.6226', '-0.4857']]
In [47]: with open(path + 'data.csv', 'r') as f:
             csv_reader = csv.DictReader(f)  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
             lines = [line for line in csv_reader]
In [48]: lines[:3]  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
Out[48]: [OrderedDict([('date', '2019-01-01 00:00:00'),
                       ('no1', '-0.9627'),
                       ('no2', '0.1326'),
                       ('no3', '-2.012'),
                       ('no4', '-0.299'),
                       ('no5', '-1.4554')]),
          OrderedDict([('date', '2019-01-01 01:00:00'),
                       ('no1', '0.8918'),
                       ('no2', '0.8904'),
                       ('no3', '-0.3396'),
                       ('no4', '-2.3485'),
                       ('no5', '2.0913')]),
          OrderedDict([('date', '2019-01-01 02:00:00'),
                       ('no1', '-0.1899'),
                       ('no2', '-0.9574'),
                       ('no3', '1.0258'),
                       ('no4', '0.6206'),
                       ('no5', '-2.4693')])]
In [49]: !rm -f $path*

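csv.reader() returns every line as a list object.

csv.DictReader() returns every line as an OrderedDict, which is a special case of a dict object (since Python 3.8, it returns a regular dict instead).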
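Both readers return every field as a str object, so numeric columns have to be converted explicitly. A minimal sketch of such a conversion, assuming a small in-memory sample with only two of the number columns:

```python
import csv
import io
from datetime import datetime

# Assumed sample in the same layout as data.csv (shortened to two columns).
raw = (
    'date,no1,no2\n'
    '2019-01-01 00:00:00,-0.9627,0.1326\n'
    '2019-01-01 01:00:00,0.8918,0.8904\n'
)

records = []
for row in csv.DictReader(io.StringIO(raw)):
    # Every value arrives as str; convert each field to a useful type.
    records.append({
        'date': datetime.strptime(row['date'], '%Y-%m-%d %H:%M:%S'),
        'no1': float(row['no1']),
        'no2': float(row['no2']),
    })
```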

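SQL Databases

Python can work with any kind of SQL database and, in general, with any kind of NoSQL database as well. SQL here stands for Structured Query Language. One SQL, or relational, database that ships with Python by default is SQLite3. It makes it easy to illustrate Python's basic approach to SQL databases:²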

In [50]: import sqlite3 as sq3
In [51]: con = sq3.connect(path + 'numbs.db')  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
In [52]: query = 'CREATE TABLE numbs (Date date, No1 real, No2 real)'  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
In [53]: con.execute(query)  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
Out[53]: <sqlite3.Cursor at 0x1054efb20>
In [54]: con.commit()  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
In [55]: q = con.execute  ![5](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/5.png)
In [56]: q('SELECT * FROM sqlite_master').fetchall()  ![6](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/6.png)
Out[56]: [('table',
           'numbs',
           'numbs',
           2,
           'CREATE TABLE numbs (Date date, No1 real, No2 real)')]

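Opens a database connection; a file is created if it does not exist.

This is a SQL query that creates a table with three columns.³

Executes the query…

…and commits the changes.

This defines a short alias for the con.execute() method.

This fetches metadata about the database, showing the just-created table as the single object.

Now that there is a database file with a table, the table can be populated with data. Each row consists of a datetime object and two float objects: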

In [57]: import datetime
In [58]: now = datetime.datetime.now()
         q('INSERT INTO numbs VALUES(?, ?, ?)', (now, 0.12, 7.3))  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
Out[58]: <sqlite3.Cursor at 0x1054efc70>
In [59]: np.random.seed(100)
In [60]: data = np.random.standard_normal((10000, 2)).round(4)  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
In [61]: %%time
         for row in data:  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
             now = datetime.datetime.now()
             q('INSERT INTO numbs VALUES(?, ?, ?)', (now, row[0], row[1]))
         con.commit()
         CPU times: user 111 ms, sys: 3.22 ms, total: 115 ms
         Wall time: 116 ms
In [62]: q('SELECT * FROM numbs').fetchmany(4)  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
Out[62]: [('2018-01-18 10:05:24.043286', 0.12, 7.3),
          ('2018-01-18 10:05:24.071921', -1.7498, 0.3427),
          ('2018-01-18 10:05:24.072110', 1.153, -0.2524),
          ('2018-01-18 10:05:24.072160', 0.9813, 0.5142)]
In [63]: q('SELECT * FROM numbs WHERE no1 > 0.5').fetchmany(4)  ![5](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/5.png)
Out[63]: [('2018-01-18 10:05:24.072110', 1.153, -0.2524),
          ('2018-01-18 10:05:24.072160', 0.9813, 0.5142),
          ('2018-01-18 10:05:24.072257', 0.6727, -0.1044),
          ('2018-01-18 10:05:24.072319', 1.619, 1.5416)]
In [64]: pointer = q('SELECT * FROM numbs')  ![6](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/6.png)
In [65]: for i in range(3):
             print(pointer.fetchone())  ![7](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/7.png)
         ('2018-01-18 10:05:24.043286', 0.12, 7.3)
         ('2018-01-18 10:05:24.071921', -1.7498, 0.3427)
         ('2018-01-18 10:05:24.072110', 1.153, -0.2524)
In [66]: rows = pointer.fetchall()  ![8](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/8.png)
         rows[:3]
Out[66]: [('2018-01-18 10:05:24.072160', 0.9813, 0.5142),
          ('2018-01-18 10:05:24.072184', 0.2212, -1.07),
          ('2018-01-18 10:05:24.072202', -0.1895, 0.255)]

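Writes a single row (or record) to the numbs table.

Creates a larger dummy data set as an ndarray object.

Iterates over the rows of the ndarray object.

Retrieves a number of rows from the table.

The same, but with a condition on the values in the No1 column.

Defines a pointer object…

…that behaves like a generator object.

.fetchall() retrieves all the remaining rows.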
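The generator-like behavior of the cursor can be reproduced in isolation with an in-memory database. In this sketch the four sample rows are made up, and timestamps are stored as ISO-formatted strings, an assumption made because the default datetime adapter of sqlite3 is deprecated as of Python 3.12.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database, nothing on disk
con.execute('CREATE TABLE numbs (Date text, No1 real, No2 real)')

# Assumed sample rows; dates are plain ISO strings.
sample = [
    ('2019-01-01 10:00:00', 0.12, 7.3),
    ('2019-01-01 10:01:00', -1.7498, 0.3427),
    ('2019-01-01 10:02:00', 1.153, -0.2524),
    ('2019-01-01 10:03:00', 0.9813, 0.5142),
]
con.executemany('INSERT INTO numbs VALUES (?, ?, ?)', sample)
con.commit()

cursor = con.execute('SELECT * FROM numbs ORDER BY Date')
first = cursor.fetchone()       # consumes one row
next_two = cursor.fetchmany(2)  # consumes the next two rows
rest = cursor.fetchall()        # only the remaining rows
exhausted = cursor.fetchone()   # nothing left to fetch
con.close()
```

Each fetch call advances the same cursor, which is why `.fetchall()` in the session above starts at the fourth row rather than the first.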

Finally, one might want to delete the table object in the database if it is no longer needed:

In [67]: q('DROP TABLE IF EXISTS numbs')  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
Out[67]: <sqlite3.Cursor at 0x1054eff80>
In [68]: q('SELECT * FROM sqlite_master').fetchall()  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
Out[68]: []
In [69]: con.close()  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
In [70]: !rm -f $path*  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)

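Removes the table from the database.

There are no table objects left after this operation.

Closes the database connection.

Removes the database file from disk.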
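Explicit calls to con.commit() and con.close() can be replaced by context managers. The sketch below, which is an alternative to the calls shown above and not the author's approach, uses the connection as a transaction scope and `contextlib.closing` to close it; the table layout and values are assumptions.

```python
import sqlite3
from contextlib import closing

# closing() guarantees con.close(); the connection's own context manager
# only manages transactions and does not close the connection.
with closing(sqlite3.connect(':memory:')) as con:
    con.execute('CREATE TABLE numbs (No1 real, No2 real)')

    with con:  # commits if the block succeeds
        con.execute('INSERT INTO numbs VALUES (?, ?)', (0.12, 7.3))

    try:
        with con:  # rolls back if the block raises
            con.execute('INSERT INTO numbs VALUES (?, ?)', (1.0, 2.0))
            raise RuntimeError('simulated failure')
    except RuntimeError:
        pass

    # Only the first insert survived the rollback.
    count = con.execute('SELECT COUNT(*) FROM numbs').fetchone()[0]

    con.execute('DROP TABLE IF EXISTS numbs')
    tables = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
```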

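SQL databases are a rather broad topic; indeed, too broad and complex to be covered in any significant way in this chapter. The basic messages are:

Python integrates well with almost any database technology.
The basic SQL syntax is mainly determined by the database in use; the rest is what is called "Pythonic."

A few more examples based on SQLite3 follow later.

Writing and Reading NumPy Arrays

NumPy itself has functions to write and read ndarray objects in a convenient and performant fashion. This saves effort in some circumstances, such as when NumPy dtype objects would otherwise have to be converted to specific database data types (e.g., for SQLite3). To illustrate that NumPy can be an efficient replacement for a SQL-based approach, the following code replicates the earlier example with NumPy.

Instead of pandas, the code uses the np.arange() function of NumPy to generate an ndarray object with datetime objects stored:⁴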

In [71]: dtimes = np.arange('2019-01-01 10:00:00', '2025-12-31 22:00:00',
                           dtype='datetime64[m]')  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
In [72]: len(dtimes)
Out[72]: 3681360
In [73]: dty = np.dtype([('Date', 'datetime64[m]'),
                         ('No1', 'f'), ('No2', 'f')])  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
In [74]: data = np.zeros(len(dtimes), dtype=dty)  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
In [75]: data['Date'] = dtimes  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
In [76]: a = np.random.standard_normal((len(dtimes), 2)).round(4)  ![5](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/5.png)
In [77]: data['No1'] = a[:, 0]  ![6](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/6.png)
         data['No2'] = a[:, 1]  ![6](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/6.png)
In [78]: data.nbytes  ![7](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/7.png)
Out[78]: 58901760

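Creates an ndarray object with datetime as the dtype.

The special dtype object for the record array.

The ndarray object instantiated with the special dtype.

This populates the Date column.

The dummy data set…

…which populates the No1 and No2 columns.

The size of the record array in bytes.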
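The reported size follows directly from the record layout: each record takes 8 bytes for the datetime64 value plus 4 bytes for each of the two float32 values, i.e., 16 bytes, and 3,681,360 records times 16 bytes gives the 58,901,760 bytes shown above. A small sketch with only five records (an assumed size, chosen for readability) illustrates this and shows field-based selection:

```python
import numpy as np

# Five one-minute timestamps instead of the full range used in the text.
dtimes = np.arange('2019-01-01T10:00', '2019-01-01T10:05',
                   dtype='datetime64[m]')

dty = np.dtype([('Date', 'datetime64[m]'), ('No1', 'f'), ('No2', 'f')])
data = np.zeros(len(dtimes), dtype=dty)
data['Date'] = dtimes
data['No1'] = np.arange(5)   # placeholder values instead of random numbers
data['No2'] = -np.arange(5)

itemsize = dty.itemsize      # bytes per record: 8 + 4 + 4
total = data.nbytes          # number of records times itemsize

# Fields are addressed by name, and boolean masks select whole records.
selected = data[data['No1'] > 2]
```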

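Saving ndarray objects is highly optimized and therefore quite fast. Almost 60 MB of data takes about 0.1 seconds to save to disk (here using an SSD). A larger ndarray object of 480 MB takes about 1 second to save to disk.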

In [79]: %time np.save(path + 'array', data)  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
         CPU times: user 4.06 ms, sys: 99.3 ms, total: 103 ms
         Wall time: 107 ms
In [80]: ll $path*  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
         -rw-r--r--  1 yves  staff  58901888 Jan 18 10:05 /Users/yves/Documents/Temp/data/array.npy
In [81]: %time np.load(path + 'array.npy')  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
         CPU times: user 1.81 ms, sys: 47.4 ms, total: 49.2 ms
         Wall time: 46.7 ms
Out[81]: array([('2019-01-01T10:00',  1.51310003,  0.69730002),
                ('2019-01-01T10:01', -1.722     , -0.4815    ),
                ('2019-01-01T10:02',  0.8251    ,  0.3019    ), ...,
                ('2025-12-31T21:57',  1.37199998,  0.64459997),
                ('2025-12-31T21:58', -1.25419998,  0.1612    ),
                ('2025-12-31T21:59', -1.1997    , -1.097     )],
               dtype=[('Date', '<M8[m]'), ('No1', '<f4'), ('No2', '<f4')])
In [82]: %time data = np.random.standard_normal((10000, 6000)).round(4)  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
         CPU times: user 2.81 s, sys: 354 ms, total: 3.17 s
         Wall time: 3.23 s
In [83]: data.nbytes  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
Out[83]: 480000000
In [84]: %time np.save(path + 'array', data)  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
         CPU times: user 23.9 ms, sys: 878 ms, total: 902 ms
         Wall time: 964 ms
In [85]: ll $path*  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
         -rw-r--r--  1 yves  staff  480000080 Jan 18 10:05 /Users/yves/Documents/Temp/data/array.npy
In [86]: %time np.load(path + 'array.npy')  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
         CPU times: user 1.95 ms, sys: 441 ms, total: 443 ms
         Wall time: 441 ms
Out[86]: array([[ 0.3066,  0.5951,  0.5826, ...,  1.6773,  0.4294, -0.2216],
                [ 0.8769,  0.7292, -0.9557, ...,  0.5084,  0.9635, -0.4443],
                [-1.2202, -2.5509, -0.0575, ..., -1.6128,  0.4662, -1.3645],
                ...,
                [-0.5598,  0.2393, -2.3716, ...,  1.7669,  0.2462,  1.035 ],
                [ 0.273 ,  0.8216, -0.0749, ..., -0.0552, -0.8396,  0.3077],
                [-0.6305,  0.8331,  1.3702, ...,  0.3493,  0.1981,  0.2037]])
In [87]: !rm -f $path*

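This saves the record ndarray object to disk.

The size on disk is almost the same as in memory (due to binary storage).

This loads the record ndarray object from disk.

A larger regular ndarray object.

These examples illustrate that writing to disk is mainly hardware-bound in this case: 480 MB written in about 1 second is roughly the advertised write speed of standard SSDs at the time of writing (512 MB/s).

In any case, this form of data storage and retrieval can be expected to be much faster than using a standard SQL database or serialization with the standard pickle library. There are two reasons: first, the data is mainly numeric; second, NumPy uses binary storage, which reduces the overhead to almost zero. Of course, this approach does not offer the functionality of a SQL database, but PyTables will help in this regard, as subsequent sections show.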
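The near-zero overhead of the binary format can be checked on a small scale. This sketch assumes a temporary directory and a much smaller array than in the text; the exact header size of a .npy file depends on the NumPy version, so only an upper bound is checked here.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(100)  # assumed seed, for reproducibility only
data = rng.standard_normal((1000, 50)).round(4)

with tempfile.TemporaryDirectory() as tmp_dir:  # hypothetical location
    npy_path = os.path.join(tmp_dir, 'array.npy')
    np.save(npy_path, data)
    size_on_disk = os.path.getsize(npy_path)
    loaded = np.load(npy_path)

# The file holds the raw bytes of the array plus a short header.
overhead = size_on_disk - data.nbytes
round_trip_ok = np.array_equal(loaded, data)
```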

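I/O with `pandas`

One of the major strengths of pandas is that it can natively read and write different data formats, including:

CSV (comma-separated values)
SQL (Structured Query Language)
XLS/XLSX (Microsoft Excel files)
JSON (JavaScript Object Notation)
HTML (HyperText Markup Language)

Table 9-1 lists the formats supported by pandas and the DataFrame class, together with the corresponding import and export functions/methods. The parameters that the import functions accept are listed and described in [Link to Come] (depending on the function, some other conventions might apply).

Table 9-1. Import and export functions and methods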

Format	Input	Output	Remark
`CSV`	`pd.read_csv()`	`.to_csv()`	Text file
`XLS/XLSX`	`pd.read_excel()`	`.to_excel()`	Spreadsheet
`HDF`	`pd.read_hdf()`	`.to_hdf()`	`HDF5` database
`SQL`	`pd.read_sql()`	`.to_sql()`	`SQL` table
`JSON`	`pd.read_json()`	`.to_json()`	JavaScript Object Notation
`MSGPACK`	`pd.read_msgpack()`	`.to_msgpack()`	Portable binary format (removed in pandas 1.0)
`HTML`	`pd.read_html()`	`.to_html()`	`HTML` code
`GBQ`	`pd.read_gbq()`	`.to_gbq()`	`Google Big Query` format
`DTA`	`pd.read_stata()`	`.to_stata()`	Formats 104, 105, 108, 113-115, 117
Any	`pd.read_clipboard()`	`.to_clipboard()`	E.g., from an `HTML` page
Any	`pd.read_pickle()`	`.to_pickle()`	(Structured) Python object
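The table follows one naming pattern: a top-level `pd.read_<format>()` function for import and a `.to_<format>()` method on the DataFrame for export. A minimal sketch of one such pair, using JSON and an in-memory buffer; the sample values and the `orient='split'` layout are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Assumed two-row sample with the column names used in this chapter.
df = pd.DataFrame({'No1': [0.4918, 0.4516], 'No2': [1.3707, 1.4445]})

# Export: a method on the DataFrame object returns the JSON text.
json_text = df.to_json(orient='split')

# Import: a top-level pandas function reads from a file-like object.
df_json = pd.read_json(io.StringIO(json_text), orient='split')
```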

The test case is again a larger set of float objects:

In [88]: data = np.random.standard_normal((1000000, 5)).round(4)
In [89]: data[:3]
Out[89]: array([[ 0.4918,  1.3707,  0.137 ,  0.3981, -1.0059],
                [ 0.4516,  1.4445,  0.0555, -0.0397,  0.44  ],
                [ 0.1629, -0.8473, -0.8223, -0.4621, -0.5137]])

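To this end, this section also revisits SQLite3 and compares its performance with alternative formats using pandas.

SQL Databases

Everything about SQLite3 should be familiar by now.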

In [90]: filename = path + 'numbers'
In [91]: con = sq3.Connection(filename + '.db')
In [92]: query = 'CREATE TABLE numbers (No1 real, No2 real,\
 No3 real, No4 real, No5 real)'  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
In [93]: q = con.execute
         qm = con.executemany
In [94]: q(query)
Out[94]: <sqlite3.Cursor at 0x1054e2260>

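Creates a table with five columns of real numbers (float objects).

This time, the .executemany() method can be applied since the data is available in a single ndarray object. Reading and working with the data works as before. Query results can also be visualized easily (see Figure 9-1).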

In [95]: %%time
         qm('INSERT INTO numbers VALUES (?, ?, ?, ?, ?)', data)  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
         con.commit()
         CPU times: user 7.16 s, sys: 147 ms, total: 7.3 s
         Wall time: 7.39 s
In [96]: ll $path*
         -rw-r--r--  1 yves  staff  52633600 Jan 18 10:05 /Users/yves/Documents/Temp/data/numbers.db
In [97]: %%time
         temp = q('SELECT * FROM numbers').fetchall()  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
         print(temp[:3])
         [(0.4918, 1.3707, 0.137, 0.3981, -1.0059), (0.4516, 1.4445, 0.0555, -0.0397, 0.44), (0.1629, -0.8473, -0.8223, -0.4621, -0.5137)]
         CPU times: user 1.86 s, sys: 138 ms, total: 2 s
         Wall time: 2.07 s
In [98]: %%time
         query = 'SELECT * FROM numbers WHERE No1 > 0 AND No2 < 0'
         res = np.array(q(query).fetchall()).round(3)  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
         CPU times: user 770 ms, sys: 73.9 ms, total: 844 ms
         Wall time: 854 ms
In [99]: res = res[::100]  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
         plt.figure(figsize=(10, 6))
         plt.plot(res[:, 0], res[:, 1], 'ro')  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
         plt.savefig('../../images/ch09/io_01.png');

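Inserts the whole data set into the table in a single step.

Retrieves all the rows from the table in a single step.

Retrieves a selection of the rows and transforms it to an ndarray object.

Plots a subset of the query result.

Figure 9-1. Scatter plot of the query result (selection)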
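The bulk insert and the conditional query can be verified against a pure NumPy selection. This sketch assumes an in-memory database, 1,000 rows instead of one million, and a fixed seed; the rows are converted with `.tolist()` so that plain Python floats are passed to SQLite3.

```python
import sqlite3

import numpy as np

rng = np.random.default_rng(100)  # assumed seed, for reproducibility only
data = rng.standard_normal((1000, 5)).round(4)

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE numbers '
            '(No1 real, No2 real, No3 real, No4 real, No5 real)')

# One call inserts all rows; each inner list supplies one row of parameters.
con.executemany('INSERT INTO numbers VALUES (?, ?, ?, ?, ?)', data.tolist())
con.commit()

query = ('SELECT * FROM numbers WHERE No1 > 0 AND No2 < 0 '
         'ORDER BY rowid')  # explicit order to match insertion order
res = np.array(con.execute(query).fetchall())
con.close()

# The same selection expressed with a boolean mask on the ndarray.
expected = data[(data[:, 0] > 0) & (data[:, 1] < 0)]
```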

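From SQL to pandas

A generally more efficient approach, however, is to read either whole tables or query results with pandas. When a whole table fits into memory, analytical queries can generally be executed much faster than with the disk-based SQL approach.

Reading the whole table with pandas takes roughly the same amount of time as reading it into a NumPy ndarray object. In both cases, the bottleneck is the SQL database.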

In [100]: %time data = pd.read_sql('SELECT * FROM numbers', con)  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
          CPU times: user 2.11 s, sys: 175 ms, total: 2.29 s
          Wall time: 2.33 s
In [101]: data.head()
Out[101]:       No1     No2     No3     No4     No5
          0  0.4918  1.3707  0.1370  0.3981 -1.0059
          1  0.4516  1.4445  0.0555 -0.0397  0.4400
          2  0.1629 -0.8473 -0.8223 -0.4621 -0.5137
          3  1.3064  0.9125  0.5142 -0.7868 -0.3398
          4 -0.1148 -1.5215 -0.7045 -1.0042 -0.0600

Reads all the rows of the table into the DataFrame object named data.
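pd.read_sql() also accepts a query with placeholders, so a filtered result can be loaded directly into a DataFrame. A minimal sketch, assuming an in-memory SQLite3 database with a two-column table and three made-up rows:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE numbers (No1 real, No2 real)')
con.executemany('INSERT INTO numbers VALUES (?, ?)',
                [(0.4918, 1.3707), (0.4516, 1.4445), (0.1629, -0.8473)])
con.commit()

# The params argument fills the ? placeholder, as with con.execute().
df = pd.read_sql('SELECT * FROM numbers WHERE No2 > ?', con, params=(0,))
con.close()
```

The column labels of the resulting DataFrame are taken from the table definition, so no header handling is needed.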

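The data is now in memory, which allows for much faster analytics. The speedup is often an order of magnitude or more. pandas can also handle more complex queries, although it is neither meant nor able to replace SQL databases when it comes to complex relational data structures. The result of a query combining multiple conditions is shown in Figure 9-2.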

In [102]: %time data[(data['No1'] > 0) & (data['No2'] < 0)].head()  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
          CPU times: user 19.4 ms, sys: 9.56 ms, total: 28.9 ms
          Wall time: 27.5 ms
Out[102]:        No1     No2     No3     No4     No5
          2   0.1629 -0.8473 -0.8223 -0.4621 -0.5137
          5   0.1893 -0.0207 -0.2104  0.9419  0.2551
          8   1.4784 -0.3333 -0.7050  0.3586 -0.3937
          10  0.8092 -0.9899  1.0364 -1.0453  0.0579
          11  0.9065 -0.7757 -0.9267  0.7797  0.0863
In [103]: %%time
          res = data[['No1', 'No2']][((data['No1'] > 0.5) | (data['No1'] < -0.5))
                               & ((data['No2'] < -1) | (data['No2'] > 1))]  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
          CPU times: user 20.6 ms, sys: 9.18 ms, total: 29.8 ms
          Wall time: 28 ms
In [104]: plt.figure(figsize=(10, 6))
          plt.plot(res['No1'], res['No2'], 'ro');
          plt.savefig('../../images/ch09/io_02.png');

Two conditions combined logically.

Four conditions combined logically.

Figure 9-2. Scatter plot of the query result (selection)
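The four-condition selection can also be written with DataFrame.query(), which reads closer to the SQL WHERE clause it replaces. This is an alternative formulation rather than the author's code; the data set is an assumed 1,000-row sample with a fixed seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)  # assumed seed, for reproducibility only
data = pd.DataFrame(rng.standard_normal((1000, 5)).round(4),
                    columns=['No1', 'No2', 'No3', 'No4', 'No5'])

# Boolean masks combined with | (or) and & (and), as in the text.
mask = (((data['No1'] > 0.5) | (data['No1'] < -0.5))
        & ((data['No2'] < -1) | (data['No2'] > 1)))
res_mask = data.loc[mask, ['No1', 'No2']]

# The same selection as a query string.
res_query = data.query(
    '(No1 > 0.5 or No1 < -0.5) and (No2 < -1 or No2 > 1)')[['No1', 'No2']]
```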

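As expected, using the in-memory analytics capabilities of pandas leads to a significant speedup, provided pandas is able to replicate the respective SQL statement.

This is not the only advantage of pandas, since pandas is tightly integrated with PyTables, among others, which is the topic of a subsequent section. For now, it suffices to know that the combination of the two can speed up I/O operations considerably, as the following shows: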

In [105]: h5s = pd.HDFStore(filename + '.h5s', 'w')  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
In [106]: %time h5s['data'] = data  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
          CPU times: user 33 ms, sys: 43.3 ms, total: 76.3 ms
          Wall time: 85.8 ms
In [107]: h5s  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
Out[107]: <class 'pandas.io.pytables.HDFStore'>
          File path: /Users/yves/Documents/Temp/data/numbers.h5s
In [108]: h5s.close()  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)

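Opens an HDF5 database file for writing; in pandas, this creates an HDFStore object.

The complete DataFrame object is stored in the database file via binary storage.

Information about the HDFStore object.

Closes the database file.

The whole DataFrame with all the data from the original SQL table is written much faster than with SQLite3. Reading is even faster: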

In [109]: %%time
          h5s = pd.HDFStore(filename + '.h5s', 'r')  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
          data_ = h5s['data']  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
          h5s.close()  ![3](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/3.png)
          CPU times: user 8.24 ms, sys: 21.2 ms, total: 29.4 ms
          Wall time: 28.5 ms
In [110]: data_ is data  ![4](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/4.png)
Out[110]: False
In [111]: (data_ == data).all()  ![5](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/5.png)
Out[111]: No1    True
          No2    True
          No3    True
          No4    True
          No5    True
          dtype: bool
In [112]: np.allclose(data_, data)  ![5](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/5.png)
Out[112]: True
In [113]: ll $path*  ![6](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/6.png)
          -rw-r--r--  1 yves  staff  52633600 Jan 18 10:05 /Users/yves/Documents/Temp/data/numbers.db
          -rw-r--r--  1 yves  staff  48007192 Jan 18 10:05 /Users/yves/Documents/Temp/data/numbers.h5s

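Opens the HDF5 database file for reading.

The DataFrame is read and stored in memory as data_.

Closes the database file.

The two DataFrame objects are not the same object.

However, they contain the same data.

Binary storage generally comes with less size overhead compared to SQL tables.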
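The difference between object identity and data equality, checked above with `is` and `==`, can be shown without any file I/O. In this sketch a copy stands in for the DataFrame read back from disk, which is an assumption made to keep the example free of the PyTables dependency.

```python
import pandas as pd

data = pd.DataFrame({'No1': [0.4918, 0.4516], 'No2': [1.3707, 1.4445]})
data_ = data.copy()  # stands in for a frame read back from disk

same_object = data_ is data          # identity: two distinct objects
same_values = data_.equals(data)     # equality of shape, labels, and values
elementwise = bool((data_ == data).all().all())  # reduce over both axes
```

`.equals()` gives a single bool and treats NaN values in the same position as equal, whereas `==` compares element by element and needs to be reduced.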

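Data in CSV Files

One of the most widely used formats for exchanging financial data is the CSV format. Although it is not really standardized, it can be processed on any platform and by the vast majority of applications concerned with data and financial analytics. An earlier section showed how to write data to and read data from CSV files with standard Python functionality (see "Reading and Writing Text Files"). pandas makes this whole procedure more convenient, the code more concise, and the execution generally faster (see also Figure 9-3):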

In [114]: %time data.to_csv(filename + '.csv')  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
          CPU times: user 6.82 s, sys: 277 ms, total: 7.1 s
          Wall time: 7.54 s
In [115]: ll $path
          total 282184
          -rw-r--r--  1 yves  staff  43834157 Jan 18 10:05 numbers.csv
          -rw-r--r--  1 yves  staff  52633600 Jan 18 10:05 numbers.db
          -rw-r--r--  1 yves  staff  48007192 Jan 18 10:05 numbers.h5s
In [116]: %time df = pd.read_csv(filename + '.csv')  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
          CPU times: user 1.4 s, sys: 124 ms, total: 1.53 s
          Wall time: 1.58 s
In [117]: df[['No1', 'No2', 'No3', 'No4']].hist(bins=20, figsize=(10, 6));
          plt.savefig('../../images/ch09/io_03.png');

The .to_csv() method writes the DataFrame data to disk in CSV format.

pd.read_csv() then reads it back into memory as a new DataFrame object.

Figure 9-3. Histograms for selected columns
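One detail of this round trip is easy to miss: .to_csv() writes the index as an unnamed first column by default, and pd.read_csv() then reads it back as a regular column unless told otherwise. A minimal sketch with an in-memory buffer and assumed sample values:

```python
import io

import pandas as pd

data = pd.DataFrame({'No1': [0.4918, 0.4516], 'No2': [1.3707, 1.4445]})

buffer = io.StringIO()
data.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a data column

buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)  # restore it as the index
```

Passing `index=False` to .to_csv() is the alternative when the index carries no information.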

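Data in Excel Files

Although working with Excel spreadsheets is the topic of a later chapter of this book, the following code briefly demonstrates how pandas can write data in Excel format and read data from Excel spreadsheets. The data set is restricted to 100,000 rows in this case: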

In [118]: %time data[:100000].to_excel(filename + '.xlsx')  ![1](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/1.png)
          CPU times: user 23.2 s, sys: 498 ms, total: 23.7 s
          Wall time: 23.9 s
In [119]: %time df = pd.read_excel(filename + '.xlsx', 'Sheet1')  ![2](https://gitee.com/OpenDocCN/ibooker-quant-zh/raw/master/docs/py-fin-2e/img/2.png)
          CPU times: user 5.47 s, sys: 74.7 ms, total: 5.54 s
          Wall time: 5.57 s
In [120]: df.cumsum().plot(figsize=(10, 6));
          plt.savefig('../../images/ch09/io_04.png');
In [121]: ll $path*
          -rw-r--r--  1 yves  staff  43834157 Jan 18 10:05 /Users/yves/Documents/Temp/data/numbers.csv
          -rw-r--r--  1 yves  staff  52633600 Jan 18 10:05 /Users/yves/Documents/Temp/data/numbers.db
          -rw-r--r--  1 yves  staff  48007192 Jan 18 10:05 /Users/yves/Documents/Temp/data/numbers.h5s
          -rw-r--r--  1 yves  staff   4032639 Jan 18 10:06 /Users/yves/Documents/Temp/data/numbers.xlsx
In [122]: rm -f $path*

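The .to_excel() method writes the DataFrame data to disk in XLSX format.

pd.read_excel() then reads it back into memory as a new DataFrame object, also specifying the sheet to read from.

Figure 9-4. Line plots for all columns

Generating the Excel spreadsheet file with a smaller subset of the data takes quite a while. This illustrates the overhead that the spreadsheet structure brings with it.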

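Inspecting the generated files shows that the HDFStore file (about 48.0 MB) is more compact than the SQLite3 database (about 52.6 MB); compression, described later in this chapter, would increase the advantage further. In this particular run, the CSV file is smaller still (about 43.8 MB), since values rounded to four decimals are short when written as text. The slower performance with CSV files therefore comes mainly from the fact that they are "only" plain text files that have to be formatted on writing and parsed on reading.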

Python for Finance, 2nd Edition (GPT Retranslation) (Part 4) (4): https://developer.aliyun.com/article/1559369
