Pandas IO Tools in Depth: Reading and Writing Excel, HDF5, and Data Persistence

Continued from: Pandas 2.2 Official Tutorials and Guides, Chinese edition (10·1) (3): https://developer.aliyun.com/article/1509774

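Excel files

The read_excel() method can read Excel 2007+ (.xlsx) files using the openpyxl Python module. Excel 2003 (.xls) files can be read using xlrd, and binary Excel (.xlsb) files using pyxlsb. All of these formats can also be read using the calamine engine. The to_excel() instance method saves a DataFrame to Excel. The semantics are generally similar to working with csv data. See the cookbook for some advanced strategies.

Note

When engine=None, the following logic is used to determine the engine:

If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise, if path_or_buffer is in xls format, xlrd will be used.
Otherwise, if path_or_buffer is in xlsb format, pyxlsb will be used.
Otherwise, openpyxl will be used.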
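
The fallback order above can be sketched as a small lookup. This is only an illustration: the helper name `guess_engine` is hypothetical, and it decides purely from the file extension, which is an assumption made for brevity (pandas may inspect the file contents instead).

```py
from pathlib import Path

# Extension-to-engine table taken from the rules listed above.
# Anything that does not match falls through to openpyxl.
_ENGINE_BY_SUFFIX = {
    ".odf": "odf",
    ".ods": "odf",
    ".odt": "odf",
    ".xls": "xlrd",
    ".xlsb": "pyxlsb",
}


def guess_engine(path: str) -> str:
    """Hypothetical helper: pick a reader engine from the file extension."""
    suffix = Path(path).suffix.lower()
    return _ENGINE_BY_SUFFIX.get(suffix, "openpyxl")


for name in ["report.xlsx", "legacy.xls", "big.xlsb", "sheet.ods"]:
    print(name, "->", guess_engine(name))
```

Passing `engine=` explicitly to read_excel bypasses this kind of guessing altogether.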

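Reading Excel files

In the most basic use case, read_excel takes a path to an Excel file and the sheet_name indicating which sheet to parse.

When the engine_kwargs parameter is used, pandas passes these arguments on to the engine, so it is important to know which function pandas uses internally.

For the openpyxl engine, pandas uses openpyxl.load_workbook() to read (.xlsx) and (.xlsm) files.
For the xlrd engine, pandas uses xlrd.open_workbook() to read (.xls) files.
For the pyxlsb engine, pandas uses pyxlsb.open_workbook() to read (.xlsb) files.
For the odf engine, pandas uses odf.opendocument.load() to read (.ods) files.
For the calamine engine, pandas uses python_calamine.load_workbook() to read (.xlsx), (.xlsm), (.xls), (.xlsb), and (.ods) files.

```py
# Returns a DataFrame
pd.read_excel("path_to_file.xls", sheet_name="Sheet1")
```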

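`ExcelFile` class

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and can be passed to read_excel. There is a performance benefit when reading multiple sheets, because the file is read into memory only once.

```py
xlsx = pd.ExcelFile("path_to_file.xls")
df = pd.read_excel(xlsx, "Sheet1")
```

The ExcelFile class can also be used as a context manager.

```py
with pd.ExcelFile("path_to_file.xls") as xls:
    df1 = pd.read_excel(xls, "Sheet1")
    df2 = pd.read_excel(xls, "Sheet2")
```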

The sheet_names property returns a list of the sheet names in the file.

The primary use case for an ExcelFile is parsing multiple sheets with different parameters:

```py
data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1)
```

Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed to read_excel with no loss in performance.

```py
# using the ExcelFile class
data = {}
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=None, na_values=["NA"])
# equivalent using the read_excel function
data = pd.read_excel(
    "path_to_file.xls", ["Sheet1", "Sheet2"], index_col=None, na_values=["NA"]
)
```

ExcelFile can also be called with an xlrd.book.Book object as a parameter. This lets the user control how the Excel file is read. For example, sheets can be loaded on demand by calling xlrd.open_workbook() with on_demand=True.

```py
import xlrd
xlrd_book = xlrd.open_workbook("path_to_file.xls", on_demand=True)
with pd.ExcelFile(xlrd_book) as xls:
    df1 = pd.read_excel(xls, "Sheet1")
    df2 = pd.read_excel(xls, "Sheet2")
```
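#### Specifying sheets

Note

The second argument is `sheet_name`, not to be confused with `ExcelFile.sheet_names`.

Note

The `sheet_names` attribute of an ExcelFile provides access to a list of sheets.

+   The `sheet_name` argument specifies which sheet or sheets to read.
+   The default value for `sheet_name` is 0, meaning the first sheet is read.
+   Pass a string to refer to the name of a particular sheet in the workbook.
+   Pass an integer to refer to the index of a sheet. Indices follow the Python convention and start at 0.
+   Pass a list of strings or integers to get a dictionary of the specified sheets.
+   Pass `None` to get a dictionary of all available sheets.

```py
# Returns a DataFrame
pd.read_excel("path_to_file.xls", "Sheet1", index_col=None, na_values=["NA"])
```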

Using the sheet index:

```py
# Returns a DataFrame
pd.read_excel("path_to_file.xls", 0, index_col=None, na_values=["NA"])
```

Using all default values:

```py
# Returns a DataFrame
pd.read_excel("path_to_file.xls")
```

Using None to get all sheets:

```py
# Returns a dictionary of DataFrames
pd.read_excel("path_to_file.xls", sheet_name=None)
```

Using a list to get multiple sheets:

```py
# Returns the 1st and 4th sheet, as a dictionary of DataFrames.
pd.read_excel("path_to_file.xls", sheet_name=["Sheet1", 3])
```

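read_excel can read more than one sheet by setting sheet_name to a list of sheet names, a list of sheet positions, or None to read all sheets. Sheets can be specified by sheet index or sheet name, using an integer or a string, respectively.

#### Reading a MultiIndex

read_excel can read a MultiIndex index by passing a list of columns to index_col, and MultiIndex columns by passing a list of rows to header. If either the index or the columns have serialized level names, those are read in as well by specifying the rows/columns that make up the levels.

For example, to read in a MultiIndex index without names:

```py
In [424]: df = pd.DataFrame(
 .....:    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
 .....:    index=pd.MultiIndex.from_product([["a", "b"], ["c", "d"]]),
 .....: )
 .....: 
In [425]: df.to_excel("path_to_file.xlsx")
In [426]: df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1])
In [427]: df
Out[427]: 
     a  b
a c  1  5
  d  2  6
b c  3  7
  d  4  8
```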

If the index has level names, they are parsed as well, using the same parameters.

```py
In [428]: df.index = df.index.set_names(["lvl1", "lvl2"])
In [429]: df.to_excel("path_to_file.xlsx")
In [430]: df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1])
In [431]: df
Out[431]: 
           a  b
lvl1 lvl2      
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8
```

If the source file has both a MultiIndex index and MultiIndex columns, lists specifying each should be passed to index_col and header:

```py
In [432]: df.columns = pd.MultiIndex.from_product([["a"], ["b", "d"]], names=["c1", "c2"])
In [433]: df.to_excel("path_to_file.xlsx")
In [434]: df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1], header=[0, 1])
In [435]: df
Out[435]: 
c1         a   
c2         b  d
lvl1 lvl2      
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8
```

Missing values in the columns specified in index_col are forward filled, so that files written by to_excel with merge_cells=True round-trip correctly. To avoid forward filling the missing values, use set_index after reading the data instead of index_col.
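
The difference between the two approaches can be illustrated without an Excel file. The sketch below uses a hypothetical in-memory frame `raw` standing in for a sheet whose merged index cells came back blank; it approximates the behaviour described above and is not the reader's actual implementation.

```py
import pandas as pd

# Hypothetical raw sheet contents: merged cells in "lvl1" come back as blanks.
raw = pd.DataFrame(
    {
        "lvl1": ["a", None, "b", None],
        "lvl2": ["c", "d", "c", "d"],
        "a": [1, 2, 3, 4],
    }
)

# Roughly what index_col=[0, 1] gives: index columns are forward filled first.
filled = raw.copy()
filled["lvl1"] = filled["lvl1"].ffill()
with_ffill = filled.set_index(["lvl1", "lvl2"])

# What set_index after reading gives: the blanks stay as missing labels.
without_ffill = raw.set_index(["lvl1", "lvl2"])

print(with_ffill.index.tolist())
print(without_ffill.index.get_level_values(0).isna().sum(), "missing labels kept")
```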

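Parsing specific columns

Users often insert columns in Excel for temporary computations, and you may not want to read those columns in. read_excel takes a usecols keyword that lets you specify a subset of columns to parse.

You can specify a comma-delimited set of Excel columns and ranges as a string:

```py
pd.read_excel("path_to_file.xls", "Sheet1", usecols="A,C:E")
```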
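
To make the string form concrete, here is a minimal sketch of how a specification such as "A,C:E" maps to zero-based column positions. The helper names `column_letters_to_index` and `expand_usecols` are hypothetical and exist only for this illustration; pandas performs its own parsing internally.

```py
def column_letters_to_index(letters: str) -> int:
    """Convert Excel column letters to a zero-based index (A -> 0, AA -> 26)."""
    letters = letters.strip().upper()
    if not letters or not all("A" <= ch <= "Z" for ch in letters):
        raise ValueError(f"not an Excel column name: {letters!r}")
    index = 0
    for ch in letters:
        # Column names are a base-26 number with digits A..Z meaning 1..26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def expand_usecols(spec: str) -> list:
    """Expand a spec like 'A,C:E' into sorted zero-based column indices."""
    cols = set()
    for part in spec.split(","):
        if ":" in part:
            start, end = part.split(":")
            first = column_letters_to_index(start)
            last = column_letters_to_index(end)
            cols.update(range(first, last + 1))  # ranges include both ends
        else:
            cols.add(column_letters_to_index(part))
    return sorted(cols)


print(expand_usecols("A,C:E"))
```

Under this reading, "A,C:E" selects the same columns as the integer list [0, 2, 3, 4].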

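If usecols is a list of integers, it is assumed to be the file column indices to parse.

```py
pd.read_excel("path_to_file.xls", "Sheet1", usecols=[0, 2, 3])
```

Element order is ignored, so usecols=[0, 1] is the same as [1, 0].

If usecols is a list of strings, each string is assumed to correspond to a column name, either provided by the user in names or inferred from the document header row(s). These strings define which columns are parsed:

```py
pd.read_excel("path_to_file.xls", "Sheet1", usecols=["foo", "bar"])
```

Element order is ignored, so usecols=['baz', 'joe'] is the same as ['joe', 'baz'].

If usecols is callable, the callable is evaluated against the column names, and the names for which it returns True are kept.

```py
pd.read_excel("path_to_file.xls", "Sheet1", usecols=lambda x: x.isalpha())
```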

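Parsing dates

Datetime-like values are normally converted automatically to the appropriate dtype when reading an Excel file. But if you have a column of strings that look like dates (but are not actually formatted as dates in Excel), you can use the parse_dates keyword to parse those strings to datetimes:

```py
pd.read_excel("path_to_file.xls", "Sheet1", parse_dates=["date_strings"])
```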
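
The effect on such a column can be seen with an in-memory frame. This sketch assumes a hypothetical column `date_strings` that was read as plain text, and converts it afterwards with pd.to_datetime, which gives a comparable result to asking for the conversion at read time.

```py
import pandas as pd

# Hypothetical data: the dates arrived as text, not as Excel date cells.
df = pd.DataFrame(
    {"date_strings": ["2024-01-15", "2024-02-29"], "value": [1, 2]}
)
before = df["date_strings"].dtype

# Convert the text column to datetimes after the fact.
df["date_strings"] = pd.to_datetime(df["date_strings"])

print(before, "->", df["date_strings"].dtype)
print(df["date_strings"].dt.month.tolist())
```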

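Cell converters

The contents of Excel cells can be transformed via the converters option. For instance, to convert a column to boolean:

```py
pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyBools": bool})
```

This option handles missing values and treats exceptions raised in the converters as missing data. Transformations are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For instance, a column of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float. You can manually mask missing data to recover integer dtype:

```py
def cfun(x):
    # Empty (falsy) cells become the sentinel -1 so the column stays integer.
    return int(x) if x else -1
pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyInts": cfun})
```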
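
One caveat with the truthiness test in cfun is that a genuine 0 is also falsy and would be replaced by -1. A possible variant, shown here as a sketch with the hypothetical name `to_int_or_default`, treats only None, empty strings, and NaN as missing:

```py
import math


def cfun(x):
    # The converter from the text: any falsy cell becomes -1.
    return int(x) if x else -1


def to_int_or_default(x, default=-1):
    """Hypothetical converter: only None, '' and NaN count as missing."""
    if x is None or x == "":
        return default
    if isinstance(x, float) and math.isnan(x):
        return default
    return int(x)


for cell in [7.0, 0, "", float("nan")]:
    print(repr(cell), cfun(cell) if cell == cell else "n/a", to_int_or_default(cell))
```

It would be passed the same way, for example `converters={"MyInts": to_int_or_default}`.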

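Dtype specifications

As an alternative to converters, the type for an entire column can be specified using the dtype keyword, which takes a dictionary mapping column names to types. To interpret data with no type inference, use the type str or object.

```py
pd.read_excel("path_to_file.xls", dtype={"MyInts": "int64", "MyText": str})
```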
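### Writing Excel files

#### Writing Excel files to disk

To write a `DataFrame` object to a sheet of an Excel file, you can use the `to_excel` instance method. The arguments are largely the same as for `to_csv`: the first argument is the name of the Excel file, and the optional second argument is the name of the sheet to which the `DataFrame` should be written. For example:

```py
df.to_excel("path_to_file.xlsx", sheet_name="Sheet1")
```

Files with a .xlsx extension are written using xlsxwriter (if available) or openpyxl.

The DataFrame is written in a way that tries to mimic the REPL output. The index_label is placed in the second row instead of the first. You can place it in the first row by setting the merge_cells option in to_excel() to False.

```py
df.to_excel("path_to_file.xlsx", index_label="label", merge_cells=False)
```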

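To write separate DataFrames to different sheets of a single Excel file, pass an ExcelWriter.

```py
with pd.ExcelWriter("path_to_file.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")
```

When the engine_kwargs parameter is used, pandas passes these arguments on to the engine, so it is important to know which function pandas uses internally.

For the openpyxl engine, pandas uses openpyxl.Workbook() to create a new sheet and openpyxl.load_workbook() to append data to an existing sheet. The openpyxl engine writes (.xlsx) and (.xlsm) files.
For the xlsxwriter engine, pandas uses xlsxwriter.Workbook() to write (.xlsx) files.
For the odf engine, pandas uses odf.opendocument.OpenDocumentSpreadsheet() to write (.ods) files.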

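Writing Excel files to memory

pandas supports writing Excel files to buffer-like objects such as BytesIO using ExcelWriter.

```py
from io import BytesIO
bio = BytesIO()
# By setting the 'engine' in the ExcelWriter constructor.
# ExcelWriter.save() was removed in pandas 2.0; leaving the with block
# saves the workbook into the buffer.
with pd.ExcelWriter(bio, engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Sheet1")
# Seek to the beginning and read to copy the workbook to a variable in memory
bio.seek(0)
workbook = bio.read()
```

Note

engine is optional but recommended. Setting the engine determines the version of workbook produced. Using 'openpyxl' or 'xlsxwriter' produces an Excel 2007-format workbook (xlsx). xlrd is a read-only engine and cannot be used for writing; since the xlwt writer was removed in pandas 2.0, pandas no longer writes Excel 2003-format workbooks (xls). If engine is omitted, an Excel 2007-format workbook is produced.

### Excel writer engines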

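pandas chooses an Excel writer via two methods:

the engine keyword argument
the filename extension (via the default specified in config options)

By default, pandas uses XlsxWriter for .xlsx and openpyxl for .xlsm. If you have multiple engines installed, you can set the default engine through the config options io.excel.xlsx.writer and io.excel.xlsm.writer. pandas falls back on openpyxl for .xlsx files if Xlsxwriter is not available.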
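
The extension-based default, including the fallback for .xlsx, can be sketched as follows. The function `default_writer` and its `is_installed` parameter are hypothetical names used for illustration only, and the sketch ignores any config options that may override the default.

```py
import importlib.util
from pathlib import Path


def default_writer(path: str, is_installed=None) -> str:
    """Hypothetical helper: default writer engine for a file extension."""
    if is_installed is None:
        # Check whether a module can be imported without importing it.
        def is_installed(module: str) -> bool:
            return importlib.util.find_spec(module) is not None

    suffix = Path(path).suffix.lower()
    if suffix == ".xlsx":
        # Prefer XlsxWriter, fall back on openpyxl when it is missing.
        return "xlsxwriter" if is_installed("xlsxwriter") else "openpyxl"
    if suffix == ".xlsm":
        return "openpyxl"
    if suffix == ".ods":
        return "odf"
    raise ValueError(f"no default writer for {suffix!r} in this sketch")


print(default_writer("report.xlsx"))
print(default_writer("macros.xlsm"))
```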

To specify which writer you want to use, you can pass an engine keyword argument to to_excel and to ExcelWriter. The built-in engines are:

openpyxl: version 2.4 or higher is required
xlsxwriter

```py
# By setting the 'engine' in the DataFrame 'to_excel()' methods.
df.to_excel("path_to_file.xlsx", sheet_name="Sheet1", engine="xlsxwriter")
# By setting the 'engine' in the ExcelWriter constructor.
writer = pd.ExcelWriter("path_to_file.xlsx", engine="xlsxwriter")
# Or via pandas configuration.
from pandas import options  # noqa: E402
options.io.excel.xlsx.writer = "xlsxwriter"
df.to_excel("path_to_file.xlsx", sheet_name="Sheet1")
```
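### Style and formatting

The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the `DataFrame`'s `to_excel` method.

+   `float_format`: Format string for floating point numbers (default `None`).
+   `freeze_panes`: A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) freezes the first row and first column (default `None`).

Using the [Xlsxwriter](https://xlsxwriter.readthedocs.io) engine provides many options for controlling the format of an Excel worksheet created with the `to_excel` method. Excellent examples can be found in the [Xlsxwriter](https://xlsxwriter.readthedocs.io) documentation: [`xlsxwriter.readthedocs.io/working_with_pandas.html`](https://xlsxwriter.readthedocs.io/working_with_pandas.html)

## OpenDocument Spreadsheets

The io methods for Excel files also support reading and writing OpenDocument spreadsheets using the [odfpy](https://pypi.org/project/odfpy/) module. The semantics and features for reading and writing OpenDocument spreadsheets match what can be done for Excel files using `engine='odf'`. The optional dependency 'odfpy' needs to be installed.

The `read_excel()` method can read OpenDocument spreadsheets:

```py
# Returns a DataFrame
pd.read_excel("path_to_file.ods", engine="odf")
```

Similarly, the to_excel() method can write OpenDocument spreadsheets:

```py
# Writes DataFrame to a .ods file
df.to_excel("path_to_file.ods", engine="odf")
```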
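## Binary Excel (.xlsb) files

The `read_excel()` method can also read binary Excel files using the `pyxlsb` module. The semantics and features for reading binary Excel files mostly match what can be done for Excel files using `engine='pyxlsb'`. `pyxlsb` does not recognize datetime types in files and returns floats instead (you can use calamine if you need datetime types to be recognized).

```py
# Returns a DataFrame
pd.read_excel("path_to_file.xlsb", engine="pyxlsb")
```

Note

Currently pandas only supports reading binary Excel files. Writing is not implemented.

## Calamine (Excel and ODS files)

The read_excel() method can read Excel files (.xlsx, .xlsm, .xls, .xlsb) and OpenDocument spreadsheets (.ods) using the python-calamine module. This module is a binding for the Rust library calamine and is faster than the other engines in most cases. The optional dependency python-calamine needs to be installed.

```py
# Returns a DataFrame
pd.read_excel("path_to_file.xlsb", engine="calamine")
```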
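## Clipboard

A handy way to grab data is to use the `read_clipboard()` method, which takes the contents of the clipboard buffer and passes them to the `read_csv` method. For instance, you can copy the following text to the clipboard (CTRL-C on many operating systems):

```
  A B C
x 1 4 p
y 2 5 q
z 3 6 r
```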

And then import the data directly into a DataFrame by calling:

>>> clipdf = pd.read_clipboard()
>>> clipdf
 A B C
x 1 4 p
y 2 5 q
z 3 6 r

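The to_clipboard method can be used to write the contents of a DataFrame to the clipboard. You can then paste the clipboard contents into other applications (CTRL-V on many operating systems). Here we illustrate writing a DataFrame to the clipboard and reading it back.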

>>> df = pd.DataFrame(
...     {"A": [1, 2, 3], "B": [4, 5, 6], "C": ["p", "q", "r"]}, index=["x", "y", "z"]
... )
>>> df
 A B C
x 1 4 p
y 2 5 q
z 3 6 r
>>> df.to_clipboard()
>>> pd.read_clipboard()
 A B C
x 1 4 p
y 2 5 q
z 3 6 r

We can see that we got back the same content that we had earlier written to the clipboard.
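
Because the clipboard text is handed to read_csv, the same parse can be reproduced without a clipboard by reading the text from a string buffer. In this sketch the whitespace separator `sep=r"\s+"` is assumed to mirror how read_clipboard splits fields by default, and `clipboard_text` is a stand-in for the copied text.

```py
import io

import pandas as pd

# Stand-in for the text that was copied to the clipboard.
clipboard_text = "A B C\nx 1 4 p\ny 2 5 q\nz 3 6 r\n"

# The header has one field fewer than the data rows, so the first
# column (x, y, z) is used as the index.
clipdf = pd.read_csv(io.StringIO(clipboard_text), sep=r"\s+")

print(clipdf)
```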

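Note

On Linux, you may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) to use these methods.

## Pickling

All pandas objects are equipped with to_pickle methods, which use Python's pickle module to save data structures to disk in the pickle format.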

In [436]: df
Out[436]: 
c1         a 
c2         b  d
lvl1 lvl2 
a    c     1  5
 d     2  6
b    c     3  7
 d     4  8
In [437]: df.to_pickle("foo.pkl")

The read_pickle function in the pandas namespace can be used to load any pickled pandas object (or any other pickled object) from a file:

In [438]: pd.read_pickle("foo.pkl")
Out[438]: 
c1         a 
c2         b  d
lvl1 lvl2 
a    c     1  5
 d     2  6
b    c     3  7
 d     4  8

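Warning

Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html

Warning

read_pickle() is only guaranteed to be backwards compatible for a few minor releases.

Compressed pickle files

read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can read and write compressed pickle files. The compression types gzip, bz2, xz, and zstd are supported for both reading and writing. The zip file format only supports reading, and the archive must contain only one data file to be read.

The compression type can be an explicit parameter or be inferred from the file extension. If 'infer', then gzip, bz2, zip, xz, or zstd is used when the filename ends in '.gz', '.bz2', '.zip', '.xz', or '.zst', respectively.

The compression parameter can also be a dict in order to pass options to the compression protocol. It must have a 'method' key set to the name of the compression protocol, which must be one of {'zip', 'gzip', 'bz2', 'xz', 'zstd'}. All other key-value pairs are passed to the underlying compression library.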

In [439]: df = pd.DataFrame(
 .....:    {
 .....:        "A": np.random.randn(1000),
 .....:        "B": "foo",
 .....:        "C": pd.date_range("20130101", periods=1000, freq="s"),
 .....:    }
 .....: )
 .....: 
In [440]: df
Out[440]: 
 A    B                   C
0   -0.317441  foo 2013-01-01 00:00:00
1   -1.236269  foo 2013-01-01 00:00:01
2    0.896171  foo 2013-01-01 00:00:02
3   -0.487602  foo 2013-01-01 00:00:03
4   -0.082240  foo 2013-01-01 00:00:04
..        ...  ...                 ...
995 -0.171092  foo 2013-01-01 00:16:35
996  1.786173  foo 2013-01-01 00:16:36
997 -0.575189  foo 2013-01-01 00:16:37
998  0.820750  foo 2013-01-01 00:16:38
999 -1.256530  foo 2013-01-01 00:16:39
[1000 rows x 3 columns]

Using an explicit compression type:

In [441]: df.to_pickle("data.pkl.compress", compression="gzip")
In [442]: rt = pd.read_pickle("data.pkl.compress", compression="gzip")
In [443]: rt
Out[443]: 
 A    B                   C
0   -0.317441  foo 2013-01-01 00:00:00
1   -1.236269  foo 2013-01-01 00:00:01
2    0.896171  foo 2013-01-01 00:00:02
3   -0.487602  foo 2013-01-01 00:00:03
4   -0.082240  foo 2013-01-01 00:00:04
..        ...  ...                 ...
995 -0.171092  foo 2013-01-01 00:16:35
996  1.786173  foo 2013-01-01 00:16:36
997 -0.575189  foo 2013-01-01 00:16:37
998  0.820750  foo 2013-01-01 00:16:38
999 -1.256530  foo 2013-01-01 00:16:39
[1000 rows x 3 columns]

Inferring the compression type from the extension:

In [444]: df.to_pickle("data.pkl.xz", compression="infer")
In [445]: rt = pd.read_pickle("data.pkl.xz", compression="infer")
In [446]: rt
Out[446]: 
 A    B                   C
0   -0.317441  foo 2013-01-01 00:00:00
1   -1.236269  foo 2013-01-01 00:00:01
2    0.896171  foo 2013-01-01 00:00:02
3   -0.487602  foo 2013-01-01 00:00:03
4   -0.082240  foo 2013-01-01 00:00:04
..        ...  ...                 ...
995 -0.171092  foo 2013-01-01 00:16:35
996  1.786173  foo 2013-01-01 00:16:36
997 -0.575189  foo 2013-01-01 00:16:37
998  0.820750  foo 2013-01-01 00:16:38
999 -1.256530  foo 2013-01-01 00:16:39
[1000 rows x 3 columns]

The default is to 'infer':

In [447]: df.to_pickle("data.pkl.gz")
In [448]: rt = pd.read_pickle("data.pkl.gz")
In [449]: rt
Out[449]: 
 A    B                   C
0   -0.317441  foo 2013-01-01 00:00:00
1   -1.236269  foo 2013-01-01 00:00:01
2    0.896171  foo 2013-01-01 00:00:02
3   -0.487602  foo 2013-01-01 00:00:03
4   -0.082240  foo 2013-01-01 00:00:04
..        ...  ...                 ...
995 -0.171092  foo 2013-01-01 00:16:35
996  1.786173  foo 2013-01-01 00:16:36
997 -0.575189  foo 2013-01-01 00:16:37
998  0.820750  foo 2013-01-01 00:16:38
999 -1.256530  foo 2013-01-01 00:16:39
[1000 rows x 3 columns]
In [450]: df["A"].to_pickle("s1.pkl.bz2")
In [451]: rt = pd.read_pickle("s1.pkl.bz2")
In [452]: rt
Out[452]: 
0     -0.317441
1     -1.236269
2      0.896171
3     -0.487602
4     -0.082240
 ... 
995   -0.171092
996    1.786173
997   -0.575189
998    0.820750
999   -1.256530
Name: A, Length: 1000, dtype: float64

Passing options to the compression protocol in order to speed up compression:

In [453]: df.to_pickle("data.pkl.gz", compression={"method": "gzip", "compresslevel": 1}) 
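
As a self-contained check of the dict form, the sketch below writes a small frame into a temporary directory, confirms that the file really is gzip data by looking at its first two bytes, and reads it back. The file names and the frame contents are made up for the example.

```py
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(1000, dtype="float64"), "B": "foo"})

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "data.pkl")
    packed = os.path.join(tmp, "data.pkl.gz")

    df.to_pickle(plain)  # no compression inferred from ".pkl"
    df.to_pickle(packed, compression={"method": "gzip", "compresslevel": 1})

    with open(packed, "rb") as fh:
        magic = fh.read(2)  # gzip streams start with bytes 0x1f 0x8b

    sizes = (os.path.getsize(plain), os.path.getsize(packed))
    restored = pd.read_pickle(packed)  # compression inferred from ".gz"

print("gzip header:", magic == b"\x1f\x8b")
print("plain vs gzip bytes:", sizes)
print("round trip equal:", restored.equals(df))
```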
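## msgpack

pandas removed support for `msgpack` in version 1.0.0. Using pickle instead is recommended.

Alternatively, you can use the Arrow IPC serialization format to transmit pandas objects. For documentation on pyarrow, see [here](https://arrow.apache.org/docs/python/ipc.html).

## HDF5 (PyTables)

`HDFStore` is a dict-like object that reads and writes pandas objects using the high-performance HDF5 format, via the excellent [PyTables](https://www.pytables.org/) library. See the cookbook for some advanced strategies.

Warning

pandas uses PyTables for reading and writing HDF5 files, which allows object data to be serialized with pickle. Loading pickled data received from untrusted sources can be unsafe.

For more information, see: [`docs.python.org/3/library/pickle.html`](https://docs.python.org/3/library/pickle.html).

```py
In [454]: store = pd.HDFStore("store.h5")
In [455]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
```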

Objects can be written to the file just like adding key-value pairs to a dict:

In [456]: index = pd.date_range("1/1/2000", periods=8)
In [457]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
In [458]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])
# store.put('s', s) is an equivalent method
In [459]: store["s"] = s
In [460]: store["df"] = df
In [461]: store
Out[461]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

In the current or a later Python session, you can retrieve stored objects:

# store.get('df') is an equivalent method
In [462]: store["df"]
Out[462]: 
 A         B         C
2000-01-01  0.858644 -0.851236  1.058006
2000-01-02 -0.080372 -1.268121  1.561967
2000-01-03  0.816983  1.965656 -1.169408
2000-01-04  0.712795 -0.062433  0.736755
2000-01-05 -0.298721 -1.988045  1.475308
2000-01-06  1.103675  1.382242 -0.650762
2000-01-07 -0.729161 -0.142928 -1.063038
2000-01-08 -1.005977  0.465222 -0.094517
# dotted (attribute) access provides get as well
In [463]: store.df
Out[463]: 
 A         B         C
2000-01-01  0.858644 -0.851236  1.058006
2000-01-02 -0.080372 -1.268121  1.561967
2000-01-03  0.816983  1.965656 -1.169408
2000-01-04  0.712795 -0.062433  0.736755
2000-01-05 -0.298721 -1.988045  1.475308
2000-01-06  1.103675  1.382242 -0.650762
2000-01-07 -0.729161 -0.142928 -1.063038
2000-01-08 -1.005977  0.465222 -0.094517

Deleting the object specified by the key:

# store.remove('df') is an equivalent method
In [464]: del store["df"]
In [465]: store
Out[465]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

Closing a store and using a context manager:

In [466]: store.close()
In [467]: store
Out[467]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [468]: store.is_open
Out[468]: False
# Working with, and automatically closing the store using a context manager
In [469]: with pd.HDFStore("store.h5") as store:
 .....:    store.keys()
 .....:

Read/write API

HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how read_csv and to_csv work.

In [470]: df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})
In [471]: df_tl.to_hdf("store_tl.h5", key="table", append=True)
In [472]: pd.read_hdf("store_tl.h5", "table", where=["index>2"])
Out[472]: 
 A  B
3  3  3
4  4  4

By default, HDFStore does not drop rows in which every value is missing. This behavior can be changed by setting dropna=True.

In [473]: df_with_missing = pd.DataFrame(
 .....:    {
 .....:        "col1": [0, np.nan, 2],
 .....:        "col2": [1, np.nan, np.nan],
 .....:    }
 .....: )
 .....: 
In [474]: df_with_missing
Out[474]: 
 col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN
In [475]: df_with_missing.to_hdf("file.h5", key="df_with_missing", format="table", mode="w")
In [476]: pd.read_hdf("file.h5", "df_with_missing")
Out[476]: 
 col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN
In [477]: df_with_missing.to_hdf(
 .....:    "file.h5", key="df_with_missing", format="table", mode="w", dropna=True
 .....: )
 .....: 
In [478]: pd.read_hdf("file.h5", "df_with_missing")
Out[478]: 
 col1  col2
0   0.0   1.0
2   2.0   NaN
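
Which rows are affected can be previewed in memory, without PyTables. The sketch below uses DataFrame.dropna(how="all") as an analogy for the dropna=True result shown above: only the row whose values are all missing goes away, while a row with some missing values is kept.

```py
import numpy as np
import pandas as pd

df_with_missing = pd.DataFrame(
    {
        "col1": [0, np.nan, 2],
        "col2": [1, np.nan, np.nan],
    }
)

# Row 1 is entirely NaN and is dropped; row 2 is only partly NaN and stays.
kept = df_with_missing.dropna(how="all")

print(kept)
```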

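Fixed format

The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format, called the fixed format. These types of stores are not appendable once written (though you can simply remove them and rewrite them). They are not queryable either; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format is used by default with put or to_hdf, or it can be specified with format='fixed' or format='f'.

Warning

A fixed format store raises a TypeError if you try to retrieve it using a where: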

In [479]: pd.DataFrame(np.random.randn(10, 2)).to_hdf("test_fixed.h5", key="df")
In [480]: pd.read_hdf("test_fixed.h5", "df", where="index>5")
---------------------------------------------------------------------------
TypeError  Traceback (most recent call last)
Cell In[480], line 1
----> 1 pd.read_hdf("test_fixed.h5", "df", where="index>5")
File ~/work/pandas/pandas/pandas/io/pytables.py:452, in read_hdf(path_or_buf, key, mode, errors, where, start, stop, columns, iterator, chunksize, **kwargs)
  447                 raise ValueError(
  448                     "key must be provided when HDF5 "
  449                     "file contains multiple datasets."
  450                 )
  451         key = candidate_only_group._v_pathname
--> 452     return store.select(
  453         key,
  454         where=where,
  455         start=start,
  456         stop=stop,
  457         columns=columns,
  458         iterator=iterator,
  459         chunksize=chunksize,
  460         auto_close=auto_close,
  461     )
  462 except (ValueError, TypeError, LookupError):
  463     if not isinstance(path_or_buf, HDFStore):
  464         # if there is an error, close the store if we opened it.
File ~/work/pandas/pandas/pandas/io/pytables.py:906, in HDFStore.select(self, key, where, start, stop, columns, iterator, chunksize, auto_close)
  892 # create the iterator
  893 it = TableIterator(
  894     self,
  895     s,
   (...)
  903     auto_close=auto_close,
  904 )
--> 906 return it.get_result()
File ~/work/pandas/pandas/pandas/io/pytables.py:2029, in TableIterator.get_result(self, coordinates)
  2026     where = self.where
  2028 # directly return the result
-> 2029 results = self.func(self.start, self.stop, where)
  2030 self.close()
  2031 return results
File ~/work/pandas/pandas/pandas/io/pytables.py:890, in HDFStore.select.<locals>.func(_start, _stop, _where)
  889 def func(_start, _stop, _where):
--> 890     return s.read(start=_start, stop=_stop, where=_where, columns=columns)
File ~/work/pandas/pandas/pandas/io/pytables.py:3278, in BlockManagerFixed.read(self, where, columns, start, stop)
  3270 def read(
  3271     self,
  3272     where=None,
   (...)
  3276 ) -> DataFrame:
  3277     # start, stop applied to rows, so 0th axis only
-> 3278     self.validate_read(columns, where)
  3279     select_axis = self.obj_type()._get_block_manager_axis(0)
  3281     axes = []
File ~/work/pandas/pandas/pandas/io/pytables.py:2922, in GenericFixed.validate_read(self, columns, where)
  2917     raise TypeError(
  2918         "cannot pass a column specification when reading "
  2919         "a Fixed format store. this store must be selected in its entirety"
  2920     )
  2921 if where is not None:
-> 2922     raise TypeError(
  2923         "cannot pass a where specification when reading "
  2924         "from a Fixed format store. this store must be selected in its entirety"
  2925     )
TypeError: cannot pass a where specification when reading from a Fixed format store. this store must be selected in its entirety 
### Table format

`HDFStore` supports another `PyTables` format on disk, the `table` format. Conceptually, a `table` is shaped very much like a DataFrame, with rows and columns. A `table` may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by passing `format='table'` or `format='t'` to `append`, `put`, or `to_hdf`.

This format can also be set as an option, `pd.set_option('io.hdf.default_format','table')`, so that `put/append/to_hdf` store in the `table` format by default.

```py
In [481]: store = pd.HDFStore("store.h5")
In [482]: df1 = df[0:4]
In [483]: df2 = df[4:]
# append data (creates a table automatically)
In [484]: store.append("df", df1)
In [485]: store.append("df", df2)
In [486]: store
Out[486]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
# select the entire object
In [487]: store.select("df")
Out[487]: 
                   A         B         C
2000-01-01  0.858644 -0.851236  1.058006
2000-01-02 -0.080372 -1.268121  1.561967
2000-01-03  0.816983  1.965656 -1.169408
2000-01-04  0.712795 -0.062433  0.736755
2000-01-05 -0.298721 -1.988045  1.475308
2000-01-06  1.103675  1.382242 -0.650762
2000-01-07 -0.729161 -0.142928 -1.063038
2000-01-08 -1.005977  0.465222 -0.094517
# the type of stored data
In [488]: store.root.df._v_attrs.pandas_type
Out[488]: 'frame_table'
```

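Note

You can also create a table by passing format='table' or format='t' to a put operation.

### Hierarchical keys

Keys to a store can be specified as strings. These can be in a hierarchical path-name format (e.g. foo/bar/bah), which generates a hierarchy of sub-stores (or Groups in PyTables parlance). Keys can be specified without the leading '/' and are always absolute (e.g. 'foo' refers to '/foo'). Removal operations remove everything in the sub-store and below, so be careful.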

In [489]: store.put("foo/bar/bah", df)
In [490]: store.append("food/orange", df)
In [491]: store.append("food/apple", df)
In [492]: store
Out[492]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
# a list of keys are returned
In [493]: store.keys()
Out[493]: ['/df', '/food/apple', '/food/orange', '/foo/bar/bah']
# remove all nodes under this level
In [494]: store.remove("food")
In [495]: store
Out[495]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
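
The key rules at work in this session can be modelled with plain strings. The helpers `normalize_key` and `keys_removed_by` below are hypothetical and only mimic the behaviour described in the text (keys are absolute, and removal takes the whole subtree); they do not touch an HDF5 file.

```py
def normalize_key(key: str) -> str:
    """Keys are always absolute: 'foo/bar' and '/foo/bar' name the same node."""
    return "/" + key.strip("/")


def keys_removed_by(prefix: str, keys: list) -> list:
    """Keys that disappear when the node `prefix` is removed with its subtree."""
    root = normalize_key(prefix)
    return [k for k in keys if k == root or k.startswith(root + "/")]


keys = ["/df", "/food/apple", "/food/orange", "/foo/bar/bah"]

print(normalize_key("foo/bar/bah"))
print(keys_removed_by("food", keys))
```

Note that removing "food" in this model leaves "/foo/bar/bah" alone, because matching is done on whole path components rather than on raw string prefixes.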

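You can walk through the group hierarchy using the walk method, which yields a tuple for each group key along with the relative keys of its contents.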

In [496]: for (path, subgroups, subkeys) in store.walk():
 .....:    for subgroup in subgroups:
 .....:        print("GROUP: {}/{}".format(path, subgroup))
 .....:    for subkey in subkeys:
 .....:        key = "/".join([path, subkey])
 .....:        print("KEY: {}".format(key))
 .....:        print(store.get(key))
 .....: 
GROUP: /foo
KEY: /df
 A         B         C
2000-01-01  0.858644 -0.851236  1.058006
2000-01-02 -0.080372 -1.268121  1.561967
2000-01-03  0.816983  1.965656 -1.169408
2000-01-04  0.712795 -0.062433  0.736755
2000-01-05 -0.298721 -1.988045  1.475308
2000-01-06  1.103675  1.382242 -0.650762
2000-01-07 -0.729161 -0.142928 -1.063038
2000-01-08 -1.005977  0.465222 -0.094517
GROUP: /foo/bar
KEY: /foo/bar/bah
 A         B         C
2000-01-01  0.858644 -0.851236  1.058006
2000-01-02 -0.080372 -1.268121  1.561967
2000-01-03  0.816983  1.965656 -1.169408
2000-01-04  0.712795 -0.062433  0.736755
2000-01-05 -0.298721 -1.988045  1.475308
2000-01-06  1.103675  1.382242 -0.650762
2000-01-07 -0.729161 -0.142928 -1.063038
2000-01-08 -1.005977  0.465222 -0.094517

Warning

Hierarchical keys cannot be retrieved via dotted (attribute) access, as described above for items stored under the root node.

In [497]: store.foo.bar.bah
---------------------------------------------------------------------------
TypeError  Traceback (most recent call last)
Cell In[497], line 1
----> 1 store.foo.bar.bah
File ~/work/pandas/pandas/pandas/io/pytables.py:613, in HDFStore.__getattr__(self, name)
  611  """allow attribute access to get stores"""
  612 try:
--> 613     return self.get(name)
  614 except (KeyError, ClosedFileError):
  615     pass
File ~/work/pandas/pandas/pandas/io/pytables.py:813, in HDFStore.get(self, key)
  811 if group is None:
  812     raise KeyError(f"No object named {key} in the file")
--> 813 return self._read_group(group)
File ~/work/pandas/pandas/pandas/io/pytables.py:1878, in HDFStore._read_group(self, group)
  1877 def _read_group(self, group: Node):
-> 1878     s = self._create_storer(group)
  1879     s.infer_axes()
  1880     return s.read()
File ~/work/pandas/pandas/pandas/io/pytables.py:1752, in HDFStore._create_storer(self, group, format, value, encoding, errors)
  1750         tt = "generic_table"
  1751     else:
-> 1752         raise TypeError(
  1753             "cannot create a storer if the object is not existing "
  1754             "nor a value are passed"
  1755         )
  1756 else:
  1757     if isinstance(value, Series):
TypeError: cannot create a storer if the object is not existing nor a value are passed

# you can directly access the actual PyTables node but using the root node
In [498]: store.root.foo.bar.bah
Out[498]: 
/foo/bar/bah (Group) ''
 children := ['axis0' (Array), 'axis1' (Array), 'block0_items' (Array), 'block0_values' (Array)]

Instead, use explicit string-based keys:

In [499]: store["foo/bar/bah"]
Out[499]: 
 A         B         C
2000-01-01  0.858644 -0.851236  1.058006
2000-01-02 -0.080372 -1.268121  1.561967
2000-01-03  0.816983  1.965656 -1.169408
2000-01-04  0.712795 -0.062433  0.736755
2000-01-05 -0.298721 -1.988045  1.475308
2000-01-06  1.103675  1.382242 -0.650762
2000-01-07 -0.729161 -0.142928 -1.063038
2000-01-08 -1.005977  0.465222 -0.094517 
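### Storing types

#### Storing mixed types in a table

Storing mixed-dtype data is supported. Strings are stored as fixed-width, using the maximum size of the appended column. Subsequent attempts to append longer strings raise a `ValueError`.

Passing `min_itemsize={'values': size}` as a parameter to append sets a larger minimum for the string columns. Storing `floats, strings, ints, bools, datetime64` is currently supported. For string columns, passing `nan_rep = 'nan'` to append changes the default nan representation on disk (which converts to/from `np.nan`); this defaults to `nan`.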
```py
In [500]: df_mixed = pd.DataFrame(
 .....:    {
 .....:        "A": np.random.randn(8),
 .....:        "B": np.random.randn(8),
 .....:        "C": np.array(np.random.randn(8), dtype="float32"),
 .....:        "string": "string",
 .....:        "int": 1,
 .....:        "bool": True,
 .....:        "datetime64": pd.Timestamp("20010102"),
 .....:    },
 .....:    index=list(range(8)),
 .....: )
 .....: 
In [501]: df_mixed.loc[df_mixed.index[3:5], ["A", "B", "string", "datetime64"]] = np.nan
In [502]: store.append("df_mixed", df_mixed, min_itemsize={"values": 50})
In [503]: df_mixed1 = store.select("df_mixed")
In [504]: df_mixed1
Out[504]: 
 A         B         C  ... int  bool                    datetime64
0  0.013747 -1.166078 -1.292080  ...   1  True 1970-01-01 00:00:00.978393600
1 -0.712009  0.247572  1.526911  ...   1  True 1970-01-01 00:00:00.978393600
2 -0.645096  1.687406  0.288504  ...   1  True 1970-01-01 00:00:00.978393600
3       NaN       NaN  0.097771  ...   1  True                           NaT
4       NaN       NaN  1.536408  ...   1  True                           NaT
5 -0.023202  0.043702  0.926790  ...   1  True 1970-01-01 00:00:00.978393600
6  2.359782  0.088224 -0.676448  ...   1  True 1970-01-01 00:00:00.978393600
7 -0.143428 -0.813360 -0.179724  ...   1  True 1970-01-01 00:00:00.978393600
[8 rows x 7 columns]
In [505]: df_mixed1.dtypes.value_counts()
Out[505]: 
float64           2
float32           1
object            1
int64             1
bool              1
datetime64[ns]    1
Name: count, dtype: int64
# we have provided a minimum string column size
In [506]: store.root.df_mixed.table
Out[506]: 
/df_mixed/table (Table(8,)) ''
 description := {
 "index": Int64Col(shape=(), dflt=0, pos=0),
 "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
 "values_block_1": Float32Col(shape=(1,), dflt=0.0, pos=2),
 "values_block_2": StringCol(itemsize=50, shape=(1,), dflt=b'', pos=3),
 "values_block_3": Int64Col(shape=(1,), dflt=0, pos=4),
 "values_block_4": BoolCol(shape=(1,), dflt=False, pos=5),
 "values_block_5": Int64Col(shape=(1,), dflt=0, pos=6)}
 byteorder := 'little'
 chunkshape := (689,)
 autoindex := True
 colindexes := {
 "index": Index(6, mediumshuffle, zlib(1)).is_csi=False}

```

Continued in: Pandas 2.2 Official Tutorials and Guides, Chinese edition (10·1) (4)
