Pandas 2.2 Official Tutorials and Guides, Chinese Edition (10.2) (Part 1) - Alibaba Cloud Developer Community

Storing MultiIndex `DataFrames`

Storing MultiIndex DataFrames as tables is very similar to storing/selecting from homogeneous index DataFrames.

In [507]: index = pd.MultiIndex(
 .....:   levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
 .....:   codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
 .....:   names=["foo", "bar"],
 .....: )
 .....: 
In [508]: df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
In [509]: df_mi
Out[509]: 
 A         B         C
foo bar 
foo one   -1.303456 -0.642994 -0.649456
 two    1.012694  0.414147  1.950460
 three  1.094544 -0.802899 -0.583343
bar one    0.410395  0.618321  0.560398
 two    1.434027 -0.033270  0.343197
baz two   -1.646063 -0.695847 -0.429156
 three -0.244688 -1.428229 -0.138691
qux one    1.866184 -1.446617  0.036660
 two   -1.660522  0.929553 -1.298649
 three  3.565769  0.682402  1.041927
In [510]: store.append("df_mi", df_mi)
In [511]: store.select("df_mi")
Out[511]: 
 A         B         C
foo bar 
foo one   -1.303456 -0.642994 -0.649456
 two    1.012694  0.414147  1.950460
 three  1.094544 -0.802899 -0.583343
bar one    0.410395  0.618321  0.560398
 two    1.434027 -0.033270  0.343197
baz two   -1.646063 -0.695847 -0.429156
 three -0.244688 -1.428229 -0.138691
qux one    1.866184 -1.446617  0.036660
 two   -1.660522  0.929553 -1.298649
 three  3.565769  0.682402  1.041927
# the levels are automatically included as data columns
In [512]: store.select("df_mi", "foo=bar")
Out[512]: 
 A         B         C
foo bar 
bar one  0.410395  0.618321  0.560398
 two  1.434027 -0.033270  0.343197

Note

The index keyword is reserved and cannot be used as a level name.

### Querying

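Querying a table

select and delete operations have an optional criterion that can be specified to select/delete only a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.

A query is specified using the Term class under the hood, as a boolean expression.

index and columns are supported indexers of DataFrames.
If data_columns are specified, these can be used as additional indexers.
Level names in a MultiIndex can also be used; if none are provided, the default names are level_0, level_1, and so on.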

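Valid comparison operators are:

=, ==, !=, >, >=, <, <=

Valid boolean expressions are combined with:

|: or
&: and
( and ): for grouping

These rules are similar to how boolean expressions are used in pandas for indexing.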

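Note

= will be automatically expanded to the comparison operator ==
~ is the not operator, but can only be used in very limited circumstances
If a list/tuple of expressions is passed, they will be combined via &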
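Since these rules mirror ordinary pandas boolean indexing, it can help to see the in-memory counterpart of a query string. The following is a minimal sketch, assuming a small hypothetical frame with a column named `string`; it does not touch an HDFStore and only illustrates how `&`, `|`, `~`, and parentheses compose.

```python
import pandas as pd

# Hypothetical in-memory frame; the column name "string" mirrors the query examples.
df = pd.DataFrame(
    {"string": ["foo", "bar", "foo", "bar", "foo", "foo", "foo", "bar"]},
    index=pd.date_range("20000101", periods=8),
)

# In-memory counterpart of the query string
# '(index > df.index[3] & index <= df.index[6]) | string = "bar"'
in_range = (df.index > df.index[3]) & (df.index <= df.index[6])
is_bar = (df["string"] == "bar").to_numpy()
mask = in_range | is_bar

selected = df[mask]
rejected = df[~mask]  # ~ negates the whole mask
print(len(selected), len(rejected))
```

Each comparison needs its own parentheses in memory because `&` and `|` bind more tightly than comparisons in Python.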

The following are valid expressions:

'index >= date'
"columns = ['A', 'D']"
"columns in ['A', 'D']"
'columns = A'
'columns == A'
"~(columns = ['A', 'B'])"
'index > df.index[3] & string = "bar"'
'(index > df.index[3] & index <= df.index[6]) | string = "bar"'
"ts >= Timestamp('2012-02-01')"
"major_axis>=20130101"

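The indexers are on the left-hand side of the sub-expression:

columns, major_axis, ts

The right-hand side of the sub-expression (after a comparison operator) can be:

functions that will be evaluated, e.g. Timestamp('2012-02-01')
strings, e.g. "bar"
date-like values, e.g. 20130101, or "20130101"
lists, e.g. "['A', 'B']"
variables that are defined in the local namespace, e.g. date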

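Note

Passing a string to a query by interpolating it into the query expression is not recommended. Simply assign the string of interest to a variable and use that variable in the expression. For example, do this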

string = "HolyMoly'"
store.select("df", "index == string")

instead of this

string = "HolyMoly'"
store.select('df', f'index == {string}')

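The latter will not work and will raise a SyntaxError. Note that there is a single quote followed by a double quote in the string variable.

If you must interpolate, use the '%r' format specifier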

store.select("df", "index == %r" % string)

which will quote string.
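To see why `%r` helps, the sketch below builds both expressions and checks them with Python's own `ast` parser. This is only a stand-in: it assumes that an expression Python cannot parse will also be rejected by the query parser, and it does not call `store.select` at all. The helper `parses` is hypothetical.

```python
import ast

string = "HolyMoly'"

naive = f"index == {string}"      # the quote character leaks into the expression
quoted = "index == %r" % string   # repr() wraps the value in matching quotes


def parses(expr):
    """Return True if expr is a syntactically valid Python expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False


print(naive, parses(naive))
print(quoted, parses(quoted))
```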

Here are some examples:

In [513]: dfq = pd.DataFrame(
 .....:    np.random.randn(10, 4),
 .....:    columns=list("ABCD"),
 .....:    index=pd.date_range("20130101", periods=10),
 .....: )
 .....: 
In [514]: store.append("dfq", dfq, format="table", data_columns=True)

Use boolean expressions, with in-line function evaluation.

In [515]: store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")
Out[515]: 
 A         B
2013-01-05 -0.830545 -0.457071
2013-01-06  0.431186  1.049421
2013-01-07  0.617509 -0.811230
2013-01-08  0.947422 -0.671233
2013-01-09 -0.183798 -1.211230
2013-01-10  0.361428  0.887304

Use inline column reference.

In [516]: store.select("dfq", where="A>0 or C>0")
Out[516]: 
 A         B         C         D
2013-01-02  0.658179  0.362814 -0.917897  0.010165
2013-01-03  0.905122  1.848731 -1.184241  0.932053
2013-01-05 -0.830545 -0.457071  1.565581  1.148032
2013-01-06  0.431186  1.049421  0.383309  0.595013
2013-01-07  0.617509 -0.811230 -2.088563 -1.393500
2013-01-08  0.947422 -0.671233 -0.847097 -1.187785
2013-01-10  0.361428  0.887304  0.266457 -0.399641

The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing 'columns=list_of_columns_to_filter':

In [517]: store.select("df", "columns=['A', 'B']")
Out[517]: 
 A         B
2000-01-01  0.858644 -0.851236
2000-01-02 -0.080372 -1.268121
2000-01-03  0.816983  1.965656
2000-01-04  0.712795 -0.062433
2000-01-05 -0.298721 -1.988045
2000-01-06  1.103675  1.382242
2000-01-07 -0.729161 -0.142928
2000-01-08 -1.005977  0.465222

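start and stop parameters can be specified to limit the total search space. These are in terms of the total number of rows in a table.

Note

select will raise a ValueError if the query expression has an unknown variable reference. Usually this means that you are trying to select on a column that is not a data_column.

select will raise a SyntaxError if the query expression is not valid.

Querying timedelta64[ns]

You can store and query using the timedelta64[ns] type. Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional), and unit can be D, s, ms, us, or ns for the timedelta. Here is an example: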

In [518]: from datetime import timedelta
In [519]: dftd = pd.DataFrame(
 .....:    {
 .....:        "A": pd.Timestamp("20130101"),
 .....:        "B": [
 .....:            pd.Timestamp("20130101") + timedelta(days=i, seconds=10)
 .....:            for i in range(10)
 .....:        ],
 .....:    }
 .....: )
 .....: 
In [520]: dftd["C"] = dftd["A"] - dftd["B"]
In [521]: dftd
Out[521]: 
 A                   B                  C
0 2013-01-01 2013-01-01 00:00:10  -1 days +23:59:50
1 2013-01-01 2013-01-02 00:00:10  -2 days +23:59:50
2 2013-01-01 2013-01-03 00:00:10  -3 days +23:59:50
3 2013-01-01 2013-01-04 00:00:10  -4 days +23:59:50
4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
In [522]: store.append("dftd", dftd, data_columns=True)
In [523]: store.select("dftd", "C<'-3.5D'")
Out[523]: 
 A                   B                  C
4 1970-01-01 00:00:01.356998400 2013-01-05 00:00:10  -5 days +23:59:50
5 1970-01-01 00:00:01.356998400 2013-01-06 00:00:10  -6 days +23:59:50
6 1970-01-01 00:00:01.356998400 2013-01-07 00:00:10  -7 days +23:59:50
7 1970-01-01 00:00:01.356998400 2013-01-08 00:00:10  -8 days +23:59:50
8 1970-01-01 00:00:01.356998400 2013-01-09 00:00:10  -9 days +23:59:50
9 1970-01-01 00:00:01.356998400 2013-01-10 00:00:10 -10 days +23:59:50 
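The threshold string in the query above can be checked in memory. This is a minimal sketch that rebuilds the same frame without an HDFStore and assumes `pd.Timedelta` parses `"-3.5D"` the same way the query term does; it only shows which rows fall below minus three and a half days.

```python
from datetime import timedelta

import pandas as pd

start = pd.Timestamp("20130101")
dftd = pd.DataFrame(
    {"A": start, "B": [start + timedelta(days=i, seconds=10) for i in range(10)]}
)
dftd["C"] = dftd["A"] - dftd["B"]

# Signed, fractional value followed by a unit, as in the query term "C<'-3.5D'"
threshold = pd.Timedelta("-3.5D")
matches = dftd[dftd["C"] < threshold]
print(matches.index.tolist())
```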
#### Querying a MultiIndex

Selecting from a `MultiIndex` can be achieved by using the name of the level.
In [524]: df_mi.index.names
Out[524]: FrozenList(['foo', 'bar'])
In [525]: store.select("df_mi", "foo=baz and bar=two")
Out[525]: 
 A         B         C
foo bar 
baz two -1.646063 -0.695847 -0.429156

If the MultiIndex level names are None, the levels are automatically made available via the level_n keyword, with n the level of the MultiIndex you want to select from.

In [526]: index = pd.MultiIndex(
 .....:    levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
 .....:    codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
 .....: )
 .....: 
In [527]: df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
In [528]: df_mi_2
Out[528]: 
 A         B         C
foo one   -0.219582  1.186860 -1.437189
 two    0.053768  1.872644 -1.469813
 three -0.564201  0.876341  0.407749
bar one   -0.232583  0.179812  0.922152
 two   -1.820952 -0.641360  2.133239
baz two   -0.941248 -0.136307 -1.271305
 three -0.099774 -0.061438 -0.845172
qux one    0.465793  0.756995 -0.541690
 two   -0.802241  0.877657 -2.553831
 three  0.094899 -2.319519  0.293601
In [529]: store.append("df_mi_2", df_mi_2)
# the levels are automatically included as data columns with keyword level_n
In [530]: store.select("df_mi_2", "level_0=foo and level_1=two")
Out[530]: 
 A         B         C
foo two  0.053768  1.872644 -1.469813
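The same `level_n` naming shows up in memory when an unnamed MultiIndex is flattened with `reset_index`. The sketch below uses a small hypothetical frame and makes no claim about how HDFStore derives the names internally; it only shows that the naming convention and the selection logic line up.

```python
import numpy as np
import pandas as pd

# Unnamed two-level index, analogous to df_mi_2 above but smaller
index = pd.MultiIndex.from_product([["foo", "bar"], ["one", "two"]])
df = pd.DataFrame({"A": np.arange(4.0)}, index=index)

flat = df.reset_index()  # unnamed levels become columns level_0, level_1
hit = flat[(flat["level_0"] == "foo") & (flat["level_1"] == "two")]
print(list(flat.columns))
print(hit)
```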

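Indexing

You can create/modify an index for a table with create_table_index after data is already in the table (after an append/put operation). Creating a table index is highly encouraged. This will speed up your queries a great deal when you use a select with the indexed dimension as the where.

Note

Indexes are automagically created on the indexables and any data columns you specify. This behavior can be turned off by passing index=False to append.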

# we have automagically already created an index (in the first section)
In [531]: i = store.root.df.table.cols.index.index
In [532]: i.optlevel, i.kind
Out[532]: (6, 'medium')
# change an index by passing new parameters
In [533]: store.create_table_index("df", optlevel=9, kind="full")
In [534]: i = store.root.df.table.cols.index.index
In [535]: i.optlevel, i.kind
Out[535]: (9, 'full')

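Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate it at the end.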

In [536]: df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
In [537]: df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
In [538]: st = pd.HDFStore("appends.h5", mode="w")
In [539]: st.append("df", df_1, data_columns=["B"], index=False)
In [540]: st.append("df", df_2, data_columns=["B"], index=False)
In [541]: st.get_storer("df").table
Out[541]: 
/df/table (Table(20,)) ''
 description := {
 "index": Int64Col(shape=(), dflt=0, pos=0),
 "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
 "B": Float64Col(shape=(), dflt=0.0, pos=2)}
 byteorder := 'little'
 chunkshape := (2730,)

Then create the index when finished appending.

In [542]: st.create_table_index("df", columns=["B"], optlevel=9, kind="full")
In [543]: st.get_storer("df").table
Out[543]: 
/df/table (Table(20,)) ''
 description := {
 "index": Int64Col(shape=(), dflt=0, pos=0),
 "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
 "B": Float64Col(shape=(), dflt=0.0, pos=2)}
 byteorder := 'little'
 chunkshape := (2730,)
 autoindex := True
 colindexes := {
 "B": Index(9, fullshuffle, zlib(1)).is_csi=True}
In [544]: st.close()

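See here for how to create a completely sorted index (CSI) on an existing store.

Query via data columns

You can designate (and index) certain columns that you want to be able to perform queries on (other than the indexable columns, which you can always query). For instance, say you want to perform this common operation on disk and return just the frame that matches this query. You can specify data_columns=True to force all columns to be data_columns.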

In [545]: df_dc = df.copy()
In [546]: df_dc["string"] = "foo"
In [547]: df_dc.loc[df_dc.index[4:6], "string"] = np.nan
In [548]: df_dc.loc[df_dc.index[7:9], "string"] = "bar"
In [549]: df_dc["string2"] = "cool"
In [550]: df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0
In [551]: df_dc
Out[551]: 
 A         B         C string string2
2000-01-01  0.858644 -0.851236  1.058006    foo    cool
2000-01-02 -0.080372  1.000000  1.000000    foo    cool
2000-01-03  0.816983  1.000000  1.000000    foo    cool
2000-01-04  0.712795 -0.062433  0.736755    foo    cool
2000-01-05 -0.298721 -1.988045  1.475308    NaN    cool
2000-01-06  1.103675  1.382242 -0.650762    NaN    cool
2000-01-07 -0.729161 -0.142928 -1.063038    foo    cool
2000-01-08 -1.005977  0.465222 -0.094517    bar    cool
# on-disk operations
In [552]: store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])
In [553]: store.select("df_dc", where="B > 0")
Out[553]: 
 A         B         C string string2
2000-01-02 -0.080372  1.000000  1.000000    foo    cool
2000-01-03  0.816983  1.000000  1.000000    foo    cool
2000-01-06  1.103675  1.382242 -0.650762    NaN    cool
2000-01-08 -1.005977  0.465222 -0.094517    bar    cool
# getting creative
In [554]: store.select("df_dc", "B > 0 & C > 0 & string == foo")
Out[554]: 
 A    B    C string string2
2000-01-02 -0.080372  1.0  1.0    foo    cool
2000-01-03  0.816983  1.0  1.0    foo    cool
# this is in-memory version of this type of selection
In [555]: df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")]
Out[555]: 
 A    B    C string string2
2000-01-02 -0.080372  1.0  1.0    foo    cool
2000-01-03  0.816983  1.0  1.0    foo    cool
# we have automagically created this index and the B/C/string/string2
# columns are stored separately as ``PyTables`` columns
In [556]: store.root.df_dc.table
Out[556]: 
/df_dc/table (Table(8,)) ''
 description := {
 "index": Int64Col(shape=(), dflt=0, pos=0),
 "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
 "B": Float64Col(shape=(), dflt=0.0, pos=2),
 "C": Float64Col(shape=(), dflt=0.0, pos=3),
 "string": StringCol(itemsize=3, shape=(), dflt=b'', pos=4),
 "string2": StringCol(itemsize=4, shape=(), dflt=b'', pos=5)}
 byteorder := 'little'
 chunkshape := (1680,)
 autoindex := True
 colindexes := {
 "index": Index(6, mediumshuffle, zlib(1)).is_csi=False,
 "B": Index(6, mediumshuffle, zlib(1)).is_csi=False,
 "C": Index(6, mediumshuffle, zlib(1)).is_csi=False,
 "string": Index(6, mediumshuffle, zlib(1)).is_csi=False,
 "string2": Index(6, mediumshuffle, zlib(1)).is_csi=False}

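There is some performance degradation from making lots of columns into data columns, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation (of course, you can simply read in the data and create a new table!).

Iterator

You can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to return an iterator on the results. The default is 50,000 rows returned in a chunk.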

In [557]: for df in store.select("df", chunksize=3):
 .....:    print(df)
 .....: 
 A         B         C
2000-01-01  0.858644 -0.851236  1.058006
2000-01-02 -0.080372 -1.268121  1.561967
2000-01-03  0.816983  1.965656 -1.169408
 A         B         C
2000-01-04  0.712795 -0.062433  0.736755
2000-01-05 -0.298721 -1.988045  1.475308
2000-01-06  1.103675  1.382242 -0.650762
 A         B         C
2000-01-07 -0.729161 -0.142928 -1.063038
2000-01-08 -1.005977  0.465222 -0.094517

Note

You can also use the iterator with read_hdf, which will open the store and then automatically close it when finished iterating.

for df in pd.read_hdf("store.h5", "df", chunksize=3):
    print(df)

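Note that the chunksize keyword applies to the source rows. So if you are doing a query, the chunksize will subdivide the total rows in the table and apply the query to each piece, returning an iterator over chunks of potentially unequal size.

Here is a recipe for generating a query and using it to create equal-sized return chunks.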

In [558]: dfeq = pd.DataFrame({"number": np.arange(1, 11)})
In [559]: dfeq
Out[559]: 
 number
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10
In [560]: store.append("dfeq", dfeq, data_columns=["number"])
In [561]: def chunks(l, n):
 .....:    return [l[i: i + n] for i in range(0, len(l), n)]
 .....: 
In [562]: evens = [2, 4, 6, 8, 10]
In [563]: coordinates = store.select_as_coordinates("dfeq", "number=evens")
In [564]: for c in chunks(coordinates, 2):
 .....:    print(store.select("dfeq", where=c))
 .....: 
 number
1       2
3       4
 number
5       6
7       8
 number
9      10

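Advanced queries

Select a single column

To retrieve a single indexable or data column, use the method select_column. This will, for example, enable you to get the index very quickly. These return a Series of the result, indexed by the row number. These do not currently accept the where selector.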

In [565]: store.select_column("df_dc", "index")
Out[565]: 
0   2000-01-01
1   2000-01-02
2   2000-01-03
3   2000-01-04
4   2000-01-05
5   2000-01-06
6   2000-01-07
7   2000-01-08
Name: index, dtype: datetime64[ns]
In [566]: store.select_column("df_dc", "string")
Out[566]: 
0    foo
1    foo
2    foo
3    foo
4    NaN
5    NaN
6    foo
7    bar
Name: string, dtype: object

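Selecting coordinates

Sometimes you want to get the coordinates (a.k.a. the index locations) of your query. This returns an Index of the resulting locations. These coordinates can also be passed to subsequent where operations.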

In [567]: df_coord = pd.DataFrame(
 .....:    np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
 .....: )
 .....: 
In [568]: store.append("df_coord", df_coord)
In [569]: c = store.select_as_coordinates("df_coord", "index > 20020101")
In [570]: c
Out[570]: 
Index([732, 733, 734, 735, 736, 737, 738, 739, 740, 741,
 ...
 990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
 dtype='int64', length=268)
In [571]: store.select("df_coord", where=c)
Out[571]: 
 0         1
2002-01-02  0.007717  1.168386
2002-01-03  0.759328 -0.638934
2002-01-04 -1.154018 -0.324071
2002-01-05 -0.804551 -1.280593
2002-01-06 -0.047208  1.260503
...              ...       ...
2002-09-22 -1.139583  0.344316
2002-09-23 -0.760643 -1.306704
2002-09-24  0.059018  1.775482
2002-09-25  1.242255 -0.055457
2002-09-26  0.410317  2.194489
[268 rows x 2 columns] 
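##### Selecting using a where mask

Sometimes your query can involve creating a list of rows to select. Usually this `mask` would be a resulting `index` from an indexing operation. This example selects the rows of a DatetimeIndex whose month is 5.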
In [572]: df_mask = pd.DataFrame(
 .....:    np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
 .....: )
 .....: 
In [573]: store.append("df_mask", df_mask)
In [574]: c = store.select_column("df_mask", "index")
In [575]: where = c[pd.DatetimeIndex(c).month == 5].index
In [576]: store.select("df_mask", where=where)
Out[576]: 
 0         1
2000-05-01  1.479511  0.516433
2000-05-02 -0.334984 -1.493537
2000-05-03  0.900321  0.049695
2000-05-04  0.614266 -1.077151
2000-05-05  0.233881  0.493246
...              ...       ...
2002-05-27  0.294122  0.457407
2002-05-28 -1.102535  1.215650
2002-05-29 -0.432911  0.753606
2002-05-30 -1.105212  2.311877
2002-05-31  2.567296  2.610691
[93 rows x 2 columns]
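What ends up in `where` is simply a set of integer row positions. The sketch below reproduces that step without a store, using a plain Series of timestamps as a stand-in for the result of `store.select_column("df_mask", "index")`.

```python
import pandas as pd

# Stand-in for store.select_column("df_mask", "index"):
# a Series of timestamps indexed by row number.
c = pd.Series(pd.date_range("20000101", periods=1000))

# Row numbers whose timestamp falls in May (of any year)
where = c[pd.DatetimeIndex(c).month == 5].index
print(where[:3].tolist(), len(where))
```

The first matching position is 121 because 2000 is a leap year, so 1 May is the 122nd day of the range.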

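Storer object

If you want to inspect the stored object, retrieve it via get_storer. You could use this programmatically to, say, get the number of rows in an object.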

In [577]: store.get_storer("df_dc").nrows
Out[577]: 8

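Multiple table queries

The methods append_to_multiple and select_as_multiple can perform appending/selecting from multiple tables at once. The idea is to have one table (call it the selector table) in which you index most/all of the columns and perform your queries. The other tables are data tables with an index matching the selector table's index. You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.

The append_to_multiple method splits a given single DataFrame into multiple tables according to d, a dictionary that maps the table names to a list of 'columns' you want in that table. If None is used in place of a list, that table will have the remaining unspecified columns of the given DataFrame. The argument selector defines which table is the selector table (which you can make queries from). The argument dropna will drop rows from the input DataFrame to ensure tables are synchronized. This means that if a row for one of the tables being written to is entirely np.nan, that row will be dropped from all tables.

If dropna is False, the user is responsible for synchronizing the tables. Remember that entirely np.nan rows are not written to the HDFStore, so if you choose to call dropna=False, some tables may have more rows than others, and therefore select_as_multiple may not work or it may return unexpected results.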

In [578]: df_mt = pd.DataFrame(
 .....:    np.random.randn(8, 6),
 .....:    index=pd.date_range("1/1/2000", periods=8),
 .....:    columns=["A", "B", "C", "D", "E", "F"],
 .....: )
 .....: 
In [579]: df_mt["foo"] = "bar"
In [580]: df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan
# you can also create the tables individually
In [581]: store.append_to_multiple(
 .....:    {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
 .....: )
 .....: 
In [582]: store
Out[582]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
# individual tables were created
In [583]: store.select("df1_mt")
Out[583]: 
 A         B
2000-01-01  0.162291 -0.430489
2000-01-02       NaN       NaN
2000-01-03  0.429207 -1.099274
2000-01-04  1.869081 -1.466039
2000-01-05  0.092130 -1.726280
2000-01-06  0.266901 -0.036854
2000-01-07 -0.517871 -0.990317
2000-01-08 -0.231342  0.557402
In [584]: store.select("df2_mt")
Out[584]: 
 C         D         E         F  foo
2000-01-01 -2.502042  0.668149  0.460708  1.834518  bar
2000-01-02  0.130441 -0.608465  0.439872  0.506364  bar
2000-01-03 -1.069546  1.236277  0.116634 -1.772519  bar
2000-01-04  0.137462  0.313939  0.748471 -0.943009  bar
2000-01-05  0.836517  2.049798  0.562167  0.189952  bar
2000-01-06  1.112750 -0.151596  1.503311  0.939470  bar
2000-01-07 -0.294348  0.335844 -0.794159  1.495614  bar
2000-01-08  0.860312 -0.538674 -0.541986 -1.759606  bar
# as a multiple
In [585]: store.select_as_multiple(
 .....:    ["df1_mt", "df2_mt"],
 .....:    where=["A>0", "B>0"],
 .....:    selector="df1_mt",
 .....: )
 .....: 
Out[585]: 
Empty DataFrame
Columns: [A, B, C, D, E, F, foo]
Index: []
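The column-splitting rule and the row synchronization described for `dropna` can be sketched in memory. The helper `split_columns` below is hypothetical and is not how pandas implements `append_to_multiple`; it only mirrors the documented behavior, with `None` meaning "all remaining columns" and all-NaN rows in any one part being dropped from every part.

```python
import numpy as np
import pandas as pd


def split_columns(df, d):
    """Hypothetical helper: split df into parts following the mapping d."""
    named = [col for cols in d.values() if cols is not None for col in cols]
    parts = {}
    for table, cols in d.items():
        if cols is None:  # None collects every column not named elsewhere
            cols = [col for col in df.columns if col not in named]
        parts[table] = df[cols]
    return parts


df_mt = pd.DataFrame(
    np.arange(48, dtype="float64").reshape(8, 6),
    index=pd.date_range("1/1/2000", periods=8),
    columns=list("ABCDEF"),
)
df_mt["foo"] = "bar"
df_mt.loc[df_mt.index[1], ["A", "B"]] = np.nan

parts = split_columns(df_mt, {"df1_mt": ["A", "B"], "df2_mt": None})

# Sketch of the dropna=True rule: a row that is entirely NaN in any part
# is removed from all parts so the tables stay aligned.
all_nan_somewhere = pd.concat(
    [part.isna().all(axis=1) for part in parts.values()], axis=1
).any(axis=1)
synced = {name: part[~all_nan_somewhere] for name, part in parts.items()}
print({name: len(part) for name, part in synced.items()})
```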

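Delete from a table

You can delete from a table selectively by specifying a where. In deleting rows, it is important to understand that PyTables deletes rows by erasing the rows, then moving the following data. Thus deleting can potentially be a very expensive operation depending on the orientation of your data. To get optimal performance, it is worthwhile to have the dimension you are deleting be the first of the indexables.

Data is ordered (on the disk) in terms of the indexables. Here is a simple use case. You store panel-type data, with dates in the major_axis and ids in the minor_axis. The data is then interleaved like this: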

date_1

id_1
id_2
.
id_n

date_2

id_1
.
id_n

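It should be clear that a delete operation on the major_axis will be fairly quick, as one chunk is removed, then the following data moved. On the other hand, a delete operation on the minor_axis will be very expensive. In this case it would almost certainly be faster to rewrite the table using a where that selects all but the missing data.

Warning

Please note that HDF5 does not reclaim space in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again will tend to increase the file size.

To repack and clean the file, use ptrepack.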

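Notes & caveats

Compression

PyTables allows the stored data to be compressed. This applies to all kinds of stores, not just tables. Two parameters are used to control compression: complevel and complib.

complevel specifies if and how hard data is to be compressed. complevel=0 and complevel=None disable compression, and 0 < complevel < 10 enables compression.
complib specifies which compression library to use. If nothing is specified, the default library zlib is used. A compression library usually optimizes for either good compression rates or speed, and the results will depend on the type of data. Which type of compression to choose depends on your specific needs and data. The list of supported compression libraries: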

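zlib: The default compression library. A classic in terms of compression; achieves good compression rates but is somewhat slow.
lzo: Fast compression and decompression.
bzip2: Good compression rates.
blosc: Fast compression and decompression. Support for alternative blosc compressors:

blosc:blosclz: This is the default compressor for blosc.
blosc:lz4: A compact, very popular and fast compressor.
blosc:lz4hc: A tweaked version of LZ4; produces better compression ratios at the expense of speed.
blosc:snappy: A popular compressor used in many places.
blosc:zlib: A classic; somewhat slower than the previous ones, but achieving better compression ratios.
blosc:zstd: An extremely well balanced codec; it provides the best compression ratios among the others above, and at reasonably fast speed.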

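If complib is defined as something other than the listed libraries, a ValueError exception is raised.

Note

If the library specified with the complib option is missing on your platform, compression defaults to zlib without further ado.

Enable compression for all objects within the file: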

store_compressed = pd.HDFStore(
    "store_compressed.h5", complevel=9, complib="blosc:blosclz"
)

Or use on-the-fly compression (this only applies to tables) in stores where compression is not enabled:

store.append("df", df, complib="zlib", complevel=5)

`ptrepack`

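PyTables offers better write performance when tables are compressed after they are written, as opposed to turning on compression at the very beginning. You can use the supplied PyTables utility ptrepack. In addition, ptrepack can change compression levels after the fact.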

ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5

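Furthermore, ptrepack in.h5 out.h5 will repack the file to allow you to reuse previously deleted space. Alternatively, one can simply remove the file and write again, or use the copy method.

#### Caveats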

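Warning

HDFStore is not threadsafe for writing. The underlying PyTables only supports concurrent reads (via threading or processes). If you need reading and writing at the same time, you need to serialize these operations in a single thread in a single process. You will corrupt your data otherwise. See (GH 2397) for more information.

If you use locks to manage write access between multiple processes, you may want to use fsync() before releasing write locks. For convenience you can use store.flush(fsync=True) to do this for you.
Once a table is created, its columns (DataFrame) are fixed; only exactly the same columns can be appended.
Be aware that timezones (e.g., pytz.timezone('US/Eastern')) are not necessarily equal across timezone versions. So if data is localized to a specific timezone in the HDFStore using one version of a timezone library and that data is updated with another version, the data will be converted to UTC since these timezones are not considered equal. Either use the same version of the timezone library or use tz_convert with the updated timezone definition.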

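Warning

PyTables will show a NaturalNameWarning if a column name cannot be used as an attribute selector. Natural identifiers contain only letters, numbers, and underscores, and may not begin with a number. Other identifiers cannot be used in a where clause and are generally a bad idea.

### Data types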

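HDFStore will map an object dtype to the PyTables underlying dtype. This means the following types are known to work:

| Type | Represents missing values |
| --- | --- |
| floating: `float64, float32, float16` | `np.nan` |
| integer: `int64, int32, int8, uint64, uint32, uint8` | |
| boolean | |
| `datetime64[ns]` | `NaT` |
| `timedelta64[ns]` | `NaT` |
| categorical: see the section below | |
| object: `strings` | `np.nan` |

unicode columns are not supported, and will fail.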

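`Categorical data`

You can write data that contains category dtypes to a HDFStore. Queries work the same as if it was an object array. However, the category dtyped data is stored in a more efficient manner.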

In [586]: dfcat = pd.DataFrame(
 .....:    {"A": pd.Series(list("aabbcdba")).astype("category"), "B": np.random.randn(8)}
 .....: )
 .....: 
In [587]: dfcat
Out[587]: 
 A         B
0  a -1.520478
1  a -1.069391
2  b -0.551981
3  b  0.452407
4  c  0.409257
5  d  0.301911
6  b -0.640843
7  a -2.253022
In [588]: dfcat.dtypes
Out[588]: 
A    category
B     float64
dtype: object
In [589]: cstore = pd.HDFStore("cats.h5", mode="w")
In [590]: cstore.append("dfcat", dfcat, format="table", data_columns=["A"])
In [591]: result = cstore.select("dfcat", where="A in ['b', 'c']")
In [592]: result
Out[592]: 
 A         B
2  b -0.551981
3  b  0.452407
4  c  0.409257
6  b -0.640843
In [593]: result.dtypes
Out[593]: 
A    category
B     float64
dtype: object

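`String columns`

min_itemsize

The underlying implementation of HDFStore uses a fixed column width (itemsize) for string columns. A string column's itemsize is calculated as the maximum length of the data (for that column) that is passed to the HDFStore in the first append. Subsequent appends may introduce a string for a column larger than the column can hold, in which case an Exception will be raised (otherwise you could have a silent truncation of these columns, leading to loss of information). In the future we may relax this and allow a user-specified truncation to occur.

Pass min_itemsize on the first table creation to specify a priori the minimum length of a particular string column. min_itemsize can be an integer, or a dict mapping a column name to an integer. You can pass values as a key to allow all indexables or data_columns to have this min_itemsize.

Passing a min_itemsize dict will cause all passed columns to be created as data_columns automatically.

Note

If you are not passing any data_columns, then the min_itemsize will be the maximum of the length of any string passed.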
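The fixed-width rule can be illustrated without a store. This is a rough sketch under the assumption that width is measured in UTF-8 encoded bytes; the helper `max_bytes` and the two frames are hypothetical, and the code only computes widths rather than reproducing HDFStore's actual check.

```python
import pandas as pd

first = pd.DataFrame({"A": ["foo", "bar"]})        # data in the first append
later = pd.DataFrame({"A": ["foo", "foobarbaz"]})  # data in a later append


def max_bytes(series, encoding="utf-8"):
    """Hypothetical helper: longest encoded string in the series, in bytes."""
    return int(series.map(lambda s: len(s.encode(encoding))).max())


itemsize = max_bytes(first["A"])   # width implied by the first append
needed = max_bytes(later["A"])     # width the later append would need
fits = needed <= itemsize

# Reserving room up front, in the spirit of min_itemsize=30
reserved = max(itemsize, 30)
fits_reserved = needed <= reserved
print(itemsize, needed, fits, fits_reserved)
```

When `fits` is false, the later append corresponds to the case where an exception is raised; reserving a larger width on the first append avoids it.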

In [594]: dfs = pd.DataFrame({"A": "foo", "B": "bar"}, index=list(range(5)))
In [595]: dfs
Out[595]: 
 A    B
0  foo  bar
1  foo  bar
2  foo  bar
3  foo  bar
4  foo  bar
# A and B have a size of 30
In [596]: store.append("dfs", dfs, min_itemsize=30)
In [597]: store.get_storer("dfs").table
Out[597]: 
/dfs/table (Table(5,)) ''
 description := {
 "index": Int64Col(shape=(), dflt=0, pos=0),
 "values_block_0": StringCol(itemsize=30, shape=(2,), dflt=b'', pos=1)}
 byteorder := 'little'
 chunkshape := (963,)
 autoindex := True
 colindexes := {
 "index": Index(6, mediumshuffle, zlib(1)).is_csi=False}
# A is created as a data_column with a size of 30
# B size is calculated
In [598]: store.append("dfs2", dfs, min_itemsize={"A": 30})
In [599]: store.get_storer("dfs2").table
Out[599]: 
/dfs2/table (Table(5,)) ''
 description := {
 "index": Int64Col(shape=(), dflt=0, pos=0),
 "values_block_0": StringCol(itemsize=3, shape=(1,), dflt=b'', pos=1),
 "A": StringCol(itemsize=30, shape=(), dflt=b'', pos=2)}
 byteorder := 'little'
 chunkshape := (1598,)
 autoindex := True
 colindexes := {
 "index": Index(6, mediumshuffle, zlib(1)).is_csi=False,
 "A": Index(6, mediumshuffle, zlib(1)).is_csi=False}

nan_rep

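String columns will serialize a np.nan (a missing value) with the nan_rep string representation. This defaults to the string value nan. You could therefore inadvertently turn an actual string value "nan" into a missing value.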

In [600]: dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]})
In [601]: dfss
Out[601]: 
 A
0  foo
1  bar
2  nan
In [602]: store.append("dfss", dfss)
In [603]: store.select("dfss")
Out[603]: 
 A
0  foo
1  bar
2  NaN
# here you need to specify a different nan rep
In [604]: store.append("dfss2", dfss, nan_rep="_nan_")
In [605]: store.select("dfss2")
Out[605]: 
 A
0  foo
1  bar
2  nan
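The collision shown above follows from a simple write/read substitution. The sketch below models it in memory with a hypothetical `roundtrip` helper; it is an illustration of the described behavior, not the HDFStore code path.

```python
import numpy as np
import pandas as pd


def roundtrip(series, nan_rep="nan"):
    """Hypothetical model: NaN -> nan_rep on write, nan_rep -> NaN on read."""
    stored = series.fillna(nan_rep)
    return stored.where(stored != nan_rep, np.nan)


s = pd.Series(["foo", "bar", "nan"], dtype=object)

default = roundtrip(s)                 # the literal "nan" comes back missing
custom = roundtrip(s, nan_rep="_nan_")  # a distinct marker keeps it intact
print(default.isna().tolist(), custom.tolist())
```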

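`Performance`

The tables format comes with a writing performance penalty as compared to fixed stores. The benefit is the ability to append/delete and query (potentially very large amounts of data). Write times are generally longer as compared with regular stores. Query times can be quite fast, especially on an indexed axis.
You can pass chunksize=<int> to append, specifying the write chunksize (default is 50000). This will significantly lower your memory usage on writing.
You can pass expectedrows=<int> to the first append, to set the total number of rows that PyTables will expect. This will optimize read/write performance.
Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs).
A PerformanceWarning will be raised if you are attempting to store types that will be pickled by PyTables (rather than stored as endemic types). See here for more information and some solutions.

## Feather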

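Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy.

Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as categorical and datetime with tz.

Several caveats:

The format will not write an Index or MultiIndex for the DataFrame and will raise an error if a non-default one is provided. You can use .reset_index() to store the index or .reset_index(drop=True) to ignore it.
Duplicate column names and non-string column names are not supported.
Actual Python objects in object dtype columns are not supported. These will raise a helpful error message on an attempt at serialization.

See the full documentation.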
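The two `reset_index` options mentioned in the caveats behave as follows. This small sketch uses a hypothetical frame and stops short of calling `to_feather`, so it runs without a Feather backend installed.

```python
import pandas as pd

# Hypothetical frame with a non-default, named index
df = pd.DataFrame(
    {"value": [1, 2, 3]}, index=pd.Index(["x", "y", "z"], name="key")
)

kept = df.reset_index()              # index is preserved as an ordinary column
dropped = df.reset_index(drop=True)  # index is discarded

print(list(kept.columns), list(dropped.columns))
```

Both results carry a default RangeIndex, which is the form the caveat above asks for before serialization.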

In [606]: df = pd.DataFrame(
 .....:    {
 .....:        "a": list("abc"),
 .....:        "b": list(range(1, 4)),
 .....:        "c": np.arange(3, 6).astype("u1"),
 .....:        "d": np.arange(4.0, 7.0, dtype="float64"),
 .....:        "e": [True, False, True],
 .....:        "f": pd.Categorical(list("abc")),
 .....:        "g": pd.date_range("20130101", periods=3),
 .....:        "h": pd.date_range("20130101", periods=3, tz="US/Eastern"),
 .....:        "i": pd.date_range("20130101", periods=3, freq="ns"),
 .....:    }
 .....: )
 .....: 
In [607]: df
Out[607]: 
 a  b  c  ...          g                         h                             i
0  a  1  3  ... 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
1  b  2  4  ... 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-01 00:00:00.000000001
2  c  3  5  ... 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-01 00:00:00.000000002
[3 rows x 9 columns]
In [608]: df.dtypes
Out[608]: 
a                        object
b                         int64
c                         uint8
d                       float64
e                          bool
f                      category
g                datetime64[ns]
h    datetime64[ns, US/Eastern]
i                datetime64[ns]
dtype: object

Write to a feather file.

In [609]: df.to_feather("example.feather")

Read from a feather file.

In [610]: result = pd.read_feather("example.feather")
In [611]: result
Out[611]: 
 a  b  c  ...          g                         h                             i
0  a  1  3  ... 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
1  b  2  4  ... 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-01 00:00:00.000000001
2  c  3  5  ... 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-01 00:00:00.000000002
[3 rows x 9 columns]
# we preserve dtypes
In [612]: result.dtypes
Out[612]: 
a                        object
b                         int64
c                         uint8
d                       float64
e                          bool
f                      category
g                datetime64[ns]
h    datetime64[ns, US/Eastern]
i                datetime64[ns]
dtype: object 
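## Parquet

[Apache Parquet](https://parquet.apache.org/) provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.

Parquet is designed to faithfully serialize and de-serialize `DataFrame`s, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.

Several caveats:

+   Duplicate column names and non-string column names are not supported.
+   The `pyarrow` engine always writes the index to the output, but `fastparquet` only writes non-default indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can force including or omitting indexes with the `index` argument, regardless of the underlying engine.
+   Index level names, if specified, must be strings.
+   In the `pyarrow` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype.
+   The `pyarrow` engine preserves the `ordered` flag of categorical dtypes with string types. `fastparquet` does not preserve the `ordered` flag.
+   Non-supported types include `Interval` and actual Python object types. These will raise a helpful error message on an attempt at serialization. The `Period` type is supported with pyarrow >= 0.16.0.
+   The `pyarrow` engine preserves extension data types such as the nullable integer and string data types (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols; see the extension types documentation).

You can specify an `engine` to direct the serialization. This can be one of `pyarrow`, `fastparquet`, or `auto`. If the engine is not specified, then the `pd.options.io.parquet.engine` option is checked; if this is also `auto`, then `pyarrow` is tried, falling back to `fastparquet`.

See the documentation for [pyarrow](https://arrow.apache.org/docs/python/) and [fastparquet](https://fastparquet.readthedocs.io/en/latest/).

Note

These engines are very similar and should read/write nearly identical parquet format files. `pyarrow>=8.0.0` supports timedelta data, and `fastparquet>=0.1.4` supports timezone-aware datetimes. These libraries differ by having different underlying dependencies (`fastparquet` uses `numba`, while `pyarrow` uses a C library).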
In [613]: df = pd.DataFrame(
 .....:    {
 .....:        "a": list("abc"),
 .....:        "b": list(range(1, 4)),
 .....:        "c": np.arange(3, 6).astype("u1"),
 .....:        "d": np.arange(4.0, 7.0, dtype="float64"),
 .....:        "e": [True, False, True],
 .....:        "f": pd.date_range("20130101", periods=3),
 .....:        "g": pd.date_range("20130101", periods=3, tz="US/Eastern"),
 .....:        "h": pd.Categorical(list("abc")),
 .....:        "i": pd.Categorical(list("abc"), ordered=True),
 .....:    }
 .....: )
 .....: 
In [614]: df
Out[614]: 
 a  b  c    d      e          f                         g  h  i
0  a  1  3  4.0   True 2013-01-01 2013-01-01 00:00:00-05:00  a  a
1  b  2  4  5.0  False 2013-01-02 2013-01-02 00:00:00-05:00  b  b
2  c  3  5  6.0   True 2013-01-03 2013-01-03 00:00:00-05:00  c  c
In [615]: df.dtypes
Out[615]: 
a                        object
b                         int64
c                         uint8
d                       float64
e                          bool
f                datetime64[ns]
g    datetime64[ns, US/Eastern]
h                      category
i                      category
dtype: object

Write to a parquet file.

In [616]: df.to_parquet("example_pa.parquet", engine="pyarrow")
In [617]: df.to_parquet("example_fp.parquet", engine="fastparquet")

Read from a parquet file.

In [618]: result = pd.read_parquet("example_fp.parquet", engine="fastparquet")
In [619]: result = pd.read_parquet("example_pa.parquet", engine="pyarrow")
In [620]: result.dtypes
Out[620]: 
a                        object
b                         int64
c                         uint8
d                       float64
e                          bool
f                datetime64[ns]
g    datetime64[ns, US/Eastern]
h                      category
i                      category
dtype: object

By setting the dtype_backend argument you can control the default dtypes used for the resulting DataFrame.

In [621]: result = pd.read_parquet("example_pa.parquet", engine="pyarrow", dtype_backend="pyarrow")
In [622]: result.dtypes
Out[622]: 
a                                      string[pyarrow]
b                                       int64[pyarrow]
c                                       uint8[pyarrow]
d                                      double[pyarrow]
e                                        bool[pyarrow]
f                               timestamp[ns][pyarrow]
g                timestamp[ns, tz=US/Eastern][pyarrow]
h    dictionary<values=string, indices=int32, order...
i    dictionary<values=string, indices=int32, order...
dtype: object

Note

Note that this is not supported for fastparquet.

Read only certain columns of a parquet file.

In [623]: result = pd.read_parquet(
 .....:    "example_fp.parquet",
 .....:    engine="fastparquet",
 .....:    columns=["a", "b"],
 .....: )
 .....: 
In [624]: result = pd.read_parquet(
 .....:    "example_pa.parquet",
 .....:    engine="pyarrow",
 .....:    columns=["a", "b"],
 .....: )
 .....: 
In [625]: result.dtypes
Out[625]: 
a    object
b     int64
dtype: object

Pandas 2.2 Official Tutorials and Guides, Chinese Edition (10.2) (Part 2): https://developer.aliyun.com/article/1509787
