Pandas 2.2 Official Tutorials and Guides in Chinese (Part 11.1) (4)

Pandas 2.2 Official Tutorials and Guides in Chinese (Part 11.1) (3): https://developer.aliyun.com/article/1509862

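The `in` and `not in` operators

query() also supports special use of Python's in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.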

# get all rows where columns "a" and "b" have overlapping values
In [267]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
 .....:                   'c': np.random.randint(5, size=12),
 .....:                   'd': np.random.randint(9, size=12)})
 .....: 
In [268]: df
Out[268]: 
 a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
In [269]: df.query('a in b')
Out[269]: 
 a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2
# How you'd do it in pure Python
In [270]: df[df['a'].isin(df['b'])]
Out[270]: 
 a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2
In [271]: df.query('a not in b')
Out[271]: 
 a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
# pure Python
In [272]: df[~df['a'].isin(df['b'])]
Out[272]: 
 a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

You can combine this with other expressions for very succinct queries:

# rows where cols a and b have overlapping values
# and col c's values are less than col d's
In [273]: df.query('a in b and c < d')
Out[273]: 
 a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
4  c  b  3  6
5  c  b  0  2
# pure Python
In [274]: df[df['a'].isin(df['b']) & (df['c'] < df['d'])]
Out[274]: 
 a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
4  c  b  3  6
5  c  b  0  2

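Note

Note that in and not in are evaluated in Python, since numexpr has no equivalent of this operation. However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the expression

df.query('a in b + c + d')

(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.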
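
As a minimal sketch of the note above: the evaluation engine can also be selected explicitly through the engine keyword, which DataFrame.query() forwards to pandas.eval(). The frame df_demo is a small made-up example (not the df used above). The snippet makes no assumption about numexpr being installed, because engine='python' is always available.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not the df from the examples above).
df_demo = pd.DataFrame({"a": list("aabbcc"), "b": list("aaaabb")})

# Force evaluation of the whole expression in plain Python.
via_query = df_demo.query("a in b", engine="python")

# The equivalent boolean-indexing form.
via_isin = df_demo[df_demo["a"].isin(df_demo["b"])]

# Both forms select the same rows.
print(via_query.equals(via_isin))
```

The isin form and the query form should agree regardless of which engine is used; only the place where the arithmetic parts are evaluated changes.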

Special use of the `==` operator with `list` objects

Comparing a list of values to a column using ==/!= works similarly to in/not in.

In [275]: df.query('b == ["a", "b", "c"]')
Out[275]: 
 a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
# pure Python
In [276]: df[df['b'].isin(["a", "b", "c"])]
Out[276]: 
 a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
In [277]: df.query('c == [1, 2]')
Out[277]: 
 a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2
In [278]: df.query('c != [1, 2]')
Out[278]: 
 a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6
# using in/not in
In [279]: df.query('[1, 2] in c')
Out[279]: 
 a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2
In [280]: df.query('[1, 2] not in c')
Out[280]: 
 a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6
# pure Python
In [281]: df[df['c'].isin([1, 2])]
Out[281]: 
 a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2

Boolean operators

You can negate boolean expressions with the word not or the ~ operator.

In [282]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [283]: df['bools'] = np.random.rand(len(df)) > 0.5
In [284]: df.query('~bools')
Out[284]: 
 a         b         c  bools
2  0.697753  0.212799  0.329209  False
7  0.275396  0.691034  0.826619  False
8  0.190649  0.558748  0.262467  False
In [285]: df.query('not bools')
Out[285]: 
 a         b         c  bools
2  0.697753  0.212799  0.329209  False
7  0.275396  0.691034  0.826619  False
8  0.190649  0.558748  0.262467  False
In [286]: df.query('not bools') == df[~df['bools']]
Out[286]: 
 a     b     c  bools
2  True  True  True   True
7  True  True  True   True
8  True  True  True   True

Of course, expressions can be arbitrarily complex too:

# short query syntax
In [287]: shorter = df.query('a < b < c and (not bools) or bools > 2')
# equivalent in pure Python
In [288]: longer = df[(df['a'] < df['b'])
 .....:            & (df['b'] < df['c'])
 .....:            & (~df['bools'])
 .....:            | (df['bools'] > 2)]
 .....: 
In [289]: shorter
Out[289]: 
 a         b         c  bools
7  0.275396  0.691034  0.826619  False
In [290]: longer
Out[290]: 
 a         b         c  bools
7  0.275396  0.691034  0.826619  False
In [291]: shorter == longer
Out[291]: 
 a     b     c  bools
7  True  True  True   True

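Performance of `query()`

DataFrame.query() using numexpr is slightly faster than Python for large frames.

You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 100,000 rows.

The plot that accompanied this section in the original documentation (not reproduced here) was created using a DataFrame with 3 columns, each containing floating point values generated using numpy.random.randn().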
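
A rough way to check the claim on your own machine is sketched below. The frame size (200,000 rows), the column names, and the query expression are arbitrary choices for illustration. The timings depend on hardware and on whether numexpr is installed, so no particular result is implied.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 200,000 rows and 3 float columns; the size is an arbitrary choice above the
# roughly 100,000-row threshold mentioned in the text.
big = pd.DataFrame(rng.standard_normal((200_000, 3)), columns=list("abc"))

start = time.perf_counter()
by_query = big.query("a < b and b < c")
query_seconds = time.perf_counter() - start

start = time.perf_counter()
by_mask = big[(big["a"] < big["b"]) & (big["b"] < big["c"])]
mask_seconds = time.perf_counter() - start

# Both forms select the same rows; only the timings may differ.
print(by_query.equals(by_mask))
print(f"query: {query_seconds:.4f}s, boolean mask: {mask_seconds:.4f}s")
```

A single run like this is noisy; repeating the measurement (for example with timeit) gives a more reliable comparison.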

In [292]: df = pd.DataFrame(np.random.randn(8, 4),
 .....:                  index=dates, columns=['A', 'B', 'C', 'D'])
 .....: 
In [293]: df2 = df.copy()

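Duplicate data

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.

duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
drop_duplicates removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify which rows to keep.

keep='first' (default): mark / drop duplicates except for the first occurrence.
keep='last': mark / drop duplicates except for the last occurrence.
keep=False: mark / drop all duplicates.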

In [294]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
 .....:                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
 .....:                    'c': np.random.randn(7)})
 .....: 
In [295]: df2
Out[295]: 
 a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329
In [296]: df2.duplicated('a')
Out[296]: 
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool
In [297]: df2.duplicated('a', keep='last')
Out[297]: 
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool
In [298]: df2.duplicated('a', keep=False)
Out[298]: 
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool
In [299]: df2.drop_duplicates('a')
Out[299]: 
 a  b         c
0    one  x -1.067137
2    two  x -0.211056
5  three  x -1.964475
6   four  x  1.298329
In [300]: df2.drop_duplicates('a', keep='last')
Out[300]: 
 a  b         c
1    one  y  0.309500
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329
In [301]: df2.drop_duplicates('a', keep=False)
Out[301]: 
 a  b         c
5  three  x -1.964475
6   four  x  1.298329

Also, you can pass a list of columns to identify duplicates.

In [302]: df2.duplicated(['a', 'b'])
Out[302]: 
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool
In [303]: df2.drop_duplicates(['a', 'b'])
Out[303]: 
 a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
5  three  x -1.964475
6   four  x  1.298329

To drop duplicates by index value, use Index.duplicated and then perform slicing. The same options are available for the keep parameter.

In [304]: df3 = pd.DataFrame({'a': np.arange(6),
 .....:                    'b': np.random.randn(6)},
 .....:                   index=['a', 'a', 'b', 'c', 'b', 'a'])
 .....: 
In [305]: df3
Out[305]: 
 a         b
a  0  1.440455
a  1  2.456086
b  2  1.038402
c  3 -0.894409
b  4  0.683536
a  5  3.082764
In [306]: df3.index.duplicated()
Out[306]: array([False,  True, False, False,  True,  True])
In [307]: df3[~df3.index.duplicated()]
Out[307]: 
 a         b
a  0  1.440455
b  2  1.038402
c  3 -0.894409
In [308]: df3[~df3.index.duplicated(keep='last')]
Out[308]: 
 a         b
c  3 -0.894409
b  4  0.683536
a  5  3.082764
In [309]: df3[~df3.index.duplicated(keep=False)]
Out[309]: 
 a         b
c  3 -0.894409

Dictionary-like `get()` method

Both Series and DataFrame have a get method, which can return a default value.

In [310]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [311]: s.get('a')  # equivalent to s['a']
Out[311]: 1
In [312]: s.get('x', default=-1)
Out[312]: -1 
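
The session above only exercises Series.get. A minimal sketch of the DataFrame counterpart follows, using a made-up frame (frame_demo is not part of the original guide). For a DataFrame, the key is a column label.

```python
import pandas as pd

frame_demo = pd.DataFrame({"x": [1, 2], "y": [3, 4]})  # hypothetical frame

col = frame_demo.get("x")          # same result as frame_demo["x"]
missing = frame_demo.get("z")      # no such column: returns None instead of raising
fallback = frame_demo.get("z", default="no such column")

print(col.tolist())
print(missing)
print(fallback)
```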
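
Looking up values by index/column labels

Sometimes you want to extract a set of values given a sequence of row labels and column labels; this can be achieved with `pandas.factorize` and NumPy indexing. For example:
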
In [313]: df = pd.DataFrame({'col': ["A", "A", "B", "B"],
 .....:                   'A': [80, 23, np.nan, 22],
 .....:                   'B': [80, 55, 76, 67]})
 .....: 
In [314]: df
Out[314]: 
 col     A   B
0   A  80.0  80
1   A  23.0  55
2   B   NaN  76
3   B  22.0  67
In [315]: idx, cols = pd.factorize(df['col'])
In [316]: df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Out[316]: array([80., 23., 76., 67.])
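
To see why the indexing above works, it can help to look at what pandas.factorize returns. The sketch below rebuilds the same data under new names (df_lookup, codes, and uniques are illustrative names, not from the original) and compares the vectorized lookup with a plain Python loop.

```python
import numpy as np
import pandas as pd

df_lookup = pd.DataFrame({"col": ["A", "A", "B", "B"],
                          "A": [80, 23, np.nan, 22],
                          "B": [80, 55, 76, 67]})

# codes: for each row, the position of its label among the unique labels.
# uniques: the distinct labels, in order of first appearance.
codes, uniques = pd.factorize(df_lookup["col"])
print(codes.tolist())
print(list(uniques))

# Reorder the columns to match uniques, then pick one column position per row.
vectorized = df_lookup.reindex(uniques, axis=1).to_numpy()[np.arange(len(df_lookup)), codes]

# The same lookup written as an explicit (slower) loop over row and column labels.
looped = [df_lookup.loc[i, c] for i, c in zip(df_lookup.index, df_lookup["col"])]

print(vectorized.tolist() == looped)
```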

Formerly this could be achieved with the dedicated DataFrame.lookup method, which was deprecated in version 1.2.0 and removed in version 2.0.0.

Index objects

The Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed.
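
A small sketch of what "ordered multiset" means in practice, using made-up labels: the Index keeps its elements in the order given, keeps repeated labels, and a label-based lookup on a repeated label returns every match.

```python
import pandas as pd

dup_index = pd.Index(["a", "b", "a"])  # the repeated "a" is kept, order is preserved

print(list(dup_index))
print(dup_index.is_unique)
print(dup_index.has_duplicates)

# Selecting a repeated label returns all matching entries.
s_dup = pd.Series([1, 2, 3], index=dup_index)
matches = s_dup["a"]
print(matches.tolist())
```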

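Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index: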

In [317]: index = pd.Index(['e', 'd', 'a', 'b'])
In [318]: index
Out[318]: Index(['e', 'd', 'a', 'b'], dtype='object')
In [319]: 'd' in index
Out[319]: True

or using numbers:

In [320]: index = pd.Index([1, 5, 12])
In [321]: index
Out[321]: Index([1, 5, 12], dtype='int64')
In [322]: 5 in index
Out[322]: True

If no dtype is given, Index tries to infer the dtype from the data. It is also possible to give an explicit dtype when instantiating an Index:

In [323]: index = pd.Index(['e', 'd', 'a', 'b'], dtype="string")
In [324]: index
Out[324]: Index(['e', 'd', 'a', 'b'], dtype='string')
In [325]: index = pd.Index([1, 5, 12], dtype="int8")
In [326]: index
Out[326]: Index([1, 5, 12], dtype='int8')
In [327]: index = pd.Index([1, 5, 12], dtype="float32")
In [328]: index
Out[328]: Index([1.0, 5.0, 12.0], dtype='float32')

You can also pass a name to be stored in the index:

In [329]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')
In [330]: index.name
Out[330]: 'something'

The name, if set, will be shown in the console display:

In [331]: index = pd.Index(list(range(5)), name='rows')
In [332]: columns = pd.Index(['A', 'B', 'C'], name='cols')
In [333]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)
In [334]: df
Out[334]: 
cols         A         B         C
rows 
0     1.295989 -1.051694  1.340429
1    -2.366110  0.428241  0.387275
2     0.433306  0.929548  0.278094
3     2.154730 -0.315628  0.264223
4     1.126818  1.132290 -0.353310
In [335]: df['A']
Out[335]: 
rows
0    1.295989
1   -2.366110
2    0.433306
3    2.154730
4    1.126818
Name: A, dtype: float64

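Setting metadata

Indexes are "mostly immutable", but it is possible to set and change their name attribute. You can use rename and set_names to set these attributes directly; they return a copy by default.

See Advanced Indexing for usage of MultiIndexes.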

In [336]: ind = pd.Index([1, 2, 3])
In [337]: ind.rename("apple")
Out[337]: Index([1, 2, 3], dtype='int64', name='apple')
In [338]: ind
Out[338]: Index([1, 2, 3], dtype='int64')
In [339]: ind = ind.set_names(["apple"])
In [340]: ind.name = "bob"
In [341]: ind
Out[341]: Index([1, 2, 3], dtype='int64', name='bob')

set_names, set_levels, and set_codes also take an optional level argument:

In [342]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
In [343]: index
Out[343]: 
MultiIndex([(0, 'one'),
 (0, 'two'),
 (1, 'one'),
 (1, 'two'),
 (2, 'one'),
 (2, 'two')],
 names=['first', 'second'])
In [344]: index.levels[1]
Out[344]: Index(['one', 'two'], dtype='object', name='second')
In [345]: index.set_levels(["a", "b"], level=1)
Out[345]: 
MultiIndex([(0, 'a'),
 (0, 'b'),
 (1, 'a'),
 (1, 'b'),
 (2, 'a'),
 (2, 'b')],
 names=['first', 'second']) 
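
The session above uses level with set_levels. A comparable sketch for set_names follows; the MultiIndex is rebuilt here under the illustrative name mi, and "number" is an arbitrary new name.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([range(3), ["one", "two"]], names=["first", "second"])

# Rename only level 0; a new MultiIndex is returned and mi itself is unchanged.
renamed = mi.set_names("number", level=0)

print(list(renamed.names))
print(list(mi.names))
```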

Set operations on Index objects

The two main operations are `union` and `intersection`. Difference is provided via the `.difference()` method.

In [346]: a = pd.Index(['c', 'b', 'a'])
In [347]: b = pd.Index(['c', 'e', 'd'])
In [348]: a.difference(b)
Out[348]: Index(['a', 'b'], dtype='object')

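Also available is the symmetric_difference operation, which returns elements that appear in either idx1 or idx2, but not in both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)), with duplicates dropped.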

In [349]: idx1 = pd.Index([1, 2, 3, 4])
In [350]: idx2 = pd.Index([2, 3, 4, 5])
In [351]: idx1.symmetric_difference(idx2)
Out[351]: Index([1, 5], dtype='int64')

Note

The resulting index from a set operation will be sorted in ascending order.
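
A short sketch of union and intersection, reusing the same values as the a and b indexes from the difference example. Note how the union comes back sorted even though neither input is.

```python
import pandas as pd

a = pd.Index(["c", "b", "a"])
b = pd.Index(["c", "e", "d"])

combined = a.union(b)        # every label that appears in either index
common = a.intersection(b)   # only the labels that appear in both

print(list(combined))  # ['a', 'b', 'c', 'd', 'e']
print(list(common))    # ['c']
```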

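When performing Index.union() between indexes with different dtypes, the indexes must be cast to a common dtype. Typically, though not always, this is object dtype. The exception is when performing a union between integer and float data. In this case, the integer values are converted to float: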

In [352]: idx1 = pd.Index([0, 1, 2])
In [353]: idx2 = pd.Index([0.5, 1.5])
In [354]: idx1.union(idx2)
Out[354]: Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64') 
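
For contrast with the integer/float case, here is a sketch of the more typical outcome, where no common numeric dtype exists and the result falls back to object dtype. The values are made up, and sort=False is passed as an assumption-free way to avoid comparing integers with strings while sorting.

```python
import pandas as pd

ints = pd.Index([1, 2])
strs = pd.Index(["x", "y"])

# Integers and strings share no common dtype other than object.
mixed = ints.union(strs, sort=False)

print(mixed.dtype)
```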
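
Missing values

Important

Even though `Index` can hold missing values (`NaN`), it should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.

`Index.fillna` fills missing values with a specified scalar value.
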
In [355]: idx1 = pd.Index([1, np.nan, 3, 4])
In [356]: idx1
Out[356]: Index([1.0, nan, 3.0, 4.0], dtype='float64')
In [357]: idx1.fillna(2)
Out[357]: Index([1.0, 2.0, 3.0, 4.0], dtype='float64')
In [358]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
 .....:                         pd.NaT,
 .....:                         pd.Timestamp('2011-01-03')])
 .....: 
In [359]: idx2
Out[359]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)
In [360]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[360]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)

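Set / reset index

Occasionally you will load or create a data set into a DataFrame and want to add an index after you've already done so. There are a couple of different ways.

Set an index

DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). To create a new, re-indexed DataFrame: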

In [361]: data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
 .....:                     'b': ['one', 'two', 'one', 'two'],
 .....:                     'c': ['z', 'y', 'x', 'w'],
 .....:                     'd': [1., 2., 3, 4]})
 .....: 
In [362]: data
Out[362]: 
 a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0
In [363]: indexed1 = data.set_index('c')
In [364]: indexed1
Out[364]: 
 a    b    d
c 
z  bar  one  1.0
y  bar  two  2.0
x  foo  one  3.0
w  foo  two  4.0
In [365]: indexed2 = data.set_index(['a', 'b'])
In [366]: indexed2
Out[366]: 
 c    d
a   b 
bar one  z  1.0
 two  y  2.0
foo one  x  3.0
 two  w  4.0

The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:

In [367]: frame = data.set_index('c', drop=False)
In [368]: frame = frame.set_index(['a', 'b'], append=True)
In [369]: frame
Out[369]: 
 c    d
c a   b 
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

Other options in set_index allow you to not drop the index columns.

In [370]: data.set_index('c', drop=False)
Out[370]: 
 a    b  c    d
c 
z  bar  one  z  1.0
y  bar  two  y  2.0
x  foo  one  x  3.0
w  foo  two  w  4.0

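Reset the index

As a convenience, there is a function on DataFrame called reset_index() which transfers the index values into the DataFrame's columns and sets a simple integer index. This is the inverse operation of set_index().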

In [371]: data
Out[371]: 
 a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0
In [372]: data.reset_index()
Out[372]: 
 index    a    b  c    d
0      0  bar  one  z  1.0
1      1  bar  two  y  2.0
2      2  foo  one  x  3.0
3      3  foo  two  w  4.0

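The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute.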
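
A minimal sketch of both points, with a made-up two-row frame (data_demo is illustrative): the index name set by set_index() becomes the column name after reset_index(), and the round trip restores the original frame.

```python
import pandas as pd

data_demo = pd.DataFrame({"c": ["z", "y"], "d": [1.0, 2.0]})  # hypothetical data

indexed = data_demo.set_index("c")
print(list(indexed.index.names))   # the index remembers the column it came from

restored = indexed.reset_index()
print(list(restored.columns))      # the index name is used as the column name
print(restored.equals(data_demo))  # set_index followed by reset_index is a round trip here
```

The round trip is exact in this case because "c" was the first column; in general reset_index() inserts the former index levels at the front of the columns.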

You can use the level keyword to remove only a portion of the index:

In [373]: frame
Out[373]: 
 c    d
c a   b 
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0
In [374]: frame.reset_index(level=1)
Out[374]: 
 a  c    d
c b 
z one  bar  z  1.0
y two  bar  y  2.0
x one  foo  x  3.0
w two  foo  w  4.0

reset_index takes an optional parameter drop which, if true, simply discards the index instead of putting the index values in the DataFrame's columns.
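
A short sketch of the drop option, using a made-up frame with a named index (indexed_demo is illustrative):

```python
import pandas as pd

indexed_demo = pd.DataFrame({"d": [1.0, 2.0, 3.0]},
                            index=pd.Index(["z", "y", "x"], name="c"))

# drop=True throws the old index away instead of turning it into a column.
dropped = indexed_demo.reset_index(drop=True)

print(list(dropped.columns))  # ['d']
print(list(dropped.index))    # [0, 1, 2]
```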

Adding an ad hoc index

You can assign a custom index to the index attribute:

In [375]: df_idx = pd.DataFrame(range(4))
In [376]: df_idx.index = pd.Index([10, 20, 30, 40], name="a")
In [377]: df_idx
Out[377]: 
 0
a 
10  0
20  1
30  2
40  3

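Returning a view versus a copy

Warning

Copy-on-Write will become the new default in pandas 3.0. This means that chained indexing will never work. As a consequence, the SettingWithCopyWarning won't be necessary anymore. See the Copy-on-Write documentation for more context. We recommend turning Copy-on-Write on to leverage the improvements with

pd.options.mode.copy_on_write = True

even before pandas 3.0 is available.

When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.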

In [378]: dfmi = pd.DataFrame([list('abcd'),
 .....:                     list('efgh'),
 .....:                     list('ijkl'),
 .....:                     list('mnop')],
 .....:                    columns=pd.MultiIndex.from_product([['one', 'two'],
 .....:                                                        ['first', 'second']]))
 .....: 
In [379]: dfmi
Out[379]: 
 one          two 
 first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p

Compare these two access methods:

In [380]: dfmi['one']['second']
Out[380]: 
0    b
1    f
2    j
3    n
Name: second, dtype: object

In [381]: dfmi.loc[:, ('one', 'second')]
Out[381]: 
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

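These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).

dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation dfmi_with_one['second'] selects the series indexed by 'second'. This is indicated by the variable dfmi_with_one because pandas sees these operations as separate events, e.g. separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.

Contrast this to dfmi.loc[:,('one','second')], which passes a nested tuple of (slice(None),('one','second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be faster, and allows one to index both axes if so desired.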
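
The difference in what Python actually passes to __getitem__ can be made visible with a toy object. KeyRecorder below is a hypothetical stand-in written for this illustration and is not a pandas class. The second half of the sketch shows a single-call .loc assignment on a smaller copy of the dfmi layout (dfmi_demo is an illustrative name).

```python
import pandas as pd


class KeyRecorder:
    """Toy object (hypothetical, not part of pandas) that logs __getitem__ keys."""

    def __init__(self, log):
        self.log = log

    def __getitem__(self, key):
        self.log.append(key)
        return self  # returning self lets the indexing be chained


chained_log, single_log = [], []
KeyRecorder(chained_log)["one"]["second"]      # chained: two separate calls
KeyRecorder(single_log)[:, ("one", "second")]  # single call with a nested tuple

print(chained_log)
print(single_log)

# With a real frame, one .loc call receives both the row and column keys,
# so pandas can assign into the original object directly.
dfmi_demo = pd.DataFrame(
    [list("abcd"), list("efgh")],
    columns=pd.MultiIndex.from_product([["one", "two"], ["first", "second"]]),
)
dfmi_demo.loc[0, ("one", "second")] = "z"
print(dfmi_demo.loc[0, ("one", "second")])
```

In the chained form the second call only ever sees the intermediate object returned by the first, which is why pandas cannot treat the two steps as one operation.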
