Pandas 2.2 Official Tutorials and Guides, Chinese Edition (11.2) (4) - Alibaba Cloud Developer Community

Pandas 2.2 Official Tutorials and Guides, Chinese Edition (11.2) (3): https://developer.aliyun.com/article/1509871

Index objects

The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed.
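A minimal sketch of the multiset behaviour, assuming pandas is imported as pd. The labels are made up for illustration. It checks for repeated labels and looks up a label that occurs only once:

```python
import pandas as pd

# An Index keeps insertion order and may hold the same label more than once.
idx = pd.Index(["b", "a", "b"])

print(idx.is_unique)               # False, because "b" appears twice
print(idx.duplicated().tolist())   # flags repeats after the first occurrence
print(idx.unique().tolist())       # distinct labels in first-seen order
print(idx.get_loc("a"))            # a label that occurs once maps to one position
```

Lookups on a label that occurs more than once return something other than a single integer, which is one reason to check is_unique before relying on positional results.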

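Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index: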

In [317]: index = pd.Index(['e', 'd', 'a', 'b'])
In [318]: index
Out[318]: Index(['e', 'd', 'a', 'b'], dtype='object')
In [319]: 'd' in index
Out[319]: True

or using numbers:

In [320]: index = pd.Index([1, 5, 12])
In [321]: index
Out[321]: Index([1, 5, 12], dtype='int64')
In [322]: 5 in index
Out[322]: True

If no dtype is given, Index tries to infer the dtype from the data. It is also possible to give an explicit dtype when instantiating an Index:

In [323]: index = pd.Index(['e', 'd', 'a', 'b'], dtype="string")
In [324]: index
Out[324]: Index(['e', 'd', 'a', 'b'], dtype='string')
In [325]: index = pd.Index([1, 5, 12], dtype="int8")
In [326]: index
Out[326]: Index([1, 5, 12], dtype='int8')
In [327]: index = pd.Index([1, 5, 12], dtype="float32")
In [328]: index
Out[328]: Index([1.0, 5.0, 12.0], dtype='float32')
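As a small illustration of inference versus an explicit dtype, the sketch below builds three indexes from similar data and inspects their dtype attribute. The variable names are hypothetical:

```python
import pandas as pd

# dtype inferred from the data
inferred_int = pd.Index([1, 5, 12])
inferred_float = pd.Index([1.5, 5, 12])   # one float makes the whole Index float

# dtype requested explicitly
explicit_small = pd.Index([1, 5, 12], dtype="int8")

print(inferred_int.dtype)
print(inferred_float.dtype)
print(explicit_small.dtype)
```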

You can also pass a name to be stored in the index:

In [329]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')
In [330]: index.name
Out[330]: 'something'

The name, if set, will be shown in the console display:

In [331]: index = pd.Index(list(range(5)), name='rows')
In [332]: columns = pd.Index(['A', 'B', 'C'], name='cols')
In [333]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)
In [334]: df
Out[334]: 
cols         A         B         C
rows 
0     1.295989 -1.051694  1.340429
1    -2.366110  0.428241  0.387275
2     0.433306  0.929548  0.278094
3     2.154730 -0.315628  0.264223
4     1.126818  1.132290 -0.353310
In [335]: df['A']
Out[335]: 
rows
0    1.295989
1   -2.366110
2    0.433306
3    2.154730
4    1.126818
Name: A, dtype: float64

Setting metadata

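Indexes are "mostly immutable", but it is possible to set and change their name attribute. You can use rename and set_names to set these attributes directly; by default they return a copy.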

See Advanced Indexing for usage of MultiIndexes.

In [336]: ind = pd.Index([1, 2, 3])
In [337]: ind.rename("apple")
Out[337]: Index([1, 2, 3], dtype='int64', name='apple')
In [338]: ind
Out[338]: Index([1, 2, 3], dtype='int64')
In [339]: ind = ind.set_names(["apple"])
In [340]: ind.name = "bob"
In [341]: ind
Out[341]: Index([1, 2, 3], dtype='int64', name='bob')
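The copy-returning behaviour can be checked directly. The sketch below is illustrative only; the names renamed, mi, and relabeled are not from the original text. It confirms that rename leaves the original Index untouched, and that set_names with level changes one level name of a MultiIndex:

```python
import pandas as pd

ind = pd.Index([1, 2, 3])
renamed = ind.rename("apple")      # returns a new Index
print(ind.name, renamed.name)      # the original still has no name

mi = pd.MultiIndex.from_product(
    [range(2), ["one", "two"]], names=["first", "second"]
)
# Rename only the level currently called "second"; mi itself is unchanged.
relabeled = mi.set_names("number", level="second")
print(list(mi.names), list(relabeled.names))
```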

set_names, set_levels, and set_codes also take an optional level argument.

In [342]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
In [343]: index
Out[343]: 
MultiIndex([(0, 'one'),
            (0, 'two'),
            (1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['first', 'second'])
In [344]: index.levels[1]
Out[344]: Index(['one', 'two'], dtype='object', name='second')
In [345]: index.set_levels(["a", "b"], level=1)
Out[345]: 
MultiIndex([(0, 'a'),
            (0, 'b'),
            (1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['first', 'second'])

Set operations on Index objects

The two main operations are union and intersection. Difference is provided via the .difference() method.

In [346]: a = pd.Index(['c', 'b', 'a'])
In [347]: b = pd.Index(['c', 'e', 'd'])
In [348]: a.difference(b)
Out[348]: Index(['a', 'b'], dtype='object')
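The transcript above only shows difference. As a hedged companion sketch, the same two indexes can be passed through union and intersection as well:

```python
import pandas as pd

a = pd.Index(["c", "b", "a"])
b = pd.Index(["c", "e", "d"])

combined = a.union(b)          # every label from either index
shared = a.intersection(b)     # labels present in both
only_a = a.difference(b)       # labels in a that are not in b

print(combined.tolist())
print(shared.tolist())
print(only_a.tolist())
```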

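Also available is the symmetric_difference operation, which returns elements that appear in either idx1 or idx2, but not in both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)), with duplicates dropped.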

In [349]: idx1 = pd.Index([1, 2, 3, 4])
In [350]: idx2 = pd.Index([2, 3, 4, 5])
In [351]: idx1.symmetric_difference(idx2)
Out[351]: Index([1, 5], dtype='int64')
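The stated equivalence can be verified on the same data. This is a small check, not part of the original guide, and the names sym and manual are made up:

```python
import pandas as pd

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([2, 3, 4, 5])

sym = idx1.symmetric_difference(idx2)
# Build the same result from two one-sided differences.
manual = idx1.difference(idx2).union(idx2.difference(idx1))

print(sym.tolist(), manual.tolist())
print(sym.equals(manual))
```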

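Note

The resulting index from a set operation will be sorted in ascending order.

When performing Index.union() between indexes with different dtypes, the indexes must be cast to a common dtype. Typically, though not always, this is object dtype. The exception is when performing a union between integer and float data. In this case, the integer values are converted to float: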

In [352]: idx1 = pd.Index([0, 1, 2])
In [353]: idx2 = pd.Index([0.5, 1.5])
In [354]: idx1.union(idx2)
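Out[354]: Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')

Missing values

Important

Even though Index can hold missing values (NaN), it should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.

Index.fillna fills missing values with the specified scalar value.

In [355]: idx1 = pd.Index([1, np.nan, 3, 4])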
In [356]: idx1
Out[356]: Index([1.0, nan, 3.0, 4.0], dtype='float64')
In [357]: idx1.fillna(2)
Out[357]: Index([1.0, 2.0, 3.0, 4.0], dtype='float64')
In [358]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
 .....:                         pd.NaT,
 .....:                         pd.Timestamp('2011-01-03')])
 .....: 
In [359]: idx2
Out[359]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)
In [360]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[360]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None) 
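Before filling, it can help to find out whether an Index holds missing values at all. The sketch below uses hasnans, isna, and dropna alongside fillna; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

idx = pd.Index([1, np.nan, 3, 4])

print(idx.hasnans)                 # True when at least one value is missing
print(idx.isna().tolist())         # per-position missing-value flags
print(idx.dropna().tolist())       # missing entries removed
print(idx.fillna(2).tolist())      # missing entries replaced by a scalar
```

Both dropna and fillna return a new Index, so idx itself still contains the NaN afterwards.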

Set / reset index

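Occasionally you will load or create a data set into a DataFrame and want to add an index after you have already done so. There are a couple of different ways.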

Set an index

DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). To create a new, re-indexed DataFrame:

In [361]: data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
 .....:                     'b': ['one', 'two', 'one', 'two'],
 .....:                     'c': ['z', 'y', 'x', 'w'],
 .....:                     'd': [1., 2., 3, 4]})
 .....: 
In [362]: data
Out[362]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0
In [363]: indexed1 = data.set_index('c')
In [364]: indexed1
Out[364]: 
     a    b    d
c 
z  bar  one  1.0
y  bar  two  2.0
x  foo  one  3.0
w  foo  two  4.0
In [365]: indexed2 = data.set_index(['a', 'b'])
In [366]: indexed2
Out[366]: 
         c    d
a   b
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:

In [367]: frame = data.set_index('c', drop=False)
In [368]: frame = frame.set_index(['a', 'b'], append=True)
In [369]: frame
Out[369]: 
           c    d
c a   b 
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

Other options in set_index allow you to not drop the index columns.

In [370]: data.set_index('c', drop=False)
Out[370]: 
     a    b  c    d
c 
z  bar  one  z  1.0
y  bar  two  y  2.0
x  foo  one  x  3.0
w  foo  two  w  4.0
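The set_index options shown above can be compared side by side. The following sketch reuses the same data; the names single, multi, kept, and appended are illustrative and not part of the original guide:

```python
import pandas as pd

data = pd.DataFrame({
    "a": ["bar", "bar", "foo", "foo"],
    "b": ["one", "two", "one", "two"],
    "c": ["z", "y", "x", "w"],
    "d": [1.0, 2.0, 3.0, 4.0],
})

single = data.set_index("c")                        # regular Index, column "c" is consumed
multi = data.set_index(["a", "b"])                  # list of columns gives a MultiIndex
kept = data.set_index("c", drop=False)              # "c" stays available as a column
appended = kept.set_index(["a", "b"], append=True)  # existing index level is preserved

print(list(single.columns))
print(type(multi.index).__name__)
print(list(kept.columns))
print(list(appended.index.names))
```

Each call returns a new DataFrame, so data keeps its original integer index throughout.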

Reset the index

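As a convenience, there is a function on DataFrame called reset_index() which transfers the index values into the DataFrame's columns and sets a simple integer index. This is the inverse operation of set_index().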

In [371]: data
Out[371]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0
In [372]: data.reset_index()
Out[372]: 
   index    a    b  c    d
0      0  bar  one  z  1.0
1      1  bar  two  y  2.0
2      2  foo  one  x  3.0
3      3  foo  two  w  4.0

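The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute.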

You can use the level keyword to remove only a portion of the index:

In [373]: frame
Out[373]: 
           c    d
c a   b 
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0
In [374]: frame.reset_index(level=1)
Out[374]: 
         a  c    d
c b 
z one  bar  z  1.0
y two  bar  y  2.0
x one  foo  x  3.0
w two  foo  w  4.0

reset_index takes an optional parameter drop which, if true, simply discards the index instead of putting the index values in the DataFrame's columns.
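The drop parameter has no transcript of its own above, so here is a minimal sketch contrasting the two behaviours. The frame and the names restored and dropped are illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    "a": ["bar", "bar", "foo", "foo"],
    "b": ["one", "two", "one", "two"],
    "c": ["z", "y", "x", "w"],
    "d": [1.0, 2.0, 3.0, 4.0],
})
indexed = data.set_index("c")

restored = indexed.reset_index()           # "c" comes back as the first column
dropped = indexed.reset_index(drop=True)   # the "c" labels are discarded

print(list(restored.columns))
print(list(dropped.columns))
print(list(dropped.index))
```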

Adding an ad hoc index

You can assign a custom index to the index attribute:

In [375]: df_idx = pd.DataFrame(range(4))
In [376]: df_idx.index = pd.Index([10, 20, 30, 40], name="a")
In [377]: df_idx
Out[377]: 
    0
a 
10  0
20  1
30  2
40  3

Returning a view versus a copy

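Warning

Copy-on-Write will become the new default in pandas 3.0. This means that chained indexing will never work. As a consequence, the SettingWithCopyWarning won't be necessary anymore. See the pandas Copy-on-Write documentation for more context. We recommend turning Copy-on-Write on to leverage the improvements with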

pd.options.mode.copy_on_write = True

even before pandas 3.0 is available.

When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.

In [378]: dfmi = pd.DataFrame([list('abcd'),
 .....:                     list('efgh'),
 .....:                     list('ijkl'),
 .....:                     list('mnop')],
 .....:                    columns=pd.MultiIndex.from_product([['one', 'two'],
 .....:                                                        ['first', 'second']]))
 .....: 
In [379]: dfmi
Out[379]: 
    one          two
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p

Compare these two access methods:

In [380]: dfmi['one']['second']
Out[380]: 
0    b
1    f
2    j
3    n
Name: second, dtype: object

In [381]: dfmi.loc[:, ('one', 'second')]
Out[381]: 
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

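These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).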

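dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation, dfmi_with_one['second'], selects the series indexed by 'second'. The intermediate result is written as the variable dfmi_with_one because pandas sees these operations as separate events, that is, separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.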

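Contrast this with df.loc[:,('one','second')], which passes a nested tuple of (slice(None),('one','second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly faster, and allows one to index both axes if so desired.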
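To make the two call paths concrete, the sketch below spells out the intermediate object that the chained form creates. The names dfmi_with_one, chained, and direct are illustrative, with dfmi_with_one borrowed from the explanation above:

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=pd.MultiIndex.from_product([["one", "two"], ["first", "second"]]),
)

# Method 1: two separate __getitem__ calls with an intermediate DataFrame.
dfmi_with_one = dfmi["one"]
chained = dfmi_with_one["second"]

# Method 2: one call on .loc that receives the whole key at once.
direct = dfmi.loc[:, ("one", "second")]

print(chained.tolist() == direct.tolist())   # same values either way
print(chained.name, direct.name)             # only the resulting names differ
```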

Why does assignment fail when using chained indexing?

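The problem in the previous section is just a performance issue. What's up with the SettingWithCopy warning? We don't usually throw warnings around when you do something that might cost a few extra milliseconds!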

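But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code: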

dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

But this code is handled differently:

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

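See that __getitem__ in there? Outside of simple cases, it's very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That's what SettingWithCopy is warning you about!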

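Note

You may be wondering whether we should be concerned about the loc property in the first example. But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.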

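Sometimes a SettingWithCopy warning will arise when there's no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! pandas is probably trying to warn you that you've done this: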

def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo

Yikes!
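One common way to remove the ambiguity, offered here as a sketch rather than as the guide's own recommendation, is to take an explicit copy before modifying the subset. The value parameter and the sample frame are assumptions added so the example runs on its own:

```python
import pandas as pd

def do_something(df, value):
    # .copy() makes foo an independent object, so the assignment
    # below can never write through to df.
    foo = df[["bar", "baz"]].copy()
    foo["quux"] = value
    return foo

df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4], "other": [5, 6]})
result = do_something(df, 0)

print(list(result.columns))
print(list(df.columns))     # the caller's frame has no "quux" column
```

If the intent is instead to modify df itself, assigning through df.loc on the original frame states that intent directly.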

Evaluation order matters

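When you use chained indexing, the order and type of the indexing operation partially determine whether the result is a slice into the original object, or a copy of the slice.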

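pandas has the SettingWithCopyWarning because assigning to a copy of a slice is frequently not intentional, but a mistake caused by chained indexing returning a copy where a slice was expected.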

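If you would like pandas to be more or less trusting about assignment to a chained indexing expression, you can set the option mode.chained_assignment to one of these values: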

'warn', the default, means a SettingWithCopyWarning is printed.
'raise' means pandas will raise a SettingWithCopyError that you have to deal with.
None will suppress the warnings entirely.

In [382]: dfb = pd.DataFrame({'a': ['one', 'one', 'two',
 .....:                          'three', 'two', 'one', 'six'],
 .....:                    'c': np.arange(7)})
 .....: 
# This will show the SettingWithCopyWarning
# but the frame values will be set
In [383]: dfb['c'][dfb['a'].str.startswith('o')] = 42

This, however, operates on a copy and will not work.

In [384]: with pd.option_context('mode.chained_assignment','warn'):
 .....:    dfb[dfb['a'].str.startswith('o')]['c'] = 42
 .....:
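The difference in evaluation order can be shown without triggering any warning by making the intermediate copy explicit. In this sketch the names subset and updated are illustrative; it contrasts writing to a filtered copy with writing through .loc:

```python
import numpy as np
import pandas as pd

dfb = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "c": np.arange(7),
})
mask = dfb["a"].str.startswith("o")

# Row selection first: the filtered rows form a separate object,
# so writing to it leaves dfb untouched.
subset = dfb[mask].copy()
subset["c"] = 42
print(dfb["c"].tolist())

# A single .loc call addresses rows and column together on the frame itself.
updated = dfb.copy()
updated.loc[mask, "c"] = 42
print(updated["c"].tolist())
```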

A chained assignment can also crop up when setting values in a mixed-dtype frame.

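Note

These setting rules apply to all of .loc/.iloc.

The following is the recommended access method using .loc for multiple items (using mask) and a single item using a fixed index: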

In [385]: dfc = pd.DataFrame({'a': ['one', 'one', 'two',
 .....:                          'three', 'two', 'one', 'six'],
 .....:                    'c': np.arange(7)})
 .....: 
In [386]: dfd = dfc.copy()
# Setting multiple items using a mask
In [387]: mask = dfd['a'].str.startswith('o')
In [388]: dfd.loc[mask, 'c'] = 42
In [389]: dfd
Out[389]: 
       a   c
0    one  42
1    one  42
2    two   2
3  three   3
4    two   4
5    one  42
6    six   6
# Setting a single item
In [390]: dfd = dfc.copy()
In [391]: dfd.loc[2, 'a'] = 11
In [392]: dfd
Out[392]: 
       a  c
0    one  0
1    one  1
2     11  2
3  three  3
4    two  4
5    one  5
6    six  6

The following can work at times, but it is not guaranteed to, and therefore should be avoided:

In [393]: dfd = dfc.copy()
In [394]: dfd['a'][2] = 111
In [395]: dfd
Out[395]: 
       a  c
0    one  0
1    one  1
2    111  2
3  three  3
4    two  4
5    one  5
6    six  6

Last, the subsequent example will not work at all, and so should be avoided:

In [396]: with pd.option_context('mode.chained_assignment','raise'):
 .....:    dfd.loc[0]['a'] = 1111
 .....: 
---------------------------------------------------------------------------
SettingWithCopyError  Traceback (most recent call last)
<ipython-input-396-32ce785aaa5b> in ?()
  1 with pd.option_context('mode.chained_assignment','raise'):
----> 2     dfd.loc[0]['a'] = 1111
~/work/pandas/pandas/pandas/core/series.py in ?(self, key, value)
  1284                 )
  1285 
  1286         check_dict_or_set_indexers(key)
  1287         key = com.apply_if_callable(key, self)
-> 1288         cacher_needs_updating = self._check_is_chained_assignment_possible()
  1289 
  1290         if key is Ellipsis:
  1291             key = slice(None)
~/work/pandas/pandas/pandas/core/series.py in ?(self)
  1489             ref = self._get_cacher()
  1490             if ref is not None and ref._is_mixed_type:
  1491                 self._check_setitem_copy(t="referent", force=True)
  1492             return True
-> 1493         return super()._check_is_chained_assignment_possible()
~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  4395         single-dtype meaning that the cacher should be updated following
  4396         setting.
  4397         """
  4398 if self._is_copy:
-> 4399 self._check_setitem_copy(t="referent")
  4400 return False
~/work/pandas/pandas/pandas/core/generic.py in ?(self, t, force)
  4469 "indexing.html#returning-a-view-versus-a-copy"
  4470 )
  4471  
  4472 if value == "raise":
-> 4473 raise SettingWithCopyError(t)
  4474 if value == "warn":
  4475 warnings.warn(t, SettingWithCopyWarning, stacklevel=find_stack_level())
SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

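Warning

The chained assignment warnings / exceptions aim to inform the user of a possibly invalid assignment. There may be false positives: situations where a chained assignment is inadvertently reported.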
