Pandas 2.2 Official Tutorials and Guides, Chinese Edition (11.2) (4) - Alibaba Cloud Developer Community

Pandas 2.2 Official Tutorials and Guides, Chinese Edition (11.2) (3): https://developer.aliyun.com/article/1509871

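Index objects

The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed.

Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index: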

In [317]: index = pd.Index(['e', 'd', 'a', 'b'])
In [318]: index
Out[318]: Index(['e', 'd', 'a', 'b'], dtype='object')
In [319]: 'd' in index
Out[319]: True
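
Because an Index behaves like an ordered multiset, repeated labels are legal. The following is a minimal sketch (the variable names are illustrative and not part of the original guide) of how duplicates can be detected and located:

```python
import pandas as pd

# An Index may hold the same label more than once.
idx = pd.Index(["e", "d", "a", "d"])

has_d = "d" in idx                      # membership test still works
is_unique = idx.is_unique               # False, because "d" repeats
dup_mask = idx.duplicated()             # flags every repeat after the first occurrence
positions = idx.get_indexer_for(["d"])  # all integer positions that hold "d"

print(has_d, is_unique, dup_mask.tolist(), positions.tolist())
```

Checking `is_unique` before a label-based lookup is one way to notice early that a selection may return several rows.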

Or using numbers:

In [320]: index = pd.Index([1, 5, 12])
In [321]: index
Out[321]: Index([1, 5, 12], dtype='int64')
In [322]: 5 in index
Out[322]: True

If no dtype is given, Index tries to infer the dtype from the data. It is also possible to give an explicit dtype when instantiating an Index:

In [323]: index = pd.Index(['e', 'd', 'a', 'b'], dtype="string")
In [324]: index
Out[324]: Index(['e', 'd', 'a', 'b'], dtype='string')
In [325]: index = pd.Index([1, 5, 12], dtype="int8")
In [326]: index
Out[326]: Index([1, 5, 12], dtype='int8')
In [327]: index = pd.Index([1, 5, 12], dtype="float32")
In [328]: index
Out[328]: Index([1.0, 5.0, 12.0], dtype='float32')
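
To make the inference rule concrete, here is a small sketch with illustrative variable names. It only relies on the `Index` constructor and the `dtype` attribute:

```python
import pandas as pd

inferred_int = pd.Index([1, 5, 12])         # all integers
inferred_float = pd.Index([1, 5.5, 12])     # one float promotes the whole index
inferred_mixed = pd.Index([1, "a", 2.5])    # numbers and strings together
explicit = pd.Index([1, 5, 12], dtype="int8")

for ix in (inferred_int, inferred_float, inferred_mixed, explicit):
    print(ix.dtype)
```

Passing `dtype` explicitly is mainly useful when the inferred type would be wider than needed, as with `int8` here.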

You can also pass a name to be stored in the index:

In [329]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')
In [330]: index.name
Out[330]: 'something'

The name, if set, will be shown in the console display:

In [331]: index = pd.Index(list(range(5)), name='rows')
In [332]: columns = pd.Index(['A', 'B', 'C'], name='cols')
In [333]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)
In [334]: df
Out[334]: 
cols         A         B         C
rows 
0     1.295989 -1.051694  1.340429
1    -2.366110  0.428241  0.387275
2     0.433306  0.929548  0.278094
3     2.154730 -0.315628  0.264223
4     1.126818  1.132290 -0.353310
In [335]: df['A']
Out[335]: 
rows
0    1.295989
1   -2.366110
2    0.433306
3    2.154730
4    1.126818
Name: A, dtype: float64

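Setting metadata

Indexes are "mostly immutable", but it is possible to set and change their name attribute. You can use rename and set_names to set these attributes directly; by default they return a copy.

See Advanced Indexing for usage of MultiIndexes.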

In [336]: ind = pd.Index([1, 2, 3])
In [337]: ind.rename("apple")
Out[337]: Index([1, 2, 3], dtype='int64', name='apple')
In [338]: ind
Out[338]: Index([1, 2, 3], dtype='int64')
In [339]: ind = ind.set_names(["apple"])
In [340]: ind.name = "bob"
In [341]: ind
Out[341]: Index([1, 2, 3], dtype='int64', name='bob')
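
A short sketch of what "mostly immutable" means in practice, under the assumption that element assignment on an Index raises `TypeError` (the variable names are illustrative):

```python
import pandas as pd

ind = pd.Index([1, 2, 3])

# Element assignment is rejected: Index values cannot be changed in place.
error = None
try:
    ind[0] = 10
except TypeError as exc:
    error = exc

# rename() returns a new Index; the original keeps its (absent) name.
renamed = ind.rename("apple")

print(type(error).__name__, renamed.name, ind.name)
```

Only metadata such as `name` can be changed on the existing object; the values themselves require building a new Index.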

set_names, set_levels, and set_codes also take an optional level argument.

In [342]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
In [343]: index
Out[343]: 
MultiIndex([(0, 'one'),
            (0, 'two'),
            (1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['first', 'second'])
In [344]: index.levels[1]
Out[344]: Index(['one', 'two'], dtype='object', name='second')
In [345]: index.set_levels(["a", "b"], level=1)
Out[345]: 
MultiIndex([(0, 'a'),
            (0, 'b'),
            (1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['first', 'second'])
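
### Set operations on Index objects

The two main operations are `union` and `intersection`. Difference is provided via the `.difference()` method.

```py
In [346]: a = pd.Index(['c', 'b', 'a'])
In [347]: b = pd.Index(['c', 'e', 'd'])
In [348]: a.difference(b)
Out[348]: Index(['a', 'b'], dtype='object')
```

Also available is the `symmetric_difference` operation, which returns elements that appear in either `idx1` or `idx2`, but not in both. This is equivalent to the Index created by `idx1.difference(idx2).union(idx2.difference(idx1))`, with duplicates dropped.

```py
In [349]: idx1 = pd.Index([1, 2, 3, 4])
In [350]: idx2 = pd.Index([2, 3, 4, 5])
In [351]: idx1.symmetric_difference(idx2)
Out[351]: Index([1, 5], dtype='int64')
```

Note

The resulting index from a set operation will be sorted in ascending order.

When performing `Index.union()` between indexes with different dtypes, the indexes must be cast to a common dtype. Typically, though not always, this is object dtype. The exception is a union between integer and float data, in which case the integer values are converted to float.

```py
In [352]: idx1 = pd.Index([0, 1, 2])
In [353]: idx2 = pd.Index([0.5, 1.5])
In [354]: idx1.union(idx2)
Out[354]: Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')
```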
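
The text names `union` and `intersection` as the two main operations but only demonstrates `difference`. The sketch below reuses the `a` and `b` indexes from the example above to show the remaining operations, and checks the stated equivalence for `symmetric_difference`. The other variable names are illustrative.

```python
import pandas as pd

a = pd.Index(["c", "b", "a"])
b = pd.Index(["c", "e", "d"])

union = a.union(b)                # labels found in either index
inter = a.intersection(b)         # labels found in both indexes
sym = a.symmetric_difference(b)   # labels found in exactly one index

# The identity quoted in the text, spelled out with difference() and union().
manual = a.difference(b).union(b.difference(a))

print(list(union), list(inter), list(sym), sym.equals(manual))
```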

### Missing values

Important

Even though `Index` can hold missing values (`NaN`), it should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.

`Index.fillna` fills missing values with the specified scalar value.

```py
In [355]: idx1 = pd.Index([1, np.nan, 3, 4])
In [356]: idx1
Out[356]: Index([1.0, nan, 3.0, 4.0], dtype='float64')
In [357]: idx1.fillna(2)
Out[357]: Index([1.0, 2.0, 3.0, 4.0], dtype='float64')
In [358]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
 .....:                         pd.NaT,
 .....:                         pd.Timestamp('2011-01-03')])
 .....: 
In [359]: idx2
Out[359]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)
In [360]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[360]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
```
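
As an illustration of an operation that skips missing values implicitly, the following sketch inspects an index containing `NaN`. The variable names are illustrative; `hasnans`, `isna`, `dropna`, `max`, and `fillna` are existing `Index` members.

```python
import numpy as np
import pandas as pd

idx = pd.Index([1, np.nan, 3, 4])

has_missing = idx.hasnans   # quick check for any missing label
mask = idx.isna()           # boolean array marking the missing positions
cleaned = idx.dropna()      # new Index without the missing label
largest = idx.max()         # NaN is skipped by default
filled = idx.fillna(2)      # replace the missing label with a scalar

print(has_missing, mask.tolist(), list(cleaned), largest, list(filled))
```

Checking `hasnans` right after constructing an index is a cheap way to catch missing labels before they affect later alignment or lookups.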

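Set / reset index

Occasionally you will load or create a data set into a DataFrame and want to add an index after you have already done so. There are a couple of different ways.

Set an index

DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). To create a new, re-indexed DataFrame: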

In [361]: data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
 .....:                     'b': ['one', 'two', 'one', 'two'],
 .....:                     'c': ['z', 'y', 'x', 'w'],
 .....:                     'd': [1., 2., 3, 4]})
 .....: 
In [362]: data
Out[362]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0
In [363]: indexed1 = data.set_index('c')
In [364]: indexed1
Out[364]: 
     a    b    d
c
z  bar  one  1.0
y  bar  two  2.0
x  foo  one  3.0
w  foo  two  4.0
In [365]: indexed2 = data.set_index(['a', 'b'])
In [366]: indexed2
Out[366]: 
         c    d
a   b
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:

In [367]: frame = data.set_index('c', drop=False)
In [368]: frame = frame.set_index(['a', 'b'], append=True)
In [369]: frame
Out[369]: 
           c    d
c a   b
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

Other options in set_index allow you to keep the index columns instead of dropping them.

In [370]: data.set_index('c', drop=False)
Out[370]: 
     a    b  c    d
c
z  bar  one  z  1.0
y  bar  two  y  2.0
x  foo  one  x  3.0
w  foo  two  w  4.0

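Reset the index

As a convenience, DataFrame provides reset_index(), which transfers the index values into the DataFrame's columns and sets a simple integer index. This is the inverse operation of set_index().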

In [371]: data
Out[371]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0
In [372]: data.reset_index()
Out[372]: 
   index    a    b  c    d
0      0  bar  one  z  1.0
1      1  bar  two  y  2.0
2      2  foo  one  x  3.0
3      3  foo  two  w  4.0

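The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute.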
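
To illustrate that reset_index() undoes set_index(), here is a minimal round-trip sketch using the same `data` frame as above. The names `indexed`, `restored`, and `round_trip_ok` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "a": ["bar", "bar", "foo", "foo"],
    "b": ["one", "two", "one", "two"],
    "c": ["z", "y", "x", "w"],
    "d": [1.0, 2.0, 3.0, 4.0],
})

# Move columns "a" and "b" into a two-level index ...
indexed = data.set_index(["a", "b"])

# ... and bring them back as ordinary columns with a fresh integer index.
restored = indexed.reset_index()

# Compare against the original, using the original column order.
round_trip_ok = restored[list(data.columns)].equals(data)
print(list(indexed.index.names), list(restored.columns), round_trip_ok)
```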

You can use the level keyword to remove only a portion of the index:

In [373]: frame
Out[373]: 
           c    d
c a   b
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0
In [374]: frame.reset_index(level=1)
Out[374]: 
         a  c    d
c b
z one  bar  z  1.0
y two  bar  y  2.0
x one  foo  x  3.0
w two  foo  w  4.0

reset_index takes an optional parameter drop which, if true, simply discards the index instead of putting the index values in the DataFrame's columns.
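
The difference between the default behavior and `drop=True` can be sketched as follows, again with the `data` frame from this section. The names `indexed`, `kept`, and `dropped` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "a": ["bar", "bar", "foo", "foo"],
    "b": ["one", "two", "one", "two"],
    "c": ["z", "y", "x", "w"],
    "d": [1.0, 2.0, 3.0, 4.0],
})
indexed = data.set_index("c")

kept = indexed.reset_index()              # "c" comes back as a column
dropped = indexed.reset_index(drop=True)  # "c" is discarded entirely

print(list(kept.columns), list(dropped.columns), list(dropped.index))
```

`drop=True` is the natural choice when the old index carries no information worth keeping, for example after filtering rows.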

Adding an ad hoc index

You can assign a custom index to the index attribute:

In [375]: df_idx = pd.DataFrame(range(4))
In [376]: df_idx.index = pd.Index([10, 20, 30, 40], name="a")
In [377]: df_idx
Out[377]: 
    0
a
10  0
20  1
30  2
40  3

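Returning a view versus a copy

Warning

Copy-on-Write will become the new default in pandas 3.0. This means that chained indexing will never work. As a consequence, the SettingWithCopyWarning won't be necessary anymore. See the Copy-on-Write section of the pandas user guide for more context. We recommend turning Copy-on-Write on to leverage the improvements with

pd.options.mode.copy_on_write = True

even before pandas 3.0 is available.

When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.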

In [378]: dfmi = pd.DataFrame([list('abcd'),
 .....:                     list('efgh'),
 .....:                     list('ijkl'),
 .....:                     list('mnop')],
 .....:                    columns=pd.MultiIndex.from_product([['one', 'two'],
 .....:                                                        ['first', 'second']]))
 .....: 
In [379]: dfmi
Out[379]: 
    one          two
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p

Compare these two access methods:

In [380]: dfmi['one']['second']
Out[380]: 
0    b
1    f
2    j
3    n
Name: second, dtype: object

In [381]: dfmi.loc[:, ('one', 'second')]
Out[381]: 
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

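These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).

dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation, dfmi_with_one['second'], selects the Series indexed by 'second'. The intermediate result is written here as the variable dfmi_with_one because pandas sees these operations as separate events, that is, separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.

Contrast this to dfmi.loc[:, ('one', 'second')], which passes a nested tuple of (slice(None), ('one', 'second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly faster, and allows one to index both axes if so desired.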
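
The two access paths can be written out explicitly. The sketch below rebuilds `dfmi` and names the intermediate object `dfmi_with_one`, as the text does; `chained`, `single`, and `same_values` are illustrative names.

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=pd.MultiIndex.from_product([["one", "two"], ["first", "second"]]),
)

# Method 1: two separate __getitem__ calls with an intermediate DataFrame.
dfmi_with_one = dfmi["one"]
chained = dfmi_with_one["second"]

# Method 2: a single indexing call that receives the whole key at once.
single = dfmi.loc[:, ("one", "second")]

same_values = chained.tolist() == single.tolist()
print(same_values, chained.name, single.name)
```

The values agree, but the Series names differ: the chained form only sees the inner label, while the `.loc` form keeps the full column key.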

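Why does assignment fail when using chained indexing?

Warning

Copy-on-Write will become the new default in pandas 3.0. This means that chained indexing will never work. As a consequence, the SettingWithCopyWarning won't be necessary anymore. See the Copy-on-Write section of the pandas user guide for more context. We recommend turning Copy-on-Write on to leverage the improvements with

pd.options.mode.copy_on_write = True

even before pandas 3.0 is available.

The problem in the previous section is just a performance issue. What's up with the SettingWithCopy warning? We don't usually throw warnings around when you do something that might cost a few extra milliseconds!

But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code: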

dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

But this code is handled differently:

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

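See that __getitem__ in there? Outside of simple cases, it's very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That's what SettingWithCopy is warning you about!

Note

You may be wondering whether we should be concerned about the loc property in the first example. But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.

Sometimes a SettingWithCopy warning will arise when there is no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! pandas is probably trying to warn you that you have done this: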

def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo

Oops!
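
One common way to remove the ambiguity in a function like do_something is to take an explicit copy before modifying the subset. The sketch below is an assumption about how one might restructure it, not part of the original guide; the extra `value` parameter and the sample frame are illustrative.

```python
import pandas as pd


def do_something(df, value):
    # .copy() makes foo an independent object, so the assignment below
    # cannot reach back into df.
    foo = df[["bar", "baz"]].copy()
    foo["quux"] = value
    return foo


df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4], "other": [5, 6]})
result = do_something(df, 0)

print(list(result.columns), list(df.columns))
```

With the copy in place it is clear from the code alone that `df` is left untouched and only the returned frame gains the new column.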

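Evaluation order matters

Warning

Copy-on-Write will become the new default in pandas 3.0. This means that chained indexing will never work. As a consequence, the SettingWithCopyWarning won't be necessary anymore. See the Copy-on-Write section of the pandas user guide for more context. We recommend turning Copy-on-Write on to leverage the improvements with

pd.options.mode.copy_on_write = True

even before pandas 3.0 is available.

When you use chained indexing, the order and type of the indexing operation partially determine whether the result is a slice into the original object, or a copy of the slice.

pandas has the SettingWithCopyWarning because assigning to a copy of a slice is frequently not intentional, but a mistake caused by chained indexing returning a copy where a slice was expected.

If you would like pandas to be more or less trusting about assignment to a chained indexing expression, you can set the option mode.chained_assignment to one of these values:

'warn', the default, means a SettingWithCopyWarning is printed.
'raise' means pandas will raise a SettingWithCopyError that you have to deal with.
None will suppress the warnings entirely.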

In [382]: dfb = pd.DataFrame({'a': ['one', 'one', 'two',
 .....:                          'three', 'two', 'one', 'six'],
 .....:                    'c': np.arange(7)})
 .....: 
# This will show the SettingWithCopyWarning
# but the frame values will be set
In [383]: dfb['c'][dfb['a'].str.startswith('o')] = 42

This, however, operates on a copy and will not work.

In [384]: with pd.option_context('mode.chained_assignment','warn'):
 .....:    dfb[dfb['a'].str.startswith('o')]['c'] = 42
 .....:
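
The effect can be checked directly. The sketch below is illustrative: it silences the chained-assignment warning only for the demonstration, shows that the frame is unchanged after the chained form, and then applies the same update through `.loc`. The names `after_chained` and `after_loc` are hypothetical.

```python
import warnings

import numpy as np
import pandas as pd

dfb = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "c": np.arange(7),
})
mask = dfb["a"].str.startswith("o")

# Chained form: dfb[mask] builds a new DataFrame, and the assignment lands on it.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the warning for this demonstration only
    dfb[mask]["c"] = 42
after_chained = dfb["c"].tolist()

# Single .loc call: rows and column are selected together, so dfb is updated.
dfb.loc[mask, "c"] = 42
after_loc = dfb["c"].tolist()

print(after_chained)
print(after_loc)
```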

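A chained assignment can also crop up when setting values in a mixed-dtype frame.

Note

These setting rules apply to all of .loc/.iloc.

The following is the recommended access method using .loc for multiple items (using mask) and for a single item using a fixed index: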

In [385]: dfc = pd.DataFrame({'a': ['one', 'one', 'two',
 .....:                          'three', 'two', 'one', 'six'],
 .....:                    'c': np.arange(7)})
 .....: 
In [386]: dfd = dfc.copy()
# Setting multiple items using a mask
In [387]: mask = dfd['a'].str.startswith('o')
In [388]: dfd.loc[mask, 'c'] = 42
In [389]: dfd
Out[389]: 
       a   c
0    one  42
1    one  42
2    two   2
3  three   3
4    two   4
5    one  42
6    six   6
# Setting a single item
In [390]: dfd = dfc.copy()
In [391]: dfd.loc[2, 'a'] = 11
In [392]: dfd
Out[392]: 
       a  c
0    one  0
1    one  1
2     11  2
3  three  3
4    two  4
5    one  5
6    six  6

The following can work at times, but it is not guaranteed to, and therefore should be avoided:

In [393]: dfd = dfc.copy()
In [394]: dfd['a'][2] = 111
In [395]: dfd
Out[395]: 
       a  c
0    one  0
1    one  1
2    111  2
3  three  3
4    two  4
5    one  5
6    six  6

Last, the subsequent example will not work at all, and so should be avoided:

In [396]: with pd.option_context('mode.chained_assignment','raise'):
 .....:    dfd.loc[0]['a'] = 1111
 .....: 
---------------------------------------------------------------------------
SettingWithCopyError  Traceback (most recent call last)
<ipython-input-396-32ce785aaa5b> in ?()
  1 with pd.option_context('mode.chained_assignment','raise'):
----> 2     dfd.loc[0]['a'] = 1111
~/work/pandas/pandas/pandas/core/series.py in ?(self, key, value)
  1284                 )
  1285 
  1286         check_dict_or_set_indexers(key)
  1287         key = com.apply_if_callable(key, self)
-> 1288         cacher_needs_updating = self._check_is_chained_assignment_possible()
  1289 
  1290         if key is Ellipsis:
  1291             key = slice(None)
~/work/pandas/pandas/pandas/core/series.py in ?(self)
  1489             ref = self._get_cacher()
  1490             if ref is not None and ref._is_mixed_type:
  1491                 self._check_setitem_copy(t="referent", force=True)
  1492             return True
-> 1493         return super()._check_is_chained_assignment_possible()
~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  4395         single-dtype meaning that the cacher should be updated following
  4396         setting.
  4397         """
  4398         if self._is_copy:
-> 4399             self._check_setitem_copy(t="referent")
  4400         return False
~/work/pandas/pandas/pandas/core/generic.py in ?(self, t, force)
  4469                     "indexing.html#returning-a-view-versus-a-copy"
  4470                 )
  4471 
  4472             if value == "raise":
-> 4473                 raise SettingWithCopyError(t)
  4474             if value == "warn":
  4475                 warnings.warn(t, SettingWithCopyWarning, stacklevel=find_stack_level())
SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

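Warning

The chained assignment warnings / exceptions aim to inform the user of a possibly invalid assignment. There may be false positives, that is, situations where a chained assignment is inadvertently reported.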
