Get the names of the top 'n' columns based on a threshold across each row - Q&A - Alibaba Cloud Developer Community

In [1]: df
Out[1]:
  Student_Name  Maths  Physics  Chemistry  Biology  English
0     John Doe     90       87         81       65       70
1     Jane Doe     82       84         75       73       77
2     Mary Lim     40       65         55       60       70
3     Lisa Ray     55       52         77       62       90

In [3]: df
Out[3]:
  Student_Name  Maths  Physics  Chemistry  Biology  English             Top_3_above_80
0     John Doe     90       87         81       65       70  Maths, Physics, Chemistry
1     Jane Doe     82       84         75       73       77             Physics, Maths
2     Mary Lim     40       65         55       60       70                        nan
3     Lisa Ray     55       52         77       62       90                    English
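For anyone who wants to reproduce the examples, the input frame can be rebuilt from the values displayed above. This is only a sketch for convenience: the values are copied from the printed table, and the integer dtypes are an assumption.

```python
import pandas as pd

# Scores copied from the table shown in Out[1].
df = pd.DataFrame({
    "Student_Name": ["John Doe", "Jane Doe", "Mary Lim", "Lisa Ray"],
    "Maths": [90, 82, 40, 55],
    "Physics": [87, 84, 65, 52],
    "Chemistry": [81, 75, 55, 77],
    "Biology": [65, 73, 60, 62],
    "English": [70, 77, 70, 90],
})
print(df)
```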

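The idea is to first replace the values that do not pass the threshold with missing values using DataFrame.where, then sort each row with numpy.argsort. Because NaN sorts last, the columns that pass the threshold come first, in descending order of score. Next, count the True values in each row (the number of non-missing values) and use that count with numpy.where to replace the names of the non-matching columns with empty strings.

Finally, join the names in a list comprehension and keep only the rows with at least one match, so that the remaining rows end up as missing values: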

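import numpy as np
import pandas as pd

# score columns only (skip Student_Name)
df1 = df.iloc[:, 1:]

# True where the score is above the threshold
m = df1 > 80
count = m.sum(axis=1)
# negate so argsort is descending; masked (NaN) entries sort last
arr = df1.columns.values[np.argsort(-df1.where(m).to_numpy(), axis=1)]

# keep only the first `count` names in each row
m = np.arange(arr.shape[1]) < count.to_numpy()[:, None]
a = np.where(m, arr, '')

L = [', '.join(x).strip(', ') for x in a]
# rows without any match are left out and become NaN
df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
print (df)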
  Student_Name  Maths  Physics  Chemistry  Biology  English  \
0     John Doe     90       87         81       65       70   
1     Jane Doe     82       84         75       73       77   
2     Mary Lim     40       65         55       60       70   
3     Lisa Ray     55       52         77       62       90   

              Top_3_above_80  
0  Maths, Physics, Chemistry  
1             Physics, Maths  
2                        NaN  
3                    English
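Note that the snippet above keeps every column over the threshold and never limits the result to three names. That makes no difference here, because no row has more than three scores above 80. A minimal sketch of the same argsort approach with an explicit cap might look like the following. The helper name `top_n_above` and its parameters `threshold`, `n`, and `sep` are hypothetical, not part of the original answer, and the sketch assumes all numeric columns are scores.

```python
import numpy as np
import pandas as pd

def top_n_above(df, threshold=80, n=3, sep=", "):
    """Hypothetical helper: per row, names of up to n numeric columns above threshold."""
    scores = df.select_dtypes("number")
    mask = scores > threshold
    # Cap the per-row count at n so no more than n names are kept.
    count = np.minimum(mask.sum(axis=1).to_numpy(), n)
    # Negate for descending order; masked (NaN) entries sort to the end.
    order = np.argsort(-scores.where(mask).to_numpy(), axis=1, kind="stable")
    names = scores.columns.to_numpy()[order]
    # Keep only the first `count` names in each row.
    keep = np.arange(names.shape[1]) < count[:, None]
    joined = [sep.join(row[k]) for row, k in zip(names, keep)]
    # Rows with no column above the threshold become NaN.
    return pd.Series(joined, index=df.index).where(count > 0)

df = pd.DataFrame({
    "Student_Name": ["John Doe", "Jane Doe", "Mary Lim", "Lisa Ray"],
    "Maths": [90, 82, 40, 55],
    "Physics": [87, 84, 65, 52],
    "Chemistry": [81, 75, 55, 77],
    "Biology": [65, 73, 60, 62],
    "English": [70, 77, 70, 90],
})

result = top_n_above(df, threshold=80, n=3)
top_2 = top_n_above(df, threshold=80, n=2)
print(result)
print(top_2)
```

The stable sort is a design choice so that columns with equal scores keep their original left-to-right order.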

If performance is not important, use Series.nlargest on each row instead, but this is really slow on a large DataFrame:

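# assumes df is the original frame, without the Top_3_above_80 column
df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)

# dropna() makes sure masked (NaN) columns can never be picked by nlargest
df['Top_3_above_80'] = (df1.where(m)
                           .apply(lambda x: ', '.join(x.dropna().nlargest(3).index), axis=1)[count > 0])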
print (df)
  Student_Name  Maths  Physics  Chemistry  Biology  English  \
0     John Doe     90       87         81       65       70   
1     Jane Doe     82       84         75       73       77   
2     Mary Lim     40       65         55       60       70   
3     Lisa Ray     55       52         77       62       90   

              Top_3_above_80  
0  Maths, Physics, Chemistry  
1             Physics, Maths  
2                        NaN  
3                    English

Performance:

#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
#print (df)

def f1(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    arr = df1.columns.values[np.argsort(-df1.where(m).to_numpy(), axis=1)]

    m = np.arange(arr.shape[1]) < count.to_numpy()[:, None]
    a = np.where(m, arr, '')

    L = [', '.join(x).strip(', ') for x in a]
    df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
    return df

def f2(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)

    df['Top_3_above_80'] = (df1.where(m).apply(lambda x: ', '.join(x.dropna().nlargest(3).index), axis=1)[count > 0])
    return df

In [210]: %timeit (f1(df.copy()))
19.3 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [211]: %timeit (f2(df.copy()))
2.43 s ± 61.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
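Outside IPython, a comparison along these lines can be run with the standard timeit module. The script below is a sketch under stated assumptions: `vectorized` and `per_row` are placeholder names for the two approaches, the frame is rebuilt from the sample data and repeated to only 400 rows to keep the run short, and the per-row variant drops NaN explicitly before calling nlargest. Absolute timings will differ from the figures above depending on machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "Student_Name": ["John Doe", "Jane Doe", "Mary Lim", "Lisa Ray"],
    "Maths": [90, 82, 40, 55],
    "Physics": [87, 84, 65, 52],
    "Chemistry": [81, 75, 55, 77],
    "Biology": [65, 73, 60, 62],
    "English": [70, 77, 70, 90],
})
# 400 rows: smaller than the 4k-row frame used above, to keep the run short.
big = pd.concat([base] * 100, ignore_index=True)

def vectorized(df):
    """argsort-based approach (placeholder name)."""
    scores = df.iloc[:, 1:]
    mask = scores > 80
    count = mask.sum(axis=1).to_numpy()
    order = np.argsort(-scores.where(mask).to_numpy(), axis=1)
    names = scores.columns.to_numpy()[order]
    keep = np.arange(names.shape[1]) < count[:, None]
    joined = [", ".join(x).strip(", ") for x in np.where(keep, names, "")]
    return pd.Series(joined, index=df.index)[count > 0]

def per_row(df):
    """Row-wise nlargest approach (placeholder name)."""
    scores = df.iloc[:, 1:]
    mask = scores > 80
    count = mask.sum(axis=1)
    out = scores.where(mask).apply(
        lambda x: ", ".join(x.dropna().nlargest(3).index), axis=1
    )
    return out[count > 0]

# Both approaches should agree on this data before their speed is compared.
same = vectorized(big).tolist() == per_row(big).tolist()
print("results agree:", same)

t_vec = timeit.timeit(lambda: vectorized(big), number=3)
t_row = timeit.timeit(lambda: per_row(big), number=3)
print(f"vectorized: {t_vec:.4f}s  per-row: {t_row:.4f}s (3 runs each)")
```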

Answer source: Stack Overflow
