
Group by and calculate count and mean under multiple conditions in Pandas

Given a DataFrame like the following:

  id|address|sell_price|market_price|status|start_date|end_date
  1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
  1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
  2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
  2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
  2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
  3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
  3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
  3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
  3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
  4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
  4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
  4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
  5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
  5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
  5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9

# Copy the data above, then run the code below to reproduce the DataFrame
import pandas as pd
df = pd.read_clipboard(sep='|')
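
If no clipboard is available (for example on a headless machine), the same table can be read from a string instead. This is only a sketch: the name `raw` is illustrative, and it assumes every missing value is spelled `NaN`, so that pandas parses both price columns as floats.

```python
import io
import pandas as pd

# Same rows as the table above; every missing value is written as "NaN".
raw = """id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
"""

# Parse the pipe-separated text exactly as read_clipboard(sep='|') would.
df = pd.read_csv(io.StringIO(raw), sep="|")
print(df.dtypes)
```
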

I want to group by `id` and `address` and calculate `mean_ratio` and `result_count` according to the following conditions:

  1. `mean_ratio`: group by `id` and `address` and take the mean of the `sell_price / market_price` ratio over the rows where `status` is `finished` and `start_date` falls in `2019-09` or `2019-10`.
  2. `result_count`: group by `id` and `address` and count the rows where `status` is either `finished` or `failed` and `start_date` falls in `2019-09` or `2019-10`.

The desired output looks like this:

   id               address  mean_ratio  result_count
0   1    7552 Atlantic Lane         NaN             0
1   2     888 Foster Street        1.32             1
2   3       5 Pawnee Avenue        1.25             1
3   4  916 W. Mill Pond St.        1.44             3
4   5        68 Henry Drive         NaN             2

So far I have tried:

# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)

To keep only the rows whose `start_date` falls in `2019-09` or `2019-10`:

import numpy as np

L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
# OR together the month test for every date column listed in c
df = df[np.logical_or.reduce([df[x].dt.to_period('M').isin(L) for x in c])]
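
The `np.logical_or.reduce` construction only pays off when several date columns are tested; with a single column the month check reduces to one expression. A minimal sketch on a few `start_date` values taken from the table (the variable names are illustrative):

```python
import pandas as pd

# A few start_date values from the table above.
start = pd.Series(pd.to_datetime(
    ["2019/8/2", "2019/9/30", "2019/10/14", "2019/12/8"], format="%Y/%m/%d"
))

# Option 1: compare calendar months, as in the question.
months = start.dt.to_period("M")
in_range_period = months.isin([pd.Period("2019-09"), pd.Period("2019-10")])

# Option 2: closed interval on the dates themselves (both ends inclusive).
in_range_dates = start.between(pd.Timestamp("2019-09-01"), pd.Timestamp("2019-10-31"))

print(in_range_period.tolist())
print(in_range_dates.tolist())
```

Both masks agree here because the column holds dates without a time component. If the values carried times, anything later than midnight on October 31 would fall outside the interval in option 2, so the month comparison is likely the safer choice.
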

To keep only the rows whose status is `finished` or `failed`, I use:

mask = df['status'].str.contains('finished|failed')
df[mask]
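
One thing to keep in mind is that `str.contains` does a regex substring search, so it would also match any status that merely contains one of the words. `Series.isin` tests exact membership instead. The sketch below illustrates the difference; the value `unfinished` is hypothetical and does not occur in the data above.

```python
import pandas as pd

# "unfinished" is a made-up status, added only to show the substring effect.
status = pd.Series(["finished", "failed", "current", "forward", "unfinished"])

# Regex substring match: also catches "unfinished".
by_contains = status.str.contains("finished|failed")

# Exact membership test: only the two listed values match.
by_isin = status.isin(["finished", "failed"])

print(pd.DataFrame({"status": status, "contains": by_contains, "isin": by_isin}))
```

For the statuses in this question both masks select the same rows, so the choice only matters if other status values can appear.
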

But I don't know how to combine these to get the final result. Thanks in advance for your help.

Question source: Stack Overflow

is大龙 2020-03-24 12:10:39
1 answer
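  • I think you need GroupBy.agg. Because the filtering excludes some groups entirely (for example id = 1), first store all unique id/address pairs in df2, then attach the aggregated result to them with DataFrame.join, and finally replace the missing values in the result_count column with 0: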

    df2 = df[['id','address']].drop_duplicates()
    print (df2)
        id               address
    0    1    7552 Atlantic Lane
    2    2     888 Foster Street
    5    3       5 Pawnee Avenue
    9    4  916 W. Mill Pond St.
    12   5        68 Henry Drive
    
    df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
    df['ratio'] = round(df['sell_price']/df['market_price'], 2)
    L = [pd.Period('2019-09'), pd.Period('2019-10')] 
    c = ['start_date']
    
    mask = df['status'].str.contains('finished|failed')
    mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
    
    df = df[mask1 & mask]
    
    df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio','mean'),
                                            result_count=('ratio','size'))
    
    df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
    print (df1)
        id               address  mean_ratio  result_count
    0    1    7552 Atlantic Lane         NaN           0.0
    2    2     888 Foster Street    1.320000           1.0
    5    3       5 Pawnee Avenue    1.250000           1.0
    9    4  916 W. Mill Pond St.    1.436667           3.0
    12   5        68 Henry Drive         NaN           2.0
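
The code above averages `ratio` over both `finished` and `failed` rows. That matches the desired output here only because the `failed` rows have no `market_price`. As a possible alternative, the sketch below applies the two conditions from the question separately and does not drop any rows, so every group survives and `result_count` stays an integer. The names `raw`, `ratio_mask`, `count_mask`, `counted` and `out` are illustrative, and the data is re-read from a string so that the block runs on its own.

```python
import io
import pandas as pd

raw = """id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
"""
df = pd.read_csv(io.StringIO(raw), sep="|")
df["start_date"] = pd.to_datetime(df["start_date"], format="%Y/%m/%d")

# Rows whose start_date falls in September or October 2019.
in_range = df["start_date"].dt.to_period("M").isin(
    [pd.Period("2019-09"), pd.Period("2019-10")]
)
# Condition 1 (mean_ratio): finished rows only.
ratio_mask = in_range & df["status"].eq("finished")
# Condition 2 (result_count): finished or failed rows.
count_mask = in_range & df["status"].isin(["finished", "failed"])

out = (
    df.assign(
        # Keep the ratio only where condition 1 holds; NaN elsewhere is skipped by mean.
        ratio=(df["sell_price"] / df["market_price"]).round(2).where(ratio_mask),
        # Boolean flag; summing it counts the rows that satisfy condition 2.
        counted=count_mask,
    )
    .groupby(["id", "address"], as_index=False, sort=False)
    .agg(mean_ratio=("ratio", "mean"), result_count=("counted", "sum"))
)
out["mean_ratio"] = out["mean_ratio"].round(2)
print(out)
```

Since nothing is filtered out before the `groupby`, the `drop_duplicates` and `join` steps are not needed in this variant, and a group with no qualifying rows gets `NaN` for `mean_ratio` and `0` for `result_count`.
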
    

    Answer source: Stack Overflow

    2020-03-24 12:10:46