解析天池大赛O2O第一名 extract_feature.py 提取特征

本文涉及的产品
公共DNS(含HTTPDNS解析),每月1000万次HTTP解析
全局流量管理 GTM,标准版 1个月
云解析 DNS,旗舰版 1个月
简介: 读取数据dataset split:(date_received)dateset3: 20160701~20160731 (113640),features3 from 20160315~20160630 (off_test)datese...

读取数据

  • dataset split:
    • (date_received)
    • dateset3: 20160701~20160731 (113640),features3 from 20160315~20160630 (off_test)
    • dateset2: 20160515~20160615 (258446),features2 from 20160201~20160514
    • dateset1: 20160414~20160514 (138303),features1 from 20160101~20160413
import pandas as pd
import numpy as np
from datetime import date

off_train = pd.read_csv('data/ccf_offline_stage1_train.csv',header=None,dtype='str')
off_train.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']
off_train = off_train.where(off_train.notnull(),'null')

off_test = pd.read_csv('data/ccf_offline_stage1_test_revised.csv',header=None)
off_test.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received']
off_test = off_test.where(off_test.notnull(),'null')
#11429826 record(872357 with coupon_id),762858 user(267448 in off_train)
on_train = pd.read_csv('data/ccf_online_stage1_train.csv',header=None)
on_train.columns = ['user_id','merchant_id','action','coupon_id','discount_rate','date_received','date']
on_train = on_train.where(on_train.notnull(),'null')

dataset3 = off_test
feature3 = off_train[((off_train.date>='20160315')&(off_train.date<='20160630'))|((off_train.date=='null')&(off_train.date_received>='20160315')&(off_train.date_received<='20160630'))]
dataset2 = off_train[(off_train.date_received>='20160515')&(off_train.date_received<='20160615')]
feature2 = off_train[(off_train.date>='20160201')&(off_train.date<='20160514')|((off_train.date=='null')&(off_train.date_received>='20160201')&(off_train.date_received<='20160514'))]
dataset1 = off_train[(off_train.date_received>='20160414')&(off_train.date_received<='20160514')]
feature1 = off_train[(off_train.date>='20160101')&(off_train.date<='20160413')|((off_train.date=='null')&(off_train.date_received>='20160101')&(off_train.date_received<='20160413'))]


OtherFeatures 特征表构建

  • other feature:
    • this_month_user_receive_all_coupon_count
    • this_month_user_receive_same_coupon_count
    • this_month_user_receive_same_coupon_lastone
    • this_month_user_receive_same_coupon_firstone
    • this_day_user_receive_all_coupon_count
    • this_day_user_receive_same_coupon_count
    • day_gap_before, day_gap_after (receive the same coupon)
# 这部分是对dataset3进行特征提取,提取结果输出到'data/other_feature3.csv',后面还需要分别对dataset2和dataset1提取OtherFeatures
t = dataset3[['user_id']]

t['this_month_user_receive_all_coupon_count'] = 1

t = t.groupby('user_id').agg('sum').reset_index()

t1 = dataset3[['user_id','coupon_id']]
t1['this_month_user_receive_same_coupon_count'] = 1
t1 = t1.groupby(['user_id','coupon_id']).agg('sum').reset_index()

t2 = dataset3[['user_id','coupon_id','date_received']]
t2.date_received = t2.date_received.astype('str')

t2 = t2.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()

t2['receive_number'] = t2.date_received.apply(lambda s:len(s.split(':')))

t2 = t2[t2.receive_number>1]

t2['max_date_received'] = t2.date_received.apply(lambda s:max([int(d) for d in s.split(':')]))
t2['min_date_received'] = t2.date_received.apply(lambda s:min([int(d) for d in s.split(':')]))
t2 = t2[['user_id','coupon_id','max_date_received','min_date_received']]


t3 = dataset3[['user_id','coupon_id','date_received']]

t3=t3.drop([0])  

t3 = pd.merge(t3,t2,on=['user_id','coupon_id'],how='left')

t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - t3.date_received.astype(int)  
#这个优惠券最远接受时间  
t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received.astype(int) -t3.min_date_received  

def is_firstlastone(x):
    if x==0:
        return 1
    elif x>0:
        return 0
    else:
        return -1 #those only receive once
        
t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(is_firstlastone)
t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(is_firstlastone)
t3 = t3[['user_id','coupon_id','date_received','this_month_user_receive_same_coupon_lastone','this_month_user_receive_same_coupon_firstone']]



t4 = dataset3[['user_id','date_received']]
t4['this_day_user_receive_all_coupon_count'] = 1
t4 = t4.groupby(['user_id','date_received']).agg('sum').reset_index()

t5 = dataset3[['user_id','coupon_id','date_received']]
t5['this_day_user_receive_same_coupon_count'] = 1
t5 = t5.groupby(['user_id','coupon_id','date_received']).agg('sum').reset_index()



t6 = dataset3[['user_id','coupon_id','date_received']]
t6.date_received = t6.date_received.astype('str')
t6 = t6.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()
t6.rename(columns={'date_received':'dates'},inplace=True)


def get_day_gap_before(s):
    date_received,dates = s.split('-')
    dates = dates.split(':')
    gaps = []
    for d in dates:
        this_gap = (date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))-date(int(d[0:4]),int(d[4:6]),int(d[6:8]))).days
        if this_gap>0:
            gaps.append(this_gap)
    if len(gaps)==0:
        return -1
    else:
        return min(gaps)
        
def get_day_gap_after(s):
    date_received,dates = s.split('-')
    dates = dates.split(':')
    gaps = []
    for d in dates:
        this_gap = (date(int(d[0:4]),int(d[4:6]),int(d[6:8]))-date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))).days
        if this_gap>0:
            gaps.append(this_gap)
    if len(gaps)==0:
        return -1
    else:
        return min(gaps)
    

t7 = dataset3[['user_id','coupon_id','date_received']]
t7 = t7.drop([0])
t7 = pd.merge(t7,t6,on=['user_id','coupon_id'],how='left')
t7['date_received_date'] = t7.date_received.astype('str') + '-' + t7.dates
t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before)
t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after)
t7 = t7[['user_id','coupon_id','date_received','day_gap_before','day_gap_after']]


other_feature3 = pd.merge(t1,t,on='user_id')
other_feature3 = pd.merge(other_feature3,t3,on=['user_id','coupon_id'])
other_feature3 = pd.merge(other_feature3,t4,on=['user_id','date_received'])
other_feature3 = pd.merge(other_feature3,t5,on=['user_id','coupon_id','date_received'])
other_feature3 = pd.merge(other_feature3,t7,on=['user_id','coupon_id','date_received'])
other_feature3.to_csv('data/other_feature3.csv',index=None)
print(other_feature3.shape)

以同样的方式对进行其他特征的提取,不再赘述

generate training and testing set 训练测试数据集分类

def get_label(s):
    s = s.split(':')
    if s[0]=='nan':
        return 0
    elif (date(int(s[0][0:4]),int(s[0][4:6]),int(s[0][6:8]))-date(int(s[1][0:4]),int(s[1][4:6]),int(s[1][6:8]))).days<=15:
        return 1
    else:
        return -1


coupon3 = pd.read_csv('data/coupon3_feature.csv')
merchant3 = pd.read_csv('data/merchant3_feature.csv')
user3 = pd.read_csv('data/user3_feature.csv')
user_merchant3 = pd.read_csv('data/user_merchant3.csv')
other_feature3 = pd.read_csv('data/other_feature3.csv')
dataset3 = pd.merge(coupon3,merchant3,on='merchant_id',how='left')
dataset3 = pd.merge(dataset3,user3,on='user_id',how='left')
dataset3 = pd.merge(dataset3,user_merchant3,on=['user_id','merchant_id'],how='left')
dataset3 = pd.merge(dataset3,other_feature3,on=['user_id','coupon_id','date_received'],how='left')
dataset3.drop_duplicates(inplace=True)
print(dataset3.shape)

dataset3.user_merchant_buy_total = dataset3.user_merchant_buy_total.replace(np.nan,0)
dataset3.user_merchant_any = dataset3.user_merchant_any.replace(np.nan,0)
dataset3.user_merchant_received = dataset3.user_merchant_received.replace(np.nan,0)
dataset3['is_weekend'] = dataset3.day_of_week.apply(lambda x:1 if x in (6,7) else 0)


weekday_dummies = pd.get_dummies(dataset3.day_of_week)

weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]

dataset3 = pd.concat([dataset3,weekday_dummies],axis=1)
dataset3.drop(['merchant_id','day_of_week','coupon_count'],axis=1,inplace=True)

dataset3 = dataset3.replace('null',np.nan)
dataset3.to_csv('data/dataset3.csv',index=None)



coupon2 = pd.read_csv('data/coupon2_feature.csv')
merchant2 = pd.read_csv('data/merchant2_feature.csv')
user2 = pd.read_csv('data/user2_feature.csv')
user_merchant2 = pd.read_csv('data/user_merchant2.csv')
other_feature2 = pd.read_csv('data/other_feature2.csv')
dataset2 = pd.merge(coupon2,merchant2,on='merchant_id',how='left')
dataset2 = pd.merge(dataset2,user2,on='user_id',how='left')
dataset2 = pd.merge(dataset2,user_merchant2,on=['user_id','merchant_id'],how='left')
dataset2 = pd.merge(dataset2,other_feature2,on=['user_id','coupon_id','date_received'],how='left')
dataset2.drop_duplicates(inplace=True)
print(dataset2.shape)

dataset2.user_merchant_buy_total = dataset2.user_merchant_buy_total.replace(np.nan,0)
dataset2.user_merchant_any = dataset2.user_merchant_any.replace(np.nan,0)
dataset2.user_merchant_received = dataset2.user_merchant_received.replace(np.nan,0)
dataset2['is_weekend'] = dataset2.day_of_week.apply(lambda x:1 if x in (6,7) else 0)
weekday_dummies = pd.get_dummies(dataset2.day_of_week)
weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]
dataset2 = pd.concat([dataset2,weekday_dummies],axis=1)

dataset2['label'] = dataset2.date.astype('str') + ':' +  dataset2.date_received.astype('str')
dataset2.label = dataset2.label.apply(get_label)

dataset2.drop(['merchant_id','day_of_week','date','date_received','coupon_id','coupon_count'],axis=1,inplace=True)
dataset2 = dataset2.replace('null',np.nan)
dataset2.to_csv('data/dataset2.csv',index=None)



coupon1 = pd.read_csv('data/coupon1_feature.csv')
merchant1 = pd.read_csv('data/merchant1_feature.csv')
user1 = pd.read_csv('data/user1_feature.csv')
user_merchant1 = pd.read_csv('data/user_merchant1.csv')
other_feature1 = pd.read_csv('data/other_feature1.csv')
dataset1 = pd.merge(coupon1,merchant1,on='merchant_id',how='left')
dataset1 = pd.merge(dataset1,user1,on='user_id',how='left')
dataset1 = pd.merge(dataset1,user_merchant1,on=['user_id','merchant_id'],how='left')
dataset1 = pd.merge(dataset1,other_feature1,on=['user_id','coupon_id','date_received'],how='left')
dataset1.drop_duplicates(inplace=True)
print(dataset1.shape)

dataset1.user_merchant_buy_total = dataset1.user_merchant_buy_total.replace(np.nan,0)
dataset1.user_merchant_any = dataset1.user_merchant_any.replace(np.nan,0)
dataset1.user_merchant_received = dataset1.user_merchant_received.replace(np.nan,0)
dataset1['is_weekend'] = dataset1.day_of_week.apply(lambda x:1 if x in (6,7) else 0)
weekday_dummies = pd.get_dummies(dataset1.day_of_week)
weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]
dataset1 = pd.concat([dataset1,weekday_dummies],axis=1)
dataset1['label'] = dataset1.date.astype('str') + ':' +  dataset1.date_received.astype('str')
dataset1.label = dataset1.label.apply(get_label)
dataset1.drop(['merchant_id','day_of_week','date','date_received','coupon_id','coupon_count'],axis=1,inplace=True)
dataset1 = dataset1.replace('null',np.nan)
dataset1.to_csv('data/dataset1.csv',index=None)

目录
相关文章
|
机器学习/深度学习 算法 C++
2019第十届蓝桥杯大赛青少年创意编程省赛C++组试题解析
2019第十届蓝桥杯大赛青少年创意编程省赛C++组试题解析
435 0
|
机器学习/深度学习 传感器 数据采集
阿里云天池大赛赛题解析——机器学习篇-赛题一(1)
阿里云是国内知名的云计算、大数据、人工智能技术型公司,是阿里巴巴集团最重要的技术部门。阿里云天池是阿里云面向开发者和教育行业的资源输出部门,天池大赛是国内最大规模的人工智能算法赛事,致力于汇聚全球AI精英为企业解决真实问题。自2014年至今已举办数十次行业顶级算法赛事,全球参赛开发者超过30万人。然而对于更广大的普通开发者和大学生群体来说,高规格的算法大赛仍然具有很高的门槛。本书就是针对受众最广泛的新手人群而编写的,精选阿里巴巴最典型的人工智能算法应用案例,邀请天池大赛最顶级的获奖选手联合编撰,公开那些鲜为人知的技术秘籍,力图使每一个涉足数据智能算法技术的开发者从中获益......
阿里云天池大赛赛题解析——机器学习篇-赛题一(1)
|
8月前
|
机器学习/深度学习 编解码 算法
【论文解析】CFPNet:用于目标检测的集中特征金字塔
【论文解析】CFPNet:用于目标检测的集中特征金字塔
524 0
【论文解析】CFPNet:用于目标检测的集中特征金字塔
|
网络协议 安全 Linux
深入解析HTTP请求:了解请求特征与报文格式的关键秘密
这篇文章将带您深入了解HTTP请求的特征和报文格式。HTTP作为一种简单、灵活且易于扩展的协议,适用于各种操作系统和设备。我们还将探讨持久性连接如何提高请求的效率。了解HTTP报文的构成,包括起始行、头部字段和消息正文,将帮助您更好地理解HTTP的工作原理。无论您是初学者还是已经有一定了解的读者,本文都将为您提供全面的HTTP知识。
375 1
深入解析HTTP请求:了解请求特征与报文格式的关键秘密
|
安全 前端开发 网络安全
2021年江苏省职业院校技能大赛中职 网络信息安全赛项试卷--web安全渗透测试解析
2021年江苏省职业院校技能大赛中职 网络信息安全赛项试卷--web安全渗透测试解析
245 0
|
机器学习/深度学习 负载均衡 算法
互联网+大赛-龙蜥社区赛题解析(基于 Pluasched 优化内核调度器)|学习笔记
快速学习互联网+大赛-龙蜥社区赛题解析(基于 Pluasched 优化内核调度器)
互联网+大赛-龙蜥社区赛题解析(基于 Pluasched 优化内核调度器)|学习笔记
|
存储 Cloud Native NoSQL
互联网+大赛-阿里云数据库赛题解析|学习笔记
快速学习互联网+大赛-阿里云数据库赛题解析
互联网+大赛-阿里云数据库赛题解析|学习笔记
|
机器学习/深度学习 存储 弹性计算
互联网大赛+命题解析-大规模云计算系统中计算子系统故障预测|学习笔记
快速学习互联网大赛+命题解析-大规模云计算系统中计算子系统故障预测
|
存储 Cloud Native 固态存储
赛题解析 | 第四届全球数据库大赛赛道2:分布式NewSQL性能挑战
第四届全球数据库大赛—PolarDB性能挑战赛,瓜分40万奖金池,只等你来!各赛道TOP10都可获得现金奖励
1493 0
赛题解析 | 第四届全球数据库大赛赛道2:分布式NewSQL性能挑战
|
存储 Cloud Native NoSQL
赛题解析 | 第四届全球数据库大赛——赛道1: 云原生共享内存数据库性能优化
第四届全球数据库大赛—PolarDB性能挑战赛,瓜分40万奖金池,只等你来!各赛道TOP10都可获得现金奖励
694 0
赛题解析 | 第四届全球数据库大赛——赛道1: 云原生共享内存数据库性能优化

热门文章

最新文章

推荐镜像

更多