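I. Competition Overview
- Ad fraud is one of the major challenges in digital marketing: click fraud wastes a great deal of advertisers' money and distorts click data. This competition provides about 500,000 click records. Note that the data was generated by simulation, the meaning of some features is hidden, and the data has been anonymized.
- The task is to predict whether a user's click is a normal click or a fraudulent one. Click-fraud prediction applies to all kinds of feed ads, banner ads, and Baidu's ad network platform (百度网盟), helping merchants identify click fraud and pinpoint real users.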
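II. Key Points and Difficulties
- A key point of this competition is how to screen and process the features, and how to build new features from the original ones.
- From the business scenario, the number of clicks is an important feature in click-fraud prediction: fraudulent clicking often shows up as repeated clicks, so building count features on top of the original features is a focus of this modeling work.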
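III. Approach (classic machine learning + Baidu deep learning model)
- The solution ensembles XGBoost with Baidu's PALM language model (PaddlePALM). Most of the early work went into feature engineering for XGBoost.
- Because deep learning depends less on feature engineering, the data for the PALM model only had its missing values filled in.
- Checkpoint used for the submission: ./outputs/ckpt.step480000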
IV. Solution Details
Part 1. XGBoost feature processing
# Load the datasets
import pandas as pd

train = pd.read_csv('./train.csv', encoding='utf-8')
test = pd.read_csv('./test1.csv', encoding='utf-8')
features = train.drop(['Unnamed: 0', 'label'], axis=1)
labels = train['label']
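1. Building count features
- The user feature codes fea1_hash and fea_hash are sequences of different digits, and my guess is that they reflect how much of each kind of behavior a user has. Here the length of fea1_hash is used as a new feature, which improved model accuracy somewhat. You can try building more features from the different digit sequences.
- In addition, fraudulent clicking usually involves repeated clicks, either one user clicking repeatedly or many users repeatedly clicking the same media source. So I built count features from some of the user information (such as id, device, and version information) and the media information, which improved the model to some degree.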
# fea1_hash and fea_hash vary in length, which may reflect how much behavior a user has.
# Use the length of fea1_hash as a new feature (the digit sequences could be categorized further later).
features['fea1_hash4'] = features['fea1_hash'].map(lambda x: len(str(x)))
train['fea1_hash4'] = train['fea1_hash'].map(lambda x: len(str(x)))
test['fea1_hash4'] = test['fea1_hash'].map(lambda x: len(str(x)))

# Count feature: replace each value with the number of times it occurs in the column
def ded(x):
    return x.map(x.value_counts()).to_numpy()

# Concatenate the training and test sets
all_df = pd.concat([train, test])

# Original features whose count version was found to help
s = ['dev_height', 'dev_width', 'media_id', 'package', 'apptype', 'android_id', 'fea1_hash', 'fea_hash']
for f in s:
    all_df[f] = ded(all_df[f])
    train[f + '2'] = all_df[all_df['label'].notnull()][f]
    test[f + '2'] = all_df[all_df['label'].isnull()][f]
    features[f + '2'] = all_df[all_df['label'].notnull()][f]
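To make the count features concrete, here is a minimal sketch on toy data. The frame, its values, and the names `add_count_feature` and `media_user_cnt` are made up for illustration; only the idea of replacing a value by how often it occurs comes from the text. The joint count over two columns is an extra variation, not something the original code does.

```python
import pandas as pd

def add_count_feature(df, col):
    """Return, for each row, how often that row's value of `col` occurs in the column."""
    return df[col].map(df[col].value_counts())

# Hypothetical toy clicks: one device id and one media source per row
toy = pd.DataFrame({
    'android_id': [11, 11, 11, 42, 57, 57],
    'media_id':   [1, 1, 2, 2, 2, 2],
})

# Per-column counts, in the spirit of the "<name>2" features
toy['android_id2'] = add_count_feature(toy, 'android_id')
toy['media_id2'] = add_count_feature(toy, 'media_id')

# Joint count (assumption): how many times the same device hit the same media source
toy['media_user_cnt'] = toy.groupby(['android_id', 'media_id'])['android_id'].transform('size')
print(toy)
```

A device that clicks the same media source many times gets a large joint count, which is the repeated-click pattern the section is trying to capture.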
2. Data cleaning 1: osv
# Clean osv
def f(x):
    if str(x) == 'nan':
        return x
    # Keep only the digits and strip leading zeros
    r = ''.join(ch for ch in str(x) if ch.isdigit()).lstrip('0')
    if r == '':
        return 0
    # Take at most the first five digits (zero-padded), then divide by 10 until the value is at most 12
    k = int((r + '000')[:5])
    while k > 12:
        k = k / 10
    return float(k)

train['osv'] = train['osv'].apply(f)
features['osv'] = features['osv'].apply(f)
test['osv'] = test['osv'].apply(f)

# Categorical features, label-encoded together later
cate_features = ['apptype', 'carrier', 'ntt', 'location', 'cus_type', 'media_id',
                 'dev_width', 'dev_height', 'android_id', 'fea1_hash']
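3. New feature from the timestamp
- The timestamps do not span a long period, so they have little effect on overall model performance.
- Here I convert the timestamp to hours, take the maximum time minus the minimum time as the sampling period, and split that period into equal-width buckets. The approach rests on the assumption that fraudulent clicks may be concentrated in certain time windows.
- The model improved to some degree. You can also handle the timestamp in other ways.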
from datetime import datetime as dt

# My idea for handling dates: bucket them directly, i.e. cluster by time segment

# Convert timestamps to time_diff: hours since the earliest timestamp in the given frame
def get_date(features):
    features['timestamp'] = features['timestamp'].apply(lambda x: dt.fromtimestamp(x / 1000))
    start_time = features['timestamp'].min()
    features['time_diff'] = features['timestamp'] - start_time
    features['time_diff'] = features['time_diff'].dt.days * 24 + features['time_diff'].dt.seconds / 3600
    features.drop(['timestamp'], axis=1, inplace=True)
    return features

features = get_date(features)
test = get_date(test)
train = get_date(train)

mini = features['time_diff'].min()

# Tried several bucket widths; 13 hours gave the biggest improvement
def ts(x):
    return (x - mini) // 13

features['time_diff'] = features['time_diff'].apply(ts)
train['time_diff'] = train['time_diff'].apply(ts)
test['time_diff'] = test['time_diff'].apply(ts)
cate_features.append('time_diff')
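The bucketing idea can be sketched on a few made-up millisecond timestamps. The helper name `hour_buckets` and the toy values are assumptions for illustration. One deliberate difference from the code above: this sketch measures both sets from a single shared start time, so that the same bucket id means the same time window in training and test data.

```python
import pandas as pd

MS_PER_HOUR = 3_600_000

def hour_buckets(ts_ms, start_ms, width_hours=13):
    """Map millisecond timestamps to integer ids of buckets that are `width_hours` hours wide."""
    hours = (ts_ms - start_ms) / MS_PER_HOUR  # hours elapsed since the shared start
    return (hours // width_hours).astype(int)

# Hypothetical timestamps at 0h, 5h, 13h, 40h (train) and 12h, 26h (test)
train_ts = pd.Series([0, 5 * MS_PER_HOUR, 13 * MS_PER_HOUR, 40 * MS_PER_HOUR])
test_ts = pd.Series([12 * MS_PER_HOUR, 26 * MS_PER_HOUR])

start = min(train_ts.min(), test_ts.min())  # shared origin for both sets
train_bucket = hour_buckets(train_ts, start)
test_bucket = hour_buckets(test_ts, start)
print(train_bucket.tolist(), test_bucket.tolist())
```

The bucket width stays a parameter because the value 13 was found by trial on this dataset and may not carry over to other data.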
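4. Other data processing
- Here I drop overly long feature values and merge the test and training sets so that LabelEncoder can be fit on both.
- You may optionally clean lan, the device-language feature: several of its values differ only in letter case yet all mean Chinese. Since it has little effect on the result and may even interfere with it, I chose not to process it here.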
# Drop overly long feature values
def drop_long(x):
    return 0 if len(str(x)) > 16 else int(x)

for df in (features, test, train):
    for col in ('fea_hash', 'fea1_hash'):
        df[col] = df[col].map(drop_long)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Optional cleaning of lan
# def low(row):
#     return row.lower().replace('-', "").replace('_', "")

# Concatenate the training and test sets
all_df = pd.concat([train, test])
all_df['lan'] = all_df['lan'].astype('str')  # .apply(low)
all_df['lan'] = le.fit_transform(all_df['lan'])
all_df['fea_hash'] = all_df['fea_hash'].astype('str')
all_df['fea_hash'] = le.fit_transform(all_df['fea_hash'])
# Remove unneeded features
nonuse = ['os', 'sid']
features = features.drop(columns=nonuse)
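For readers who do want to clean lan, the effect of the commented-out `low` helper can be sketched as follows. The sample language codes are made up; the real column may contain other spellings.

```python
import pandas as pd

def normalize_lan(value):
    """Lower-case a language code and remove '-' and '_' so spelling variants collapse."""
    return str(value).lower().replace('-', '').replace('_', '')

# Hypothetical variants that all denote the same language, plus one other code
lan = pd.Series(['zh-CN', 'zh_CN', 'ZH-cn', 'en-US'])
cleaned = lan.map(normalize_lan).tolist()
print(cleaned)
```

Whether this helps is an empirical question: the author reports that it made little difference on this dataset.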
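5. Data cleaning 2: version + LabelEncoder
- Here the categorical features are fit and transformed with LabelEncoder so they can be fed to the machine learning model (you could also use one-hot encoding, embeddings, and so on).
- Note: LabelEncoder must be fit on the union of the training and test sets. If it is fit on the training set only, encoding the test set fails as soon as it meets a value it has never seen.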
# Clean version: plain digits or a 'v'/'V' prefix followed by digits; anything else becomes 0
def rep(x):
    s = str(x)
    if s.isdigit():
        return int(s)
    if s[:1] in ('v', 'V') and s[1:].isdigit():
        return int(s[1:])
    return 0

features['version'] = features['version'].apply(rep)
all_df['version'] = all_df['version'].apply(rep)
train['version'] = train['version'].apply(rep)

# Label-encode all categorical features together
for fea in cate_features:
    all_df[fea] = all_df[fea].astype('float')
    all_df[fea] = le.fit_transform(all_df[fea])

# Copy the encoded columns back to the training and test frames
train_rows = all_df[all_df['label'].notnull()]
test_rows = all_df[all_df['label'].isnull()]

for fea in ['lan', 'fea_hash'] + cate_features:
    features[fea] = train_rows[fea]
    train[fea] = train_rows[fea]

test_fea = test[features.columns].copy()
test_fea['version'] = test_fea['version'].apply(rep)
for fea in ['lan', 'fea_hash'] + cate_features:
    test_fea[fea] = test_rows[fea]

cate_features.append('lan')
features['version'] = features['version'].astype(float)
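The note about fitting LabelEncoder on the union of both sets can be checked with a small experiment. The two toy columns below are made up; the behavior shown is that of scikit-learn's `LabelEncoder`, which raises `ValueError` on labels it did not see during `fit`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_col = pd.Series(['zh-CN', 'en', 'zh-CN'])
test_col = pd.Series(['en', 'ja'])  # 'ja' never appears in the training column

# Fitting on the training column alone fails on the unseen value
le_train_only = LabelEncoder().fit(train_col)
try:
    le_train_only.transform(test_col)
    failed = False
except ValueError:
    failed = True

# Fitting on the union lets both columns be encoded with one consistent mapping
le_union = LabelEncoder().fit(pd.concat([train_col, test_col]))
train_enc = le_union.transform(train_col)
test_enc = le_union.transform(test_col)
print(failed, train_enc.tolist(), test_enc.tolist())
```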
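6. Picking out the values with no effect and those with the largest effect, then dropping them or building new features
- For each feature value, the ratio of its frequency among label-1 samples to its frequency among label-0 samples is used to judge which values are more prone to fraudulent clicks. The threshold ratio here is 7.
- The selected values are then used to build new features, where 1 means the value is more prone to fraud.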
# dev_ppi was found to hurt performance, so it is not used here
f1 = features.drop(['dev_ppi'], axis=1)

# Pick out the most influential feature values
selected_c = f1.columns

def find_key_f(train, selected):
    temp0 = train[train['label'] == 0]
    temp1 = train[train['label'] == 1]
    temp = pd.DataFrame(columns=[0, 1])
    # Share (in %) of each value within the label-0 and the label-1 samples
    temp[0] = temp0[selected].value_counts() / len(temp0) * 100
    temp[1] = temp1[selected].value_counts() / len(temp1) * 100
    temp[2] = temp[1] / temp[0]
    # Keep values whose label-1 share is more than 7 times their label-0 share
    result = temp[temp[2] > 7].sort_values(2, ascending=False).index
    return result

kf = {}
for selected in selected_c:
    kf[selected] = find_key_f(train, selected)

# Flag rows whose value is one of the selected high-risk values
def ff(x, selected):
    if x in kf[selected]:
        return 1
    else:
        return 0

for selected in selected_c:
    if len(kf[selected]) > 0:
        features[selected + '1'] = features[selected].apply(ff, args=(selected,))
        test_fea[selected + '1'] = test_fea[selected].apply(ff, args=(selected,))
        print(selected)
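The ratio test is easier to follow on a tiny example. This is a sketch under assumed data: the column name, the values 'a' and 'b', and the helper `risky_values` are all hypothetical. Each class is normalized by its own size, and values that never occur under one of the labels are ignored.

```python
import pandas as pd

def risky_values(df, col, label='label', ratio=7):
    """Values of `col` whose share among label-1 rows exceeds `ratio` times their share among label-0 rows."""
    share0 = df.loc[df[label] == 0, col].value_counts(normalize=True)
    share1 = df.loc[df[label] == 1, col].value_counts(normalize=True)
    both = pd.concat([share0, share1], axis=1, keys=[0, 1]).dropna()
    return set(both[both[1] / both[0] > ratio].index)

# Toy data: 'b' makes up 10% of normal clicks but 90% of fraudulent ones
toy = pd.DataFrame({
    'media_id': ['a'] * 9 + ['b'] + ['a'] + ['b'] * 9,
    'label':    [0] * 10 + [1] * 10,
})

risky = risky_values(toy, 'media_id')
toy['media_id1'] = toy['media_id'].isin(risky).astype(int)  # 1 marks a high-risk value
print(risky, toy['media_id1'].sum())
```

Because the flag is derived from the labels, it is worth checking on a validation split that it does not simply memorize the training set.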
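7. XGB model training
- Use 5-fold cross-validation to produce 5 sub-models, then average their outputs.
- You can try different values of max_depth, e.g. [6, 9, 12, 15]
- and of n_estimators, e.g. [2000, 10000]
- Don't forget to use the GPU provided by Baidu AI Studio for acceleration: tree_method='gpu_hist'
- subsample and colsample_bytree can be chosen according to the number of features in your data
- Normalizing the data is unnecessary for models that do not depend on distances, such as the tree model used here.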
# 5-fold cross-validation
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def ensemble(clf, train_x, train_y, test, cate_features):
    prob = []
    mean_acc = 0
    sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
    for k, (train_i, val_i) in enumerate(sk.split(train_x, train_y)):
        train_x_real = train_x.iloc[train_i]
        train_y_real = train_y.iloc[train_i]
        val_x = train_x.iloc[val_i]
        val_y = train_y.iloc[val_i]
        clf = clf.fit(train_x_real, train_y_real)
        val_y_pred = clf.predict(val_x)
        acc_val = accuracy_score(val_y, val_y_pred)
        print("Sub-model {} acc={}".format(k + 1, acc_val))
        mean_acc += acc_val / 5
        # Class probabilities for the whole test set from this fold's model
        test_y_pred = clf.predict_proba(test)
        prob.append(test_y_pred)
    print(mean_acc)
    # Average the test probabilities over the 5 sub-models
    mean_prob = sum(prob) / 5
    return mean_prob

clf = xgb.XGBClassifier(
    max_depth=13,
    learning_rate=0.005,
    n_estimators=2400,
    objective='binary:logistic',
    tree_method='gpu_hist',
    subsample=0.95,
    colsample_bytree=0.4,
    min_child_samples=3,  # LightGBM-style name; XGBoost's own counterpart is min_child_weight
    eval_metric='auc',
    reg_lambda=0.5,
)
mean_prob = ensemble(clf, features, labels, test_fea, cate_features)
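The fold-averaging scheme does not depend on XGBoost, so it can be tried quickly on CPU. The sketch below swaps in scikit-learn's `LogisticRegression` and synthetic data from `make_classification` purely as stand-ins; `cv_average_proba` is a hypothetical helper name. It clones the model for each fold so the sub-models stay independent.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def cv_average_proba(model, X, y, X_test, n_splits=5, seed=2021):
    """Fit one clone of `model` per fold; return mean test probabilities and mean validation accuracy."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_probas, fold_accs = [], []
    for train_idx, val_idx in skf.split(X, y):
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        fold_accs.append(accuracy_score(y[val_idx], fold_model.predict(X[val_idx])))
        fold_probas.append(fold_model.predict_proba(X_test))
    return np.mean(fold_probas, axis=0), float(np.mean(fold_accs))

# Synthetic stand-in data: 240 training rows, 60 "test" rows
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:240], y[:240], X[240:]

mean_proba, mean_acc = cv_average_proba(LogisticRegression(max_iter=1000), X_train, y_train, X_test)
print(mean_proba.shape, round(mean_acc, 3))
```

Column 1 of `mean_proba` is the averaged probability of the positive class, which is the quantity a later voting step would consume.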
8. Save the predictions for the later vote
# clf holds the model fitted on the last fold
xgb_result = clf.predict_proba(test_fea)
pd.DataFrame(xgb_result).to_csv('xgb_proba.csv')
Part 2. Data preprocessing for the PALM model
import pandas as pd

train = pd.read_csv('./train.csv', encoding='utf-8')
test = pd.read_csv('./test1.csv', encoding='utf-8')
sid = test.sid
features = train.drop(['Unnamed: 0', 'label', 'os', 'sid'], axis=1)
labels = train['label']
test = test[features.columns].copy()
1. Convert the timestamp to hours and truncate to an integer
from datetime import datetime as dt

def get_date(features):
    features['timestamp'] = features['timestamp'].apply(lambda x: dt.fromtimestamp(x / 1000))
    start_time = features['timestamp'].min()
    features['time_diff'] = features['timestamp'] - start_time
    features['time_diff'] = features['time_diff'].dt.days * 24 + features['time_diff'].dt.seconds / 3600
    features.drop(['timestamp'], axis=1, inplace=True)
    return features

features = get_date(features)
test = get_date(test)

# Truncate to whole hours
features['time_diff'] = features['time_diff'].astype(int)
test['time_diff'] = test['time_diff'].astype(int)
2. Handling missing values
Missing osv values are filled with the mode. lan is a string column, so its missing values are filled with the literal string 'nan' and kept as a category of their own, since missingness itself may carry information.
features.loc[:, "osv"] = features.loc[:, "osv"].fillna(features.loc[:, "osv"].mode()[0])
features.loc[:, "lan"] = features.loc[:, "lan"].fillna('nan')
test.loc[:, "osv"] = test.loc[:, "osv"].fillna(test.loc[:, "osv"].mode()[0])
test.loc[:, "lan"] = test.loc[:, "lan"].fillna('nan')
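As a small illustration of the two fill strategies, the sketch below uses made-up frames. It makes one assumption that differs from the code above: the mode is computed once on the training data and reused for the test data, so both sets are filled with the same value.

```python
import pandas as pd

# Hypothetical frames with gaps in both columns
train_toy = pd.DataFrame({'osv': [9.0, 9.0, None, 8.1], 'lan': ['zh-CN', None, 'en', 'zh-CN']})
test_toy = pd.DataFrame({'osv': [None, 10.0], 'lan': [None, 'zh-CN']})

osv_mode = train_toy['osv'].mode()[0]  # most frequent training value
for frame in (train_toy, test_toy):
    frame['osv'] = frame['osv'].fillna(osv_mode)  # numeric gap -> mode
    frame['lan'] = frame['lan'].fillna('nan')     # missing text becomes its own category
print(train_toy)
print(test_toy)
```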
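3. Joining features into sentences
Here I split the features into two groups, user information and media information, and join each group with spaces into a sentence, so that each feature value acts as one word. The click relationship between the user and the media is then modeled as something like a question-answering (sentence-pair) task in NLP: the user information goes into text_a and the media information into text_b.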
# Joining functions
def sentence(row):
    return ' '.join([str(row[i]) for i in int_type])

def sentence1(row):
    return ' '.join([str(row[i]) for i in string_type])

# Separate the media information (string_type) from the user information (int_type)
string_type = ['package', 'apptype', 'version', 'android_id', 'media_id']
int_type = []
for i in features.columns:
    if i not in string_type:
        int_type.append(i)

# Build the PALM training and prediction data
train_palm = pd.DataFrame()
train_palm['label'] = train['label']
train_palm['text_a'] = features[int_type].apply(sentence, axis=1)
train_palm['text_b'] = features[string_type].apply(sentence1, axis=1)

test_palm = pd.DataFrame()
test_palm['label'] = test.apptype  # label must not be empty; any placeholder value will do
test_palm['text_a'] = test[int_type].apply(sentence, axis=1)
test_palm['text_b'] = test[string_type].apply(sentence1, axis=1)

# Save the data PALM needs
train_palm.to_csv('./train_palm.csv', sep='\t', index=False)
test_palm.to_csv('./test_palm.csv', sep='\t', index=False)
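What a pair of "sentences" looks like can be seen on two made-up rows. The column split and the helper `to_sentence` are illustrative assumptions. Unlike the row-wise `apply` above, this sketch converts each column to text first and then joins, which keeps integer columns free of a trailing ".0".

```python
import pandas as pd

def to_sentence(df, cols):
    """Join the values of `cols` in each row into one space-separated string."""
    return df[cols].astype(str).agg(' '.join, axis=1)

# Hypothetical rows: two user-side and two media-side features
toy = pd.DataFrame({
    'carrier': [46000, 46001],
    'osv': [9.0, 8.1],
    'package': [18, 5],
    'media_id': [104, 29],
})
user_cols = ['carrier', 'osv']
media_cols = ['package', 'media_id']

pairs = pd.DataFrame({
    'text_a': to_sentence(toy, user_cols),   # user information
    'text_b': to_sentence(toy, media_cols),  # media information
})
print(pairs)
```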
4. Building and training the PALM model
!pip install paddlepalm
List and download the pretrained models
from paddlepalm import downloader
downloader.ls('pretrain')
downloader.download('pretrain', 'ERNIE-v2-en-base', './pretrain_models')
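import paddle
import json
import paddlepalm
Set the PALM parameters and start training
The parameters here follow the PaddlePALM example "Quora question similarity matching", with the learning rate, number of epochs, dropout rate, and so on modified. Feel free to tune them yourself.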
max_seqlen = 128
batch_size = 16
num_epochs = 30
lr = 1e-6
weight_decay = 0.0001
num_classes = 2
random_seed = 1
dropout_prob = 0.002
save_path = './outputs/'
save_type = 'ckpt'
pred_model_path = './outputs/ckpt.step15000'
print_steps = 1000
pred_output = './outputs/predict/'
pre_params = '/home/aistudio/pretrain_models/pretrain/ERNIE-v2-en-base/params'
task_name = 'Quora Question Pairs matching'
vocab_path = '/home/aistudio/pretrain_models/pretrain/ERNIE-v2-en-base/vocab.txt'
train_file = '/home/aistudio/train_palm.csv'
predict_file = '/home/aistudio/test_palm.csv'
config = json.load(open('/home/aistudio/pretrain_models/pretrain/ERNIE-v2-en-base/ernie_config.json'))
input_dim = config['hidden_size']
paddle.enable_static()
# step 1-1: create the reader
match_reader = paddlepalm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed)
# step 1-2: load the training data
match_reader.load_data(train_file, file_format='tsv', num_epochs=num_epochs, batch_size=batch_size)
# step 2: create a backbone of the model to extract text features
ernie = paddlepalm.backbone.ERNIE.from_config(config)
# step 3: register the backbone in reader
match_reader.register_with(ernie)
# step 4: create the task output head
match_head = paddlepalm.head.Match(num_classes, input_dim, dropout_prob)
# step 5-1: create a task trainer
trainer = paddlepalm.Trainer(task_name)
# step 5-2: build forward graph with backbone and task head
loss_var = trainer.build_forward(ernie, match_head)
# step 6-1*: use warmup
n_steps = match_reader.num_examples * num_epochs // batch_size
warmup_steps = int(0.1 * n_steps)
sched = paddlepalm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
# step 6-2: create an optimizer
adam = paddlepalm.optimizer.Adam(loss_var, lr, sched)
# step 6-3: build backward
trainer.build_backward(optimizer=adam, weight_decay=weight_decay)
# step 7: fit prepared reader and data
trainer.fit_reader(match_reader)
# step 8-1*: load pretrained parameters
trainer.load_pretrain(pre_params, False)
# step 8-2*: set saver to save model
save_steps = 15000
trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type)
# step 8-3: start training
trainer.train(print_steps=print_steps)
# Prediction code follows, assuming the trained model was saved as ./outputs/training_pred_model
print('prepare to predict...')
In validation, the parameters obtained by training from the pretrained model up to step 480000 predicted well.
vocab_path = '/home/aistudio/pretrain_models/pretrain/ERNIE-v2-en-base/vocab.txt'
# step 1-1: create the prediction reader
predict_match_reader = paddlepalm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')
# step 1-2: load the prediction data
predict_match_reader.load_data(predict_file, batch_size)
# step 2: create a backbone of the model to extract text features
pred_ernie = paddlepalm.backbone.ERNIE.from_config(config, phase='predict')
# step 3: register the backbone in reader
predict_match_reader.register_with(pred_ernie)
# step 4: create the task output head
match_pred_head = paddlepalm.head.Match(num_classes, input_dim, phase='predict')
predicter = paddlepalm.Trainer(task_name)
# step 5: build forward graph with backbone and task head
predicter.build_predict_forward(pred_ernie, match_pred_head)
pred_model_path = './outputs/ckpt.step480000'
# step 6: load the trained checkpoint
pred_ckpt = predicter.load_ckpt(pred_model_path)
# step 7: fit prepared reader and data
predicter.fit_reader(predict_match_reader, phase='predict')
# step 8: predict
print('predicting..')
predicter.predict(print_steps=print_steps, output_dir=pred_output)
5. Read the PALM predictions
palm_proba = pd.read_json('./outputs/predict/predictions.json', lines=True)
V. Model Ensembling
1. Get the fraud-click probability from the PALM predictions
palm_res = palm_proba.probs.apply(lambda x: x[1])
2. Read the XGBoost predicted probabilities
xgb_result = pd.read_csv('./xgb_proba.csv', encoding='utf-8')
3. Add the XGBoost and PALM predictions and vote with a threshold of 1
palm_label = palm_proba.label
vote = xgb_result['1'] + palm_res
vote = pd.DataFrame(vote)
result = vote[0].apply(lambda x: 1 if x >= 1 else 0)
0         0.119056
1         1.440423
2         0.080029
3         0.048663
4         1.920310
            ...
149995    1.868549
149996    1.923986
149997    1.897405
149998    1.939416
149999    1.930932
Length: 150000, dtype: float64
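Summing two probabilities and thresholding at 1 is the same as averaging them and thresholding at 0.5. A minimal sketch with made-up probabilities follows; `soft_vote` is a hypothetical helper name.

```python
import numpy as np

def soft_vote(p_a, p_b, threshold=1.0):
    """Label a sample 1 when its two fraud probabilities sum to at least `threshold`."""
    return (np.asarray(p_a) + np.asarray(p_b) >= threshold).astype(int)

# Hypothetical fraud probabilities from the two models for four clicks
xgb_p = np.array([0.10, 0.70, 0.95, 0.40])
palm_p = np.array([0.02, 0.74, 0.97, 0.55])

labels = soft_vote(xgb_p, palm_p)
same_as_mean = ((xgb_p + palm_p) / 2 >= 0.5).astype(int)
print(labels.tolist(), same_as_mean.tolist())
```

The last sample shows the effect of the vote: one model leans towards fraud (0.55), but the sum 0.95 stays below the threshold, so the click is labeled normal.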
4. Save the final result
a = pd.DataFrame(sid)
a['label'] = result
a.to_csv('composition.csv', index=False)
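VI. Summary and Directions for Improvement
- This model still has considerable room to improve. You can refine the feature processing further, for example by building new features from fea_hash and similar features as mentioned earlier.
- In a competition, think in terms of the business scenario: form hypotheses and verify whether they hold.
- Try to abstract the basic pattern of the task the competition asks for, then solve it with models from other fields that fit a related pattern. Here you could try other NLP models provided by Baidu, or models from other domains.
- Keep backups and records during feature engineering, data processing, and parameter tuning. This makes it easy to review, backtrack, and manage versions, and improves modeling efficiency.
- Finally, I wish everyone the best on their AI journey, and I hope Baidu AI Studio keeps growing quickly.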