I. Returning-Home Population Prediction
1. Overview
Building on China Unicom's big-data capabilities, this task models Unicom signaling data, call data, and internet-behavior data to predict whether an individual will return to their hometown to work.
Competition page: www.datafountain.cn/competition…
2. Data Description
train.csv: 70% of the full dataset (dataNoLabel is part of the training data; contestants may decide for themselves whether to use it)
test.csv: the remaining 30% of the full dataset
- Location features: derived from user signaling data generated by Unicom base stations;
- Internet features: derived from users' browsing behavior on the Unicom network;
- Call features: derived from users' daily calls and SMS messages.
The dataset is simulated from real data; it is not real production data.
| No. | Feature | Description | No. | Feature | Description |
| --- | --- | --- | --- | --- | --- |
| 1 | id | Unique identifier | 25 | f24 | Internet feature 19 |
| 2 | f1 | Location feature 01 | 26 | f25 | Internet feature 20 |
| 3 | f2 | Location feature 02 | 27 | f26 | Internet feature 21 |
| 4 | f4 | Location feature 04 | 28 | f27 | Internet feature 22 |
| 5 | f5 | Location feature 05 | 29 | f28 | Internet feature 23 |
| 6 | f6 | Location feature 06 | 30 | f29 | Internet feature 24 |
| 7 | f3 | Internet feature 01 | 31 | f30 | Internet feature 25 |
| 8 | f7 | Internet feature 02 | 32 | f31 | Internet feature 26 |
| 9 | f8 | Internet feature 03 | 33 | f32 | Internet feature 27 |
| 10 | f9 | Internet feature 04 | 34 | f33 | Internet feature 28 |
| 11 | f10 | Internet feature 05 | 35 | f34 | Internet feature 29 |
| 12 | f11 | Internet feature 06 | 36 | f35 | Internet feature 30 |
| 13 | f12 | Internet feature 07 | 37 | f36 | Internet feature 31 |
| 14 | f13 | Internet feature 08 | 38 | f37 | Internet feature 32 |
| 15 | f14 | Internet feature 09 | 39 | f38 | Internet feature 33 |
| 16 | f15 | Internet feature 10 | 40 | f39 | Internet feature 34 |
| 17 | f16 | Internet feature 11 | 41 | f40 | Internet feature 35 |
| 18 | f17 | Internet feature 12 | 42 | f41 | Internet feature 36 |
| 19 | f18 | Internet feature 13 | 43 | f42 | Internet feature 37 |
| 20 | f19 | Internet feature 14 | 44 | f43 | Call feature 01 |
| 21 | f20 | Internet feature 15 | 45 | f44 | Call feature 02 |
| 22 | f21 | Internet feature 16 | 46 | f45 | Call feature 03 |
| 23 | f22 | Internet feature 17 | 47 | f46 | Call feature 04 |
| 24 | f23 | Internet feature 18 | 48 | label | Target label |
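Per the table, the columns fall into three source groups. The following sketch (the list names are illustrative, not from the original notebook) makes the grouping explicit, which can be handy for per-group feature engineering later:

```python
# Group feature columns by source, following the table above.
# Note that f3 is an internet feature even though it sits among the location columns.
location_feats = ['f1', 'f2', 'f4', 'f5', 'f6']
internet_feats = ['f3'] + [f'f{i}' for i in range(7, 43)]  # f3 plus f7..f42
call_feats = [f'f{i}' for i in range(43, 47)]              # f43..f46

assert len(location_feats) + len(internet_feats) + len(call_feats) == 46
```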
II. Data Processing
1. Reading the Data
```bash
!unzip -qoa -O GBK data/data169716/返乡发展人群预测.zip
!ls 返乡发展人群预测
```
```
dataA.csv  dataNoLabel.csv  dataTrain.csv  submit_example_A.csv
```
```python
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
```
```python
path = '返乡发展人群预测/'
train = pd.read_csv(path + 'dataTrain.csv')
no_label = pd.read_csv(path + 'dataNoLabel.csv')
A = pd.read_csv(path + 'dataA.csv')
```
```python
train.head()
```
|   | id | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f38 | f39 | f40 | f41 | f42 | f43 | f44 | f45 | f46 | label |
| - | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ----- |
| 0 | 81167 | 0 | 1 | mid | 0 | 0 | 0 | 153 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 624 | 1539 | 0 |
| 1 | 50408 | 1 | 1 | mid | 0 | 0 | 21 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 186 | 366 | 0 |
| 2 | 9114 | 0 | 0 | high | 36 | 36 | 120 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24 | 48 | 1 |
| 3 | 53228 | 1 | 1 | low | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 9 | 0 |
| 4 | 56280 | 1 | 1 | mid | 9 | 51 | 294 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42 | 141 | 0 |

5 rows × 48 columns
```python
no_label.head()
```
|   | id | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f37 | f38 | f39 | f40 | f41 | f42 | f43 | f44 | f45 | f46 |
| - | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 137259 | 0 | 1 | mid | 0 | 0 | 0 | 0 | 0 | 0 | ... | 441 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 153 | 504 |
| 1 | 98870 | 1 | 1 | mid | 0 | 0 | 0 | 0 | 0 | 0 | ... | 333 | 0 | 0 | 0 | 0 | 0 | 18 | 63 | 657 | 1377 |
| 2 | 128313 | 0 | 1 | low | 0 | 0 | 0 | 0 | 0 | 0 | ... | 306 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 54 | 81 |
| 3 | 138249 | 1 | 1 | mid | 0 | 0 | 0 | 0 | 0 | 0 | ... | 234 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27 | 27 |
| 4 | 150656 | 1 | 1 | low | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 234 | 729 |

5 rows × 47 columns
```python
A.head()
```
|   | id | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f37 | f38 | f39 | f40 | f41 | f42 | f43 | f44 | f45 | f46 |
| - | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 1 | mid | 0 | 0 | 0 | 0 | 0 | 0 | ... | 147 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 51 | 168 |
| 1 | 2 | 0 | 1 | mid | 0 | 0 | 0 | 0 | 0 | 0 | ... | 111 | 0 | 0 | 0 | 0 | 0 | 6 | 21 | 219 | 459 |
| 2 | 3 | 0 | 1 | high | 0 | 0 | 0 | 0 | 0 | 0 | ... | 102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 27 |
| 3 | 4 | 0 | 1 | mid | 0 | 0 | 0 | 0 | 0 | 0 | ... | 78 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 9 |
| 4 | 5 | 1 | 1 | low | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 78 | 243 |

5 rows × 47 columns
2. Data Distribution
```python
print(train['label'].value_counts())
train['label'].value_counts().plot(kind='bar')
```
```
0    44950
1    14922
Name: label, dtype: int64
<matplotlib.axes._subplots.AxesSubplot at 0x7f59ee613090>
```
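The labels are imbalanced at roughly 3:1 (44950 zeros vs. 14922 ones). The notebook below keeps LightGBM's default class weighting; as an optional variation (an assumption, not part of the original pipeline), the imbalance could be quantified and fed to a binary objective via scale_pos_weight:

```python
# Illustrative only: quantify the class imbalance.
vc = train['label'].value_counts()
neg, pos = vc[0], vc[1]
print(f'imbalance ratio: {neg / pos:.2f}')  # ≈ 3.01

# With a binary LightGBM objective one could set, e.g.:
# params['scale_pos_weight'] = neg / pos
```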
3. Merging the Data
The training and test rows are concatenated so that the encoding step below sees every category, and missing values are filled with the string 'NAN' (see the LabelEncoder tip below, and the sanity check after this block).
```python
data = pd.concat([train, A], axis=0).reset_index(drop=True)
data = data.fillna('NAN')
```
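A quick sanity check (a sketch, not in the original notebook): after fillna, no NaN should remain. Note that filling with the string 'NAN' turns every column that had missing values into dtype object; the LabelEncoder step below re-encodes those columns as integers anyway.

```python
# Verify the fill worked and inspect the resulting column dtypes.
assert data.isna().sum().sum() == 0
print(data.dtypes.value_counts())
```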
III. Feature Processing
1. Installing LightGBM
```bash
!pip install -U -q lightgbm
!pip list | grep lightgbm
```
```
lightgbm                 3.3.2
```
2. LabelEncoder Encoding
LabelEncoder maps each discrete value to an integer in 0 to n − 1, where n is the number of distinct values taken by the feature.
- Tips:
  - LabelEncoder().fit_transform raises "argument must be a string or number" on columns that mix types;
  - so missing values must be handled before running LabelEncoder (which is why fillna was applied above).
Encoding takes two steps (a minimal toy example follows this list):
- first construct the encoder and call fit on the data to be encoded, which builds the internal key-value mapping;
- then apply the encoder to the data to be converted via transform.
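A minimal, self-contained illustration of the two steps (toy data, not from the competition):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['low', 'mid', 'high', 'mid'])        # build the value-to-index mapping
print(le.classes_)                           # ['high' 'low' 'mid'] (sorted order)
print(le.transform(['low', 'high', 'mid']))  # [1 0 2]
```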
```python
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
```
```python
lbls = {}
features = data.columns[1:-1]  # all 46 feature columns, excluding id and label
print(len(features))
for col in tqdm(features):
    lbl = LabelEncoder()
    lbl.fit(data[col])
    data[col] = lbl.transform(data[col])
    lbls[col] = lbl  # keep the fitted encoder in case the mapping is needed later
```
```
46
100%|██████████| 46/46 [00:00<00:00, 150.71it/s]
```
```python
# Split the merged frame back into train and test rows
train, test = data[:len(train)], data[len(train):]
```
IV. Model Training
```python
from sklearn.model_selection import StratifiedKFold
from lightgbm import early_stopping
from lightgbm import log_evaluation  # needs the upgraded lightgbm; restart the notebook after upgrading
```
```python
# Exclude non-feature columns: id (and the label itself) must not be used as features
feature_names = list(filter(lambda x: x not in ['id', 'label'], train.columns))
# Cast label back to int (the concat/fillna above left it as object dtype)
train['label'] = train['label'].apply(lambda i: int(i))
```
```
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
```
The SettingWithCopyWarning appears because train is a slice of data; it is harmless here, and could be avoided by taking the slice with .copy().
```python
def lgb_model(train, target, test, k):
    feats = [f for f in train.columns if f not in ['label', 'url', 'url_count']]
    print('Current num of features:', len(feats))
    oof_probs = np.zeros((train.shape[0], 2))
    output_preds = 0
    offline_score = []
    feature_importance_df = pd.DataFrame()
    parameters = {
        'learning_rate': 0.03,
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'metric': 'multi_error',
        'num_class': 2,
        'num_leaves': 31,
        'feature_fraction': 0.6,
        'bagging_fraction': 0.8,
        'min_data_in_leaf': 15,
        'verbose': -1,
        'nthread': 4,
        'max_depth': 7,
    }
    seeds = [2020]
    for seed in seeds:
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        for i, (train_index, test_index) in enumerate(folds.split(train, target)):
            train_y, test_y = target.iloc[train_index], target.iloc[test_index]
            train_X, test_X = train[feats].iloc[train_index, :], train[feats].iloc[test_index, :]
            dtrain = lgb.Dataset(train_X, label=train_y)
            dval = lgb.Dataset(test_X, label=test_y)
            lgb_model = lgb.train(
                parameters,
                dtrain,
                num_boost_round=20000,
                valid_sets=[dval],
                callbacks=[early_stopping(100), log_evaluation(100)],
            )
            oof_probs[test_index] = lgb_model.predict(test_X[feats], num_iteration=lgb_model.best_iteration) / len(seeds)
            offline_score.append(lgb_model.best_score['valid_0']['multi_error'])
            output_preds += lgb_model.predict(test[feats], num_iteration=lgb_model.best_iteration) / folds.n_splits / len(seeds)
            print(offline_score)
            # feature importance
            fold_importance_df = pd.DataFrame()
            fold_importance_df["feature"] = feats
            fold_importance_df["importance"] = lgb_model.feature_importance(importance_type='gain')
            fold_importance_df["fold"] = i + 1
            feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    # offline_score holds multi_error values, so report it as an error rate
    print('OOF-MEAN-ERROR:%.6f, OOF-STD-ERROR:%.6f' % (np.mean(offline_score), np.std(offline_score)))
    print('feature importance:')
    print(feature_importance_df.groupby(['feature'])['importance'].mean().sort_values(ascending=False).head(50))
    return output_preds, oof_probs, np.mean(offline_score), feature_importance_df
```
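A note on the parameter choice: with num_class=2, the 'multiclass' objective is functionally equivalent to LightGBM's binary objective, but it trains two trees per boosting round and predict returns an (n, 2) probability matrix, which is why np.argmax is used downstream. A hedged alternative, not what this notebook does, would be the binary setup:

```python
# Illustrative alternative: binary objective with a single probability column.
# Downstream code would then threshold preds > 0.5 instead of taking argmax.
binary_parameters = {
    'learning_rate': 0.03,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_error',
    'num_leaves': 31,
    'feature_fraction': 0.6,
    'bagging_fraction': 0.8,
    'min_data_in_leaf': 15,
    'verbose': -1,
    'nthread': 4,
    'max_depth': 7,
}
```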
```python
print('Start model training: train')
lgb_preds, lgb_oof, lgb_score, feature_importance_df = lgb_model(
    train=train[feature_names],
    target=train['label'],
    test=test[feature_names],
    k=10,
)
```
```
Start model training: train
Current num of features: 46
Training until validation scores don't improve for 100 rounds
[100]  valid_0's multi_error: 0.155311
[200]  valid_0's multi_error: 0.154309
Early stopping, best iteration is:
[169]  valid_0's multi_error: 0.153474
[0.15347361389445557]
Training until validation scores don't improve for 100 rounds
```
V. Prediction
```python
# Read the submission template
example_A = pd.read_csv(path + 'submit_example_A.csv')
example_A.head()
```
```python
# Take the class with the highest predicted probability as the label
example_A['label'] = np.argmax(lgb_preds, axis=1)
example_A['label'].value_counts()
```
```python
# Save the submission file
example_A.to_csv('sub.csv', index=None)
```
```python
# Sanity check: accuracy of the out-of-fold predictions on the training set
from sklearn.metrics import accuracy_score
accuracy_score(train['label'], np.argmax(lgb_oof, axis=1))
```
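Accuracy on the out-of-fold predictions mirrors the multi_error tracked during training. If the leaderboard metric were AUC instead (an assumption; check the competition page), the class-1 probability column of lgb_oof could be scored directly, as in this sketch:

```python
from sklearn.metrics import roc_auc_score

# Illustrative only: AUC of the out-of-fold probability of class 1.
# Whether AUC matches the official metric is an assumption to verify.
print(roc_auc_score(train['label'], lgb_oof[:, 1]))
```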
VI. References
- LightGBM repository: github.com/microsoft/L…
- LightGBM documentation: lightgbm.readthedocs.io/en/latest/