Ensemble Learning Case Study: Happiness Prediction
This case study comes from a Tianchi competition. The task is to predict individual happiness from 139 features, using roughly 8,000 training samples. The target takes the values 1, 2, 3, 4 and 5, where 1 denotes the lowest happiness and 5 the highest. The evaluation metric is mean squared error (MSE), so the problem is treated as a regression task.
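Since the score is simply the mean squared error between real-valued predictions and the true labels, it can be checked locally with scikit-learn; a minimal illustration (the arrays here are made up):

```python
from sklearn.metrics import mean_squared_error
import numpy as np

y_true = np.array([4, 5, 3, 4])            # hypothetical ground-truth happiness labels
y_pred = np.array([3.8, 4.6, 3.2, 4.1])    # hypothetical real-valued predictions
print(mean_squared_error(y_true, y_pred))  # 0.0625
```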
The Blending ensemble procedure works as follows (a minimal runnable sketch follows this list):
(1) Split the data into a training set and a test set (test_set); the training set is then split again into a training set (train_set) and a validation set (val_set).
(2) Build several first-layer models, which may be homogeneous or heterogeneous.
(3) Fit the models from step 2 on train_set, then use the trained models to predict val_set and test_set, producing val_predict and test_predict1.
(4) Build a second-layer model and train it on val_predict.
(5) Use the trained second-layer model to predict on test_predict1; the result is the final prediction for the whole test set.
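To make the five steps concrete before turning to the competition data, here is a minimal end-to-end sketch; the choice of first-layer models (random forest, extra trees) and of a linear regression as the second layer is an illustrative assumption, not the competition solution:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=2000, n_features=20, random_state=1)

# (1) train/test split, then train/validation split
X_tr1, X_test, y_tr1, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_tr1, y_tr1, test_size=0.3, random_state=1)

# (2) first-layer models (assumed choices)
layer1 = [RandomForestRegressor(random_state=1), ExtraTreesRegressor(random_state=1)]

# (3) fit on train_set, predict val_set and test_set
val_predict = np.column_stack([m.fit(X_train, y_train).predict(X_val) for m in layer1])
test_predict1 = np.column_stack([m.predict(X_test) for m in layer1])

# (4) second-layer model trained on the validation predictions
layer2 = LinearRegression().fit(val_predict, y_val)

# (5) the second-layer prediction on test_predict1 is the final test prediction
final_pred = layer2.predict(test_predict1)
print("Blending MSE:", mean_squared_error(y_test, final_pred))
```

Because the second layer only ever sees predictions made on data the first layer was not trained on, it learns how to weight the base models without inheriting their training-set overfitting.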
```python
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
import seaborn as sns

# Toy data for the Blending demo
from sklearn import datasets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

data, target = make_blobs(n_samples=10000, centers=2, random_state=1, cluster_std=1.0)

## Create the training and test sets
X_train1, X_test, y_train1, y_test = train_test_split(data, target, test_size=0.2, random_state=1)
## Split the training set again into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train1, y_train1, test_size=0.3, random_state=1)

print("The shape of training X:", X_train.shape)
print("The shape of training y:", y_train.shape)
print("The shape of test X:", X_test.shape)
print("The shape of test y:", y_test.shape)
print("The shape of validation X:", X_val.shape)
print("The shape of validation y:", y_val.shape)
```

2. Import the dataset

```python
# parse_dates converts the survey_time strings to datetime
train = pd.read_csv("train.csv", parse_dates=['survey_time'], encoding='latin-1')
test = pd.read_csv("test.csv", parse_dates=['survey_time'], encoding='latin-1')
# latin-1 is backward compatible with ASCII: it assigns characters to the byte values
# that ASCII leaves unused, so it can represent more characters

train = train[train["happiness"] != -8].reset_index(drop=True)  # drop rows where happiness is -8
train_data_copy = train.copy()
target_col = "happiness"  # target column
target = train_data_copy[target_col]
del train_data_copy[target_col]  # remove the target column

data = pd.concat([train_data_copy, test], axis=0, ignore_index=True)
```

3. Inspect the basic statistics

```python
train.happiness.describe()  # basic statistics of the target
```

```
count    7988.000000
mean        3.867927
std         0.818717
min         1.000000
25%         4.000000
50%         4.000000
75%         4.000000
max         5.000000
Name: happiness, dtype: float64
```

4. Data preprocessing

First, handle the negative values that occur throughout the data. Since the only negative codes are -1, -2, -3 and -8, each of them is handled separately:

```python
# make feature +5
# The csv contains negative codes (-1, -2, -3, -8); treat them as problematic answers but do not drop them
def getres1(row):
    return len([x for x in row.values if type(x) == int and x < 0])

def getres2(row):
    return len([x for x in row.values if type(x) == int and x == -8])

def getres3(row):
    return len([x for x in row.values if type(x) == int and x == -1])

def getres4(row):
    return len([x for x in row.values if type(x) == int and x == -2])

def getres5(row):
    return len([x for x in row.values if type(x) == int and x == -3])

# Count the problematic answers per row
data['neg1'] = data[data.columns].apply(lambda row: getres1(row), axis=1)
data.loc[data['neg1'] > 20, 'neg1'] = 20  # smoothing

data['neg2'] = data[data.columns].apply(lambda row: getres2(row), axis=1)
data['neg3'] = data[data.columns].apply(lambda row: getres3(row), axis=1)
data['neg4'] = data[data.columns].apply(lambda row: getres4(row), axis=1)
data['neg5'] = data[data.columns].apply(lambda row: getres5(row), axis=1)
```

Next, fill in missing values with fillna(value), where value is chosen per feature: most missing entries are treated as 0, the number of family members defaults to 1, and family income defaults to 66365, the mean income over all households. Part of the code:

```python
# Fill missing values: 25 columns in total, 4 dropped, 21 filled
# The columns below have missing entries; fill them as appropriate
data['work_status'] = data['work_status'].fillna(0)
data['work_yr'] = data['work_yr'].fillna(0)
data['work_manage'] = data['work_manage'].fillna(0)
data['work_type'] = data['work_type'].fillna(0)
data['edu_yr'] = data['edu_yr'].fillna(0)
data['edu_status'] = data['edu_status'].fillna(0)
data['s_work_type'] = data['s_work_type'].fillna(0)
data['s_work_status'] = data['s_work_status'].fillna(0)
data['s_political'] = data['s_political'].fillna(0)
data['s_hukou'] = data['s_hukou'].fillna(0)
data['s_income'] = data['s_income'].fillna(0)
data['s_birth'] = data['s_birth'].fillna(0)
data['s_edu'] = data['s_edu'].fillna(0)
data['s_work_exper'] = data['s_work_exper'].fillna(0)
data['minor_child'] = data['minor_child'].fillna(0)
data['marital_now'] = data['marital_now'].fillna(0)
data['marital_1st'] = data['marital_1st'].fillna(0)
data['social_neighbor'] = data['social_neighbor'].fillna(0)
data['social_friend'] = data['social_friend'].fillna(0)
data['hukou_loc'] = data['hukou_loc'].fillna(1)  # minimum is 1, the hukou location
data['family_income'] = data['family_income'].fillna(66365)  # mean after removing problematic values
```

Beyond that, features with special formats need separate treatment, in particular the time-related information. This is handled in two parts: first, the "continuous" age is bucketed into layers, here six age bands; second, each respondent's actual age is computed, since the raw spreadsheet only records the birth year and the survey time. The code:

```python
# 144 + 1 = 145
# Continue processing the special columns; see happiness_index.xlsx
data['survey_time'] = pd.to_datetime(data['survey_time'], format='%Y-%m-%d', errors='coerce')  # errors='coerce' guards against inconsistent time formats
data['survey_time'] = data['survey_time'].dt.year  # keep only the year, to compute ages
data['age'] = data['survey_time'] - data['birth']
# print(data['age'], data['survey_time'], data['birth'])

# Age layers: 145 + 1 = 146
bins = [0, 17, 26, 34, 50, 63, 100]
data['age_bin'] = pd.cut(data['age'], bins, labels=[0, 1, 2, 3, 4, 5])
```

Then the remaining negative codes are replaced feature by feature with reasonable defaults, usually a common or neutral answer:

```python
# Religion
data.loc[data['religion'] < 0, 'religion'] = 1            # 1 = no religious belief
data.loc[data['religion_freq'] < 0, 'religion_freq'] = 1  # 1 = never attends

# Education
data.loc[data['edu'] < 0, 'edu'] = 4  # junior high school
data.loc[data['edu_status'] < 0, 'edu_status'] = 0
data.loc[data['edu_yr'] < 0, 'edu_yr'] = 0

# Personal income
data.loc[data['income'] < 0, 'income'] = 0  # treat as no income

# Political status
data.loc[data['political'] < 0, 'political'] = 1  # treat as ordinary citizen

# Weight
data.loc[(data['weight_jin'] <= 80) & (data['height_cm'] >= 160), 'weight_jin'] = data['weight_jin'] * 2
data.loc[data['weight_jin'] <= 60, 'weight_jin'] = data['weight_jin'] * 2  # a personal guess: no adult weighs 60 jin

# Height
data.loc[data['height_cm'] < 150, 'height_cm'] = 150  # realistic adult height

# Health
data.loc[data['health'] < 0, 'health'] = 4  # treat as fairly healthy
data.loc[data['health_problem'] < 0, 'health_problem'] = 4

# Depression
data.loc[data['depression'] < 0, 'depression'] = 4  # most people: rarely

# Media
data.loc[data['media_1'] < 0, 'media_1'] = 1  # all "never"
data.loc[data['media_2'] < 0, 'media_2'] = 1
data.loc[data['media_3'] < 0, 'media_3'] = 1
data.loc[data['media_4'] < 0, 'media_4'] = 1
data.loc[data['media_5'] < 0, 'media_5'] = 1
data.loc[data['media_6'] < 0, 'media_6'] = 1

# Leisure activities
data.loc[data['leisure_1'] < 0, 'leisure_1'] = 1  # chosen by judgment
data.loc[data['leisure_2'] < 0, 'leisure_2'] = 5
data.loc[data['leisure_3'] < 0, 'leisure_3'] = 3
```

For the remaining leisure features the mode (computed with mode()) is used to correct the anomalous values; since these features describe leisure activities, imputing with the most common answer is reasonable. Note that mode() returns a Series, so the scalar is taken with [0]:

```python
data.loc[data['leisure_4'] < 0, 'leisure_4'] = data['leisure_4'].mode()[0]  # impute with the mode
data.loc[data['leisure_5'] < 0, 'leisure_5'] = data['leisure_5'].mode()[0]
data.loc[data['leisure_6'] < 0, 'leisure_6'] = data['leisure_6'].mode()[0]
data.loc[data['leisure_7'] < 0, 'leisure_7'] = data['leisure_7'].mode()[0]
data.loc[data['leisure_8'] < 0, 'leisure_8'] = data['leisure_8'].mode()[0]
data.loc[data['leisure_9'] < 0, 'leisure_9'] = data['leisure_9'].mode()[0]
data.loc[data['leisure_10'] < 0, 'leisure_10'] = data['leisure_10'].mode()[0]
data.loc[data['leisure_11'] < 0, 'leisure_11'] = data['leisure_11'].mode()[0]
data.loc[data['leisure_12'] < 0, 'leisure_12'] = data['leisure_12'].mode()[0]

data.loc[data['socialize'] < 0, 'socialize'] = 2  # rarely
data.loc[data['relax'] < 0, 'relax'] = 4          # often
data.loc[data['learn'] < 0, 'learn'] = 1          # never

# Social contact
data.loc[data['social_neighbor'] < 0, 'social_neighbor'] = 0
data.loc[data['social_friend'] < 0, 'social_friend'] = 0
data.loc[data['socia_outing'] < 0, 'socia_outing'] = 1
data.loc[data['neighbor_familiarity'] < 0, 'social_neighbor'] = 4

# Perceived social fairness
data.loc[data['equity'] < 0, 'equity'] = 4

# Social class
data.loc[data['class_10_before'] < 0, 'class_10_before'] = 3
data.loc[data['class'] < 0, 'class'] = 5
data.loc[data['class_10_after'] < 0, 'class_10_after'] = 5
data.loc[data['class_14'] < 0, 'class_14'] = 2

# Work situation
data.loc[data['work_status'] < 0, 'work_status'] = 0
data.loc[data['work_yr'] < 0, 'work_yr'] = 0
data.loc[data['work_manage'] < 0, 'work_manage'] = 0
data.loc[data['work_type'] < 0, 'work_type'] = 0

# Social insurance
data.loc[data['insur_1'] < 0, 'insur_1'] = 1
data.loc[data['insur_2'] < 0, 'insur_2'] = 1
data.loc[data['insur_3'] < 0, 'insur_3'] = 1
data.loc[data['insur_4'] < 0, 'insur_4'] = 1
data.loc[data['insur_1'] == 0, 'insur_1'] = 0
data.loc[data['insur_2'] == 0, 'insur_2'] = 0
data.loc[data['insur_3'] == 0, 'insur_3'] = 0
data.loc[data['insur_4'] == 0, 'insur_4'] = 0
```

Continuous features are filled with the mean instead (via mean()): family income is a continuous value, so the mode is no longer appropriate and the mean is used to complete the missing values. The code:

```python
# Family situation
family_income_mean = data['family_income'].mean()
data.loc[data['family_income'] < 0, 'family_income'] = family_income_mean
data.loc[data['family_m'] < 0, 'family_m'] = 2
data.loc[data['family_status'] < 0, 'family_status'] = 3
data.loc[data['house'] < 0, 'house'] = 1
data.loc[data['car'] < 0, 'car'] = 0
data.loc[data['car'] == 2, 'car'] = 0  # recode to 0/1
data.loc[data['son'] < 0, 'son'] = 1
data.loc[data['daughter'] < 0, 'daughter'] = 0
data.loc[data['minor_child'] < 0, 'minor_child'] = 0

# Marriage
data.loc[data['marital_1st'] < 0, 'marital_1st'] = 0
data.loc[data['marital_now'] < 0, 'marital_now'] = 0

# Spouse
data.loc[data['s_birth'] < 0, 's_birth'] = 0
data.loc[data['s_edu'] < 0, 's_edu'] = 0
data.loc[data['s_political'] < 0, 's_political'] = 0
data.loc[data['s_hukou'] < 0, 's_hukou'] = 0
data.loc[data['s_income'] < 0, 's_income'] = 0
data.loc[data['s_work_type'] < 0, 's_work_type'] = 0
data.loc[data['s_work_status'] < 0, 's_work_status'] = 0
data.loc[data['s_work_exper'] < 0, 's_work_exper'] = 0

# Parents
data.loc[data['f_birth'] < 0, 'f_birth'] = 1945
data.loc[data['f_edu'] < 0, 'f_edu'] = 1
data.loc[data['f_political'] < 0, 'f_political'] = 1
data.loc[data['f_work_14'] < 0, 'f_work_14'] = 2
data.loc[data['m_birth'] < 0, 'm_birth'] = 1940
data.loc[data['m_edu'] < 0, 'm_edu'] = 1
data.loc[data['m_political'] < 0, 'm_political'] = 1
data.loc[data['m_work_14'] < 0, 'm_work_14'] = 2

# Socio-economic status compared with peers
data.loc[data['status_peer'] < 0, 'status_peer'] = 2
# Socio-economic status compared with three years ago
data.loc[data['status_3_before'] < 0, 'status_3_before'] = 2

# Outlook
data.loc[data['view'] < 0, 'view'] = 4

# Expected annual income
data.loc[data['inc_ability'] <= 0, 'inc_ability'] = 2
inc_exp_mean = data['inc_exp'].mean()
data.loc[data['inc_exp'] <= 0, 'inc_exp'] = inc_exp_mean  # use the mean

# Remaining features: impute with the mode (dropping missing values first)
for i in range(1, 9 + 1):
    data.loc[data['public_service_' + str(i)] < 0, 'public_service_' + str(i)] = data['public_service_' + str(i)].dropna().mode().values
for i in range(1, 13 + 1):
    data.loc[data['trust_' + str(i)] < 0, 'trust_' + str(i)] = data['trust_' + str(i)].dropna().mode().values
```

5. Data augmentation

To go further, relationships between features are analyzed and new features are constructed. The following features are added: age at first marriage, age at the most recent marriage, whether remarried, spouse's age, age gap with the spouse, various income ratios (income relative to the spouse, expected income in ten years relative to current income, and so on), income-to-floor-area ratios (again including the ten-year expected income), social class features (class in ten years, class at age 14, and so on), a leisure index, a satisfaction index, and a trust index. In addition, values are normalized within the same province, city and county, e.g. the mean income within a province, and each individual's indicators relative to others in the same province, city or county. Comparisons with peers of the same age are also included, i.e. income, health and other indicators within the same age group. The code:

```python
# Age at first marriage: feature 147
data['marital_1stbir'] = data['marital_1st'] - data['birth']
# Age at the most recent marriage: 148
data['marital_nowtbir'] = data['marital_now'] - data['birth']
# Remarried or not: 149
data['mar'] = data['marital_nowtbir'] - data['marital_1stbir']
# Spouse's age: 150
data['marital_sbir'] = data['marital_now'] - data['s_birth']
# Age gap with the spouse: 151
data['age_'] = data['marital_nowtbir'] - data['marital_sbir']

# Income ratios: 151 + 7 = 158
data['income/s_income'] = data['income'] / (data['s_income'] + 1)  # cohabiting partner
data['income+s_income'] = data['income'] + (data['s_income'] + 1)
data['income/family_income'] = data['income'] / (data['family_income'] + 1)
data['all_income/family_income'] = (data['income'] + data['s_income']) / (data['family_income'] + 1)
data['income/inc_exp'] = data['income'] / (data['inc_exp'] + 1)
data['family_income/m'] = data['family_income'] / (data['family_m'] + 0.01)
data['income/m'] = data['income'] / (data['family_m'] + 0.01)

# Income / floor-area ratios: 158 + 4 = 162
data['income/floor_area'] = data['income'] / (data['floor_area'] + 0.01)
data['all_income/floor_area'] = (data['income'] + data['s_income']) / (data['floor_area'] + 0.01)
data['family_income/floor_area'] = data['family_income'] / (data['floor_area'] + 0.01)
data['floor_area/m'] = data['floor_area'] / (data['family_m'] + 0.01)

# Class: 162 + 3 = 165
data['class_10_diff'] = (data['class_10_after'] - data['class'])
data['class_diff'] = data['class'] - data['class_10_before']
data['class_14_diff'] = data['class'] - data['class_14']

# Leisure index: 166
leisure_fea_lis = ['leisure_' + str(i) for i in range(1, 13)]
data['leisure_sum'] = data[leisure_fea_lis].sum(axis=1)  # skew

# Satisfaction index: 167
public_service_fea_lis = ['public_service_' + str(i) for i in range(1, 10)]
data['public_service_sum'] = data[public_service_fea_lis].sum(axis=1)  # skew

# Trust index: 168
trust_fea_lis = ['trust_' + str(i) for i in range(1, 14)]
data['trust_sum'] = data[trust_fea_lis].sum(axis=1)  # skew

# Province means: 168 + 13 = 181
data['province_income_mean'] = data.groupby(['province'])['income'].transform('mean').values
data['province_family_income_mean'] = data.groupby(['province'])['family_income'].transform('mean').values
data['province_equity_mean'] = data.groupby(['province'])['equity'].transform('mean').values
data['province_depression_mean'] = data.groupby(['province'])['depression'].transform('mean').values
data['province_floor_area_mean'] = data.groupby(['province'])['floor_area'].transform('mean').values
data['province_health_mean'] = data.groupby(['province'])['health'].transform('mean').values
data['province_class_10_diff_mean'] = data.groupby(['province'])['class_10_diff'].transform('mean').values
data['province_class_mean'] = data.groupby(['province'])['class'].transform('mean').values
data['province_health_problem_mean'] = data.groupby(['province'])['health_problem'].transform('mean').values
data['province_family_status_mean'] = data.groupby(['province'])['family_status'].transform('mean').values
data['province_leisure_sum_mean'] = data.groupby(['province'])['leisure_sum'].transform('mean').values
data['province_public_service_sum_mean'] = data.groupby(['province'])['public_service_sum'].transform('mean').values
data['province_trust_sum_mean'] = data.groupby(['province'])['trust_sum'].transform('mean').values

# City means: 181 + 13 = 194
data['city_income_mean'] = data.groupby(['city'])['income'].transform('mean').values  # grouped by city
data['city_family_income_mean'] = data.groupby(['city'])['family_income'].transform('mean').values
data['city_equity_mean'] = data.groupby(['city'])['equity'].transform('mean').values
data['city_depression_mean'] = data.groupby(['city'])['depression'].transform('mean').values
data['city_floor_area_mean'] = data.groupby(['city'])['floor_area'].transform('mean').values
data['city_health_mean'] = data.groupby(['city'])['health'].transform('mean').values
data['city_class_10_diff_mean'] = data.groupby(['city'])['class_10_diff'].transform('mean').values
data['city_class_mean'] = data.groupby(['city'])['class'].transform('mean').values
data['city_health_problem_mean'] = data.groupby(['city'])['health_problem'].transform('mean').values
data['city_family_status_mean'] = data.groupby(['city'])['family_status'].transform('mean').values
data['city_leisure_sum_mean'] = data.groupby(['city'])['leisure_sum'].transform('mean').values
data['city_public_service_sum_mean'] = data.groupby(['city'])['public_service_sum'].transform('mean').values
data['city_trust_sum_mean'] = data.groupby(['city'])['trust_sum'].transform('mean').values

# County means: 194 + 13 = 207
data['county_income_mean'] = data.groupby(['county'])['income'].transform('mean').values
data['county_family_income_mean'] = data.groupby(['county'])['family_income'].transform('mean').values
data['county_equity_mean'] = data.groupby(['county'])['equity'].transform('mean').values
data['county_depression_mean'] = data.groupby(['county'])['depression'].transform('mean').values
data['county_floor_area_mean'] = data.groupby(['county'])['floor_area'].transform('mean').values
data['county_health_mean'] = data.groupby(['county'])['health'].transform('mean').values
data['county_class_10_diff_mean'] = data.groupby(['county'])['class_10_diff'].transform('mean').values
data['county_class_mean'] = data.groupby(['county'])['class'].transform('mean').values
data['county_health_problem_mean'] = data.groupby(['county'])['health_problem'].transform('mean').values
data['county_family_status_mean'] = data.groupby(['county'])['family_status'].transform('mean').values
data['county_leisure_sum_mean'] = data.groupby(['county'])['leisure_sum'].transform('mean').values
data['county_public_service_sum_mean'] = data.groupby(['county'])['public_service_sum'].transform('mean').values
data['county_trust_sum_mean'] = data.groupby(['county'])['trust_sum'].transform('mean').values

# Ratios relative to the same province: 207 + 13 = 220
data['income/province'] = data['income'] / (data['province_income_mean'])
data['family_income/province'] = data['family_income'] / (data['province_family_income_mean'])
data['equity/province'] = data['equity'] / (data['province_equity_mean'])
data['depression/province'] = data['depression'] / (data['province_depression_mean'])
data['floor_area/province'] = data['floor_area'] / (data['province_floor_area_mean'])
data['health/province'] = data['health'] / (data['province_health_mean'])
data['class_10_diff/province'] = data['class_10_diff'] / (data['province_class_10_diff_mean'])
data['class/province'] = data['class'] / (data['province_class_mean'])
data['health_problem/province'] = data['health_problem'] / (data['province_health_problem_mean'])
data['family_status/province'] = data['family_status'] / (data['province_family_status_mean'])
data['leisure_sum/province'] = data['leisure_sum'] / (data['province_leisure_sum_mean'])
data['public_service_sum/province'] = data['public_service_sum'] / (data['province_public_service_sum_mean'])
data['trust_sum/province'] = data['trust_sum'] / (data['province_trust_sum_mean'] + 1)

# Ratios relative to the same city: 220 + 13 = 233
data['income/city'] = data['income'] / (data['city_income_mean'])
data['family_income/city'] = data['family_income'] / (data['city_family_income_mean'])
data['equity/city'] = data['equity'] / (data['city_equity_mean'])
data['depression/city'] = data['depression'] / (data['city_depression_mean'])
data['floor_area/city'] = data['floor_area'] / (data['city_floor_area_mean'])
data['health/city'] = data['health'] / (data['city_health_mean'])
data['class_10_diff/city'] = data['class_10_diff'] / (data['city_class_10_diff_mean'])
data['class/city'] = data['class'] / (data['city_class_mean'])
data['health_problem/city'] = data['health_problem'] / (data['city_health_problem_mean'])
data['family_status/city'] = data['family_status'] / (data['city_family_status_mean'])
data['leisure_sum/city'] = data['leisure_sum'] / (data['city_leisure_sum_mean'])
data['public_service_sum/city'] = data['public_service_sum'] / (data['city_public_service_sum_mean'])
data['trust_sum/city'] = data['trust_sum'] / (data['city_trust_sum_mean'])

# Ratios relative to the same county: 233 + 13 = 246
data['income/county'] = data['income'] / (data['county_income_mean'])
data['family_income/county'] = data['family_income'] / (data['county_family_income_mean'])
data['equity/county'] = data['equity'] / (data['county_equity_mean'])
data['depression/county'] = data['depression'] / (data['county_depression_mean'])
data['floor_area/county'] = data['floor_area'] / (data['county_floor_area_mean'])
data['health/county'] = data['health'] / (data['county_health_mean'])
data['class_10_diff/county'] = data['class_10_diff'] / (data['county_class_10_diff_mean'])
data['class/county'] = data['class'] / (data['county_class_mean'])
data['health_problem/county'] = data['health_problem'] / (data['county_health_problem_mean'])
data['family_status/county'] = data['family_status'] / (data['county_family_status_mean'])
data['leisure_sum/county'] = data['leisure_sum'] / (data['county_leisure_sum_mean'])
data['public_service_sum/county'] = data['public_service_sum'] / (data['county_public_service_sum_mean'])
data['trust_sum/county'] = data['trust_sum'] / (data['county_trust_sum_mean'])

# Age-group means: 246 + 13 = 259
data['age_income_mean'] = data.groupby(['age'])['income'].transform('mean').values
data['age_family_income_mean'] = data.groupby(['age'])['family_income'].transform('mean').values
data['age_equity_mean'] = data.groupby(['age'])['equity'].transform('mean').values
data['age_depression_mean'] = data.groupby(['age'])['depression'].transform('mean').values
data['age_floor_area_mean'] = data.groupby(['age'])['floor_area'].transform('mean').values
data['age_health_mean'] = data.groupby(['age'])['health'].transform('mean').values
data['age_class_10_diff_mean'] = data.groupby(['age'])['class_10_diff'].transform('mean').values
data['age_class_mean'] = data.groupby(['age'])['class'].transform('mean').values
data['age_health_problem_mean'] = data.groupby(['age'])['health_problem'].transform('mean').values
data['age_family_status_mean'] = data.groupby(['age'])['family_status'].transform('mean').values
data['age_leisure_sum_mean'] = data.groupby(['age'])['leisure_sum'].transform('mean').values
data['age_public_service_sum_mean'] = data.groupby(['age'])['public_service_sum'].transform('mean').values
data['age_trust_sum_mean'] = data.groupby(['age'])['trust_sum'].transform('mean').values

# Compared with people of the same age: 259 + 13 = 272
data['income/age'] = data['income'] / (data['age_income_mean'])
data['family_income/age'] = data['family_income'] / (data['age_family_income_mean'])
data['equity/age'] = data['equity'] / (data['age_equity_mean'])
data['depression/age'] = data['depression'] / (data['age_depression_mean'])
data['floor_area/age'] = data['floor_area'] / (data['age_floor_area_mean'])
data['health/age'] = data['health'] / (data['age_health_mean'])
data['class_10_diff/age'] = data['class_10_diff'] / (data['age_class_10_diff_mean'])
data['class/age'] = data['class'] / (data['age_class_mean'])
data['health_problem/age'] = data['health_problem'] / (data['age_health_problem_mean'])
data['family_status/age'] = data['family_status'] / (data['age_family_status_mean'])
data['leisure_sum/age'] = data['leisure_sum'] / (data['age_leisure_sum_mean'])
data['public_service_sum/age'] = data['public_service_sum'] / (data['age_public_service_sum_mean'])
data['trust_sum/age'] = data['trust_sum'] / (data['age_trust_sum_mean'])
```

After these operations, the feature set grows from the initial 131 dimensions to 272. The next steps are feature engineering, model training and model fusion.

```python
print('shape', data.shape)
data.head()
```

```
shape (10956, 272)
```
data.head() shows the first five rows of the 272 columns, running from the raw survey fields (id, survey_type, province, city, county, survey_time, gender, birth, nationality, religion, ...) to the derived ratio features such as depression/age, health/age, leisure_sum/age and trust_sum/age (5 rows × 272 columns).

Features with very few valid samples, i.e. those dominated by negative codes or missing values, are then removed. Nine features are dropped, including "current highest education level", leaving a final set of 263 features:

```python
# 272 - 9 = 263
# Drop features with very few valid values, plus those already consumed above
del_list = ['id', 'survey_time', 'edu_other', 'invest_other', 'property_other', 'join_party', 'province', 'city', 'county']
use_feature = [clo for clo in data.columns if clo not in del_list]
data.fillna(0, inplace=True)  # fill the remainder with 0

train_shape = train.shape[0]  # number of training samples
features = data[use_feature].columns  # all remaining features

X_train_263 = data[:train_shape][use_feature].values
y_train = target
X_test_263 = data[train_shape:][use_feature].values
X_train_263.shape  # the final 263 features
```

```
(7988, 263)
```

The 49 most important features are selected as a second feature set, alongside the 263-dimensional one:

```python
imp_fea_49 = ['equity', 'depression', 'health', 'class', 'family_status', 'health_problem', 'class_10_after',
              'equity/province', 'equity/city', 'equity/county',
              'depression/province', 'depression/city', 'depression/county',
              'health/province', 'health/city', 'health/county',
              'class/province', 'class/city', 'class/county',
              'family_status/province', 'family_status/city', 'family_status/county',
              'family_income/province', 'family_income/city', 'family_income/county',
              'floor_area/province', 'floor_area/city', 'floor_area/county',
              'leisure_sum/province', 'leisure_sum/city', 'leisure_sum/county',
              'public_service_sum/province', 'public_service_sum/city', 'public_service_sum/county',
              'trust_sum/province', 'trust_sum/city', 'trust_sum/county',
              'income/m', 'public_service_sum', 'class_diff', 'status_3_before', 'age_income_mean', 'age_floor_area_mean',
              'weight_jin', 'height_cm',
              'health/age', 'depression/age', 'equity/age', 'leisure_sum/age']
train_shape = train.shape[0]
X_train_49 = data[:train_shape][imp_fea_49].values
X_test_49 = data[train_shape:][imp_fea_49].values
X_train_49.shape  # the 49 most important features
```

Finally, the discrete variables that need one-hot encoding are encoded and combined with the remaining columns into a third feature set, 383 dimensions in total:

```python
from sklearn import preprocessing  # for OneHotEncoder

cat_fea = ['survey_type', 'gender', 'nationality', 'edu_status', 'political', 'hukou', 'hukou_loc', 'work_exper',
           'work_status', 'work_type', 'work_manage', 'marital', 's_political', 's_hukou', 's_work_exper',
           's_work_status', 's_work_type', 'f_political', 'f_work_14', 'm_political', 'm_work_14']
# columns that are already 0/1 do not need one-hot encoding
noc_fea = [clo for clo in use_feature if clo not in cat_fea]

onehot_data = data[cat_fea].values
enc = preprocessing.OneHotEncoder(categories='auto')
oh_data = enc.fit_transform(onehot_data).toarray()
oh_data.shape  # the one-hot encoded columns

X_train_oh = oh_data[:train_shape, :]
X_test_oh = oh_data[train_shape:, :]
X_train_oh.shape  # the training part

X_train_383 = np.column_stack([data[:train_shape][noc_fea].values, X_train_oh])  # noc_fea first, then cat_fea
X_test_383 = np.column_stack([data[train_shape:][noc_fea].values, X_test_oh])
X_train_383.shape
```
This completes three engineered training sets. The first is the 49 most important features selected above, covering health, social class, income relative to same-age peers, and so on. The second is the expanded 263-dimensional feature set (effectively the base features). The third is the one-hot encoded set. One-hot encoding is used because some features are categorical: gender, for instance, is coded 1 for male and 2 for female, and one-hot encoding turns it into indicator columns, which makes the learner more robust; likewise nationality originally takes the integer values 1 to 56, and feeding those integers directly to a model hurts robustness, so one-hot encoding expands them into binary (0/1) indicator features, one per category.

6. Modeling

First, model the original 263-dimensional features with LightGBM, using 5-fold cross-validation:

```python
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

##### lgb_263
# LightGBM gradient-boosted trees
lgb_263_param = {
    'num_leaves': 7,
    'min_data_in_leaf': 20,      # minimum number of records a leaf may hold
    'objective': 'regression',
    'max_depth': -1,
    'learning_rate': 0.003,
    "boosting": "gbdt",          # the gbdt algorithm
    "feature_fraction": 0.18,    # e.g. 0.18 means 18% of the features are sampled to build each tree
    "bagging_freq": 1,
    "bagging_fraction": 0.55,    # fraction of the data used in each iteration
    "bagging_seed": 14,
    "metric": 'mse',
    "lambda_l1": 0.1,
    "lambda_l2": 0.2,
    "verbosity": -1}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)  # 5-fold CV
oof_lgb_263 = np.zeros(len(X_train_263))
predictions_lgb_263 = np.zeros(len(X_test_263))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = lgb.Dataset(X_train_263[trn_idx], y_train[trn_idx])
    val_data = lgb.Dataset(X_train_263[val_idx], y_train[val_idx])  # train:val = 4:1
    num_round = 10000
    lgb_263 = lgb.train(lgb_263_param, trn_data, num_round, valid_sets=[trn_data, val_data],
                        verbose_eval=500, early_stopping_rounds=800)
    oof_lgb_263[val_idx] = lgb_263.predict(X_train_263[val_idx], num_iteration=lgb_263.best_iteration)
    predictions_lgb_263 += lgb_263.predict(X_test_263, num_iteration=lgb_263.best_iteration) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb_263, target)))
```

Next, use the trained LightGBM model to rank and visualize feature importance. The most important feature turns out to be health/age, health relative to people of the same age, which matches intuition:

```python
# --------------- Feature importance
pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.max_rows', None)     # show all rows
pd.set_option('max_colwidth', 100)          # widen the value display (default is 50)

df = pd.DataFrame(data[use_feature].columns.tolist(), columns=['feature'])
df['importance'] = list(lgb_263.feature_importance())
df = df.sort_values(by='importance', ascending=False)

plt.figure(figsize=(14, 28))
sns.barplot(x="importance", y="feature", data=df.head(50))
plt.title('Features importance (averaged/folds)')
plt.tight_layout()
```

Then apply other common machine learning methods to the 263-dimensional features, starting with XGBoost:

```python
import xgboost as xgb

##### xgb_263
xgb_263_params = {'eta': 0.02,               # learning rate
                  'max_depth': 6,
                  'min_child_weight': 3,     # minimum sum of instance weights in a leaf
                  'gamma': 0,                # minimum loss reduction required to split a node
                  'subsample': 0.7,          # fraction of rows sampled for each tree
                  'colsample_bytree': 0.3,   # fraction of columns (features) sampled for each tree
                  'lambda': 2,
                  'objective': 'reg:linear',
                  'eval_metric': 'rmse',
                  'silent': True,
                  'nthread': -1}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)
oof_xgb_263 = np.zeros(len(X_train_263))
predictions_xgb_263 = np.zeros(len(X_test_263))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = xgb.DMatrix(X_train_263[trn_idx], y_train[trn_idx])
    val_data = xgb.DMatrix(X_train_263[val_idx], y_train[val_idx])
    watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
    xgb_263 = xgb.train(dtrain=trn_data, num_boost_round=3000, evals=watchlist,
                        early_stopping_rounds=600, verbose_eval=500, params=xgb_263_params)
    oof_xgb_263[val_idx] = xgb_263.predict(xgb.DMatrix(X_train_263[val_idx]), ntree_limit=xgb_263.best_ntree_limit)
    predictions_xgb_263 += xgb_263.predict(xgb.DMatrix(X_test_263), ntree_limit=xgb_263.best_ntree_limit) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb_263, target)))
```
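The loops above accumulate out-of-fold predictions (oof_lgb_263, oof_xgb_263) alongside fold-averaged test predictions, which is exactly the input a second layer needs. As a rough sketch of where this is heading (the write-up continues with more base models before fusing; the LinearRegression second layer here is an illustrative assumption, not the competition solution):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Second-layer training set: one column of out-of-fold predictions per base model
train_stack = np.column_stack([oof_lgb_263, oof_xgb_263])
test_stack = np.column_stack([predictions_lgb_263, predictions_xgb_263])

# Fit the second-layer model on the out-of-fold predictions against the true target
meta = LinearRegression().fit(train_stack, target)
print("stacked score:", mean_squared_error(meta.predict(train_stack), target))

# Final test-set prediction from the two-layer ensemble
final_predictions = meta.predict(test_stack)
```

In practice the second-layer fit would itself be cross-validated (or trained on held-out predictions, as in the Blending recipe at the top of this section) so that the reported score is not optimistic.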