Attempt 1: baseline model, keeping only games with more than 30 reviews
# Setting a floor limit of 30
df1 = df1[df1.Reviews > 30]
Best Model: Lasso
Score: 0.419 +- 0.073
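The write-up does not show how the "Best Model" and "Score" lines are produced. A minimal sketch of one plausible procedure follows, assuming the score is the mean cross-validated R^2 and the "+-" value is its standard deviation across folds. The helper name `best_model`, the `alpha` values, and the synthetic data are all hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score


def best_model(X, y, cv=5):
    """Return (name, mean R^2, std R^2) for the best-scoring candidate."""
    # Hypothetical candidate set and regularization strengths.
    candidates = {"Lasso": Lasso(alpha=0.01), "Ridge": Ridge(alpha=1.0)}
    results = {}
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        results[name] = (scores.mean(), scores.std())
    winner = max(results, key=lambda k: results[k][0])
    return winner, results[winner][0], results[winner][1]


# Synthetic stand-in for the game data (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, 0.0, -0.3]) + rng.normal(scale=0.5, size=200)

name, mean, std = best_model(X_demo, y_demo)
print(f"Best Model: {name} Score: {mean:.3f} +- {std:.3f}")
```

Reporting the fold-to-fold spread alongside the mean makes it easier to judge whether the differences between attempts are larger than the noise in the estimate.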
Attempt 2: log-transform “Reviews” and “OriginalPrice”
df2.Reviews = np.log(df2.Reviews)
df2.OriginalPrice = df2.OriginalPrice.astype(float)
df2.OriginalPrice = np.log(df2.OriginalPrice)
Best Model: Lasso
Score: 0.437 +- 0.104
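A plain `np.log` returns `-inf` for a value of 0, which would break a linear model. The sketch below shows one common workaround, `np.log1p`, on a small made-up frame. Whether the real data contains zeros is not stated in the write-up, so treat this as a precaution rather than a required fix. The `demo` frame and the `Log_*` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows; prices arrive as strings, as the astype(float) call above suggests.
demo = pd.DataFrame({
    "Reviews": [0, 9, 99, 9999],
    "OriginalPrice": ["0.99", "9.99", "19.99", "59.99"],
})

demo["OriginalPrice"] = demo["OriginalPrice"].astype(float)

# log1p computes log(1 + x), so a review count of 0 maps to 0 instead of -inf.
demo["Log_Reviews"] = np.log1p(demo["Reviews"])

# Prices here are strictly positive, so a plain log is safe.
demo["Log_Price"] = np.log(demo["OriginalPrice"])

print(demo)
```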
Attempt 3: one-hot encode Main_Tag
# Checking to make sure the dummies are separated correctly
pd.get_dummies(df3.Main_Tag).head(5)
# Adding dummy categories into the dataframe
df3 = pd.concat([df3, pd.get_dummies(df3.Main_Tag).astype(int)], axis = 1)
# Drop original string based column to avoid conflict in linear regression
df3.drop('Main_Tag', axis = 1, inplace=True)
Best Model: Lasso
Score: 0.330 +- 0.073
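Keeping every dummy column makes the columns sum to 1 in each row, which is perfectly collinear with an intercept. That matters mostly for unregularized linear regression and less for Lasso or Ridge. A small sketch of the `drop_first=True` variant on made-up data follows. The `demo` frame and the `tag` prefix are assumptions, not part of the original notebook.

```python
import pandas as pd

# Made-up rows standing in for the real game data.
demo = pd.DataFrame({
    "Main_Tag": ["Action", "Indie", "RPG", "Action"],
    "Reviews": [120, 45, 300, 80],
})

# drop_first=True drops the first category (alphabetically "Action"),
# which then acts as the reference level.
dummies = pd.get_dummies(demo["Main_Tag"], prefix="tag", drop_first=True).astype(int)

# Replace the string column with its dummy columns.
encoded = pd.concat([demo.drop(columns="Main_Tag"), dummies], axis=1)
print(encoded)
```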
Attempt 4: try one-hot encoding all of the non-numeric data
# we can get dummies for each tag listed separated by comma
split_tag = df4.All_Tags.astype(str).str.strip('[]').str.get_dummies(', ')
# Now merge the dummies into the data frame to start EDA
df4 = pd.concat([df4, split_tag], axis=1)
# Remove any column that only has value of 0 as precaution
df4 = df4.loc[:, (df4 != 0).any(axis=0)]
Best Model: Lasso
Score: 0.359 +- 0.080
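To make the multi-label encoding concrete, here is a small sketch on made-up data, assuming `All_Tags` holds Python lists of tag names (the write-up does not say how the column is stored). If it does, `astype(str)` leaves quote characters around each tag after `strip('[]')`, so this sketch also removes them. That extra step is an addition, not something the original code does.

```python
import pandas as pd

# Made-up rows; each game carries a list of tags.
demo = pd.DataFrame({"All_Tags": [["Action", "Indie"], ["RPG"], ["Indie", "RPG"]]})

# "['Action', 'Indie']" -> "'Action', 'Indie'" -> "Action, Indie"
cleaned = (
    demo["All_Tags"]
    .astype(str)
    .str.strip("[]")
    .str.replace("'", "", regex=False)
)

# One indicator column per distinct tag; a row can have several 1s.
split_tag = cleaned.str.get_dummies(", ")
print(split_tag)
```

Unlike `pd.get_dummies` on a single-valued column, each row here can activate several columns at once, which is what a list of tags needs.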
Attempt 5: combine attempts 2 and 4
# Dummy all top 5 tags
split_tag = df.All_Tags.astype(str).str.strip('[]').str.get_dummies(', ')
df5 = pd.concat([df5, split_tag], axis=1)
# Log transform Review due to skewed pairplot graphs
df5['Log_Review'] = np.log(df5['Reviews'])
Best Model: Lasso
Score: 0.359 +- 0.080
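The score is identical to attempt 4, which suggests that the log-transformed review count has no effect on the predicted discount percentage. This step can therefore be skipped without changing the result.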
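One way to check a claim like this is to look at the fitted Lasso coefficient for the added column: Lasso sets the coefficients of uninformative features to exactly zero, which would leave the score unchanged. The sketch below illustrates this on synthetic data only. The feature name `x1`, the `alpha` value, and the data are hypothetical, and it does not reproduce the original result.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: the target depends on x1 only; Log_Review is pure noise.
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "Log_Review": rng.normal(size=n),
})
y = 2.0 * X["x1"] + rng.normal(scale=0.5, size=n)

model = Lasso(alpha=0.1).fit(X, y)

# Map each feature name to its fitted coefficient.
coefs = dict(zip(X.columns, model.coef_))
print(coefs)
```

A coefficient of exactly zero means the model ignores the column, which is weaker than saying the underlying variable is unrelated to the target.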
Attempt 6: special handling (binning) of “Reviews” and “Days_Since_Release”
# Binning reviews (which is highly correlated with popularity) based on the above 75th percentile and 25th percentile
df6.loc[df6['Reviews'] < 33, 'low_pop'] = 1
df6.loc[(df6.Reviews >= 33) & (df6.Reviews < 381), 'mid_pop'] = 1
df6.loc[df6['Reviews'] >= 381, 'high_pop'] = 1
# Binning Days_Since_Release based on the above 75th percentile and 25th percentile
df6.loc[df6['Days_Since_Release'] < 418, 'new_game'] = 1
df6.loc[(df6.Days_Since_Release >= 418) & (df6.Days_Since_Release < 1716), 'established_game'] = 1
df6.loc[df6['Days_Since_Release'] >= 1716, 'old_game'] = 1
# Fill all the NaN's
df6.fillna(0, inplace = True)
# Drop the old columns to avoid multicollinearity
df6.drop(['Reviews', 'Days_Since_Release'], axis=1, inplace = True)
Each of these two columns is split into three binary features.
Best Model: Ridge
Score: 0.273 +- 0.044
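The same three-way binning can be written more compactly with `pd.cut` followed by `pd.get_dummies`. The sketch below reuses the review thresholds 33 and 381 from attempt 6 on made-up data. The `demo` frame and the intermediate `pop_bin` column are hypothetical names.

```python
import pandas as pd

# Made-up review counts covering both thresholds and their edges.
demo = pd.DataFrame({"Reviews": [10, 33, 100, 380, 381, 5000]})

# right=False makes each bin closed on the left: [-inf, 33), [33, 381), [381, inf),
# matching the "< 33", ">= 33 and < 381", ">= 381" conditions.
bins = [-float("inf"), 33, 381, float("inf")]
labels = ["low_pop", "mid_pop", "high_pop"]
demo["pop_bin"] = pd.cut(demo["Reviews"], bins=bins, labels=labels, right=False)

# One indicator column per bin; exactly one is set in each row.
bin_dummies = pd.get_dummies(demo["pop_bin"]).astype(int)
print(pd.concat([demo, bin_dummies], axis=1))
```

Binning discards the ordering and magnitude within each bin, which may be one reason this attempt scores lower than the earlier ones.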