Predicting Steam Game Discount Sizes with Python Linear Regression (Part 2)

2022-12-13

Attempt 1: baseline model, keeping only games with more than 30 reviews

# Set a floor of 30 reviews: keep only games with more than 30
df1 = df1[df1.Reviews > 30]

Best Model: Lasso
Score: 0.419 +- 0.073
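
The article reports a best model and a score for each attempt but does not show how they are computed. The sketch below is one plausible way to produce such a line, assuming the score is the mean R² from k-fold cross-validation and the "+-" figure is its standard deviation across folds. The function name `compare_linear_models`, the alpha values, and the synthetic data are all assumptions for illustration, not the author's code.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_linear_models(X, y, n_splits=5, seed=0):
    """Return {model name: (mean R^2, std R^2)} from k-fold cross-validation."""
    candidates = {
        "LinearRegression": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),   # assumed regularization strength
        "Lasso": Lasso(alpha=0.1),   # assumed regularization strength
    }
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in candidates.items():
        # Scale inside the pipeline so each fold is standardized on its own training split
        pipeline = make_pipeline(StandardScaler(), model)
        scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")
        results[name] = (scores.mean(), scores.std())
    return results


# Synthetic stand-in data; the real Steam features and target are not reproduced here
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

results = compare_linear_models(X_demo, y_demo)
best_name = max(results, key=lambda name: results[name][0])
for name, (mean, std) in results.items():
    print(f"{name}: {mean:.3f} +- {std:.3f}")
print("Best Model:", best_name)
```

Running the same function on each modified DataFrame keeps the comparison between attempts consistent, since only the features change and not the evaluation procedure.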

Attempt 2: log-transform "Reviews" and "OriginalPrice"

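import numpy as np

# Log-transform the skewed review counts
df2.Reviews = np.log(df2.Reviews)
# OriginalPrice must be numeric before taking the log
df2.OriginalPrice = df2.OriginalPrice.astype(float)
df2.OriginalPrice = np.log(df2.OriginalPrice)

Best Model: Lasso
Score: 0.437 +- 0.104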
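
To illustrate why a log transform can help here, the following sketch compares skewness before and after the transform on a small, made-up set of review counts. It also shows `np.log1p` as a possible safeguard, on the assumption that a price column could contain zeros (for example free games), where `np.log` would return negative infinity. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical review counts with a long right tail
reviews = pd.Series([31, 45, 60, 120, 300, 950, 4200, 88000], dtype=float)
log_reviews = np.log(reviews)

print("skew before:", round(reviews.skew(), 2))
print("skew after: ", round(log_reviews.skew(), 2))

# np.log(0) is -inf, so a column that may contain zeros
# can use np.log1p, which computes log(1 + x)
with_zero = pd.Series([0.0, 9.99, 59.99])
safe_log = np.log1p(with_zero)
print(safe_log)
```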

Attempt 3: one-hot encode Main_Tag

import pandas as pd

# Check that the dummy columns are split correctly
pd.get_dummies(df3.Main_Tag).head(5)
# Add the dummy columns to the DataFrame
df3 = pd.concat([df3, pd.get_dummies(df3.Main_Tag).astype(int)], axis=1)
# Drop the original string column, which linear regression cannot use
df3.drop('Main_Tag', axis=1, inplace=True)

Best Model: Lasso
Score: 0.330 +- 0.073
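
One variation worth trying on this step is `drop_first=True`, which omits one category so that the dummy columns are not perfectly collinear with the intercept. This is a sketch on a made-up four-row frame, not the author's approach; only the `Main_Tag` and `OriginalPrice` column names come from the article, and the `tag` prefix is an assumption.

```python
import pandas as pd

# Hypothetical miniature frame; the column names follow the article
demo = pd.DataFrame({
    "Main_Tag": ["Action", "RPG", "Indie", "Action"],
    "OriginalPrice": [19.99, 39.99, 9.99, 59.99],
})

# drop_first=True drops the first category in sorted order ("Action"),
# which then becomes the baseline encoded by all-zero dummy columns
dummies = pd.get_dummies(demo["Main_Tag"], prefix="tag", drop_first=True).astype(int)
encoded = pd.concat([demo.drop(columns="Main_Tag"), dummies], axis=1)
print(encoded)
```

With regularized models such as Lasso or Ridge the full set of dummies still fits, so whether dropping a baseline category changes the score here would need to be checked on the real data.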

Attempt 4: one-hot encode all non-numeric data

# Build one dummy column per tag from the comma-separated tag list
split_tag = df4.All_Tags.astype(str).str.strip('[]').str.get_dummies(', ')
# Merge the dummy columns into the DataFrame
df4 = pd.concat([df4, split_tag], axis=1)
# As a precaution, drop any column that contains only zeros
df4 = df4.loc[:, (df4 != 0).any(axis=0)]

Best Model: Lasso
Score: 0.359 +- 0.080
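
The tag-splitting step can be seen more concretely on a tiny example. The sketch below assumes `All_Tags` holds strings such as `[Action, RPG]`; the exact storage format in the real dataset is not shown in the article, and the sample rows and the `Unused` column are made up.

```python
import pandas as pd

# Hypothetical rows; the real All_Tags format may differ
demo = pd.DataFrame({"All_Tags": ["[Action, RPG]", "[Indie]", "[Action, Indie, Strategy]"]})

# Strip the brackets, then create one 0/1 column per tag
tags = demo["All_Tags"].astype(str).str.strip("[]").str.get_dummies(", ")
print(tags)

# The all-zero filter from the article removes columns that never fire
with_unused = tags.assign(Unused=0)
filtered = with_unused.loc[:, (with_unused != 0).any(axis=0)]
print(list(filtered.columns))
```

If the tags were stored with quotes (for example `['Action', 'RPG']`), the quotes would end up in the column names, so an extra cleanup step such as `.str.replace("'", "")` before `get_dummies` might be needed.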

Attempt 5: combine attempts 2 and 4

# Dummy-encode all of the top 5 tags
split_tag = df5.All_Tags.astype(str).str.strip('[]').str.get_dummies(', ')
df5 = pd.concat([df5, split_tag], axis=1)
# Log-transform Reviews because the pair plots showed a skewed distribution
df5['Log_Review'] = np.log(df5['Reviews'])

Best Model: Lasso
Score: 0.359 +- 0.080

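The score is identical to attempt 4, which suggests that the log-transformed "Reviews" feature contributes nothing to the predicted discount percentage in this model. Note that the code adds Log_Review as a new column and leaves the raw Reviews column in place, so the identical score shows only that the extra feature did not help, not that review counts are irrelevant in general. This step can be skipped without affecting the result.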
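
An identical score is consistent with Lasso shrinking the coefficient of the new feature to exactly zero. One way to check this is to inspect the fitted coefficients. The sketch below does so on synthetic data, where `irrelevant` stands in for a feature like Log_Review; the data, the alpha value, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
useful = rng.normal(size=n)
irrelevant = rng.normal(size=n)  # stands in for a feature the model ignores
y = 2.0 * useful + rng.normal(scale=0.5, size=n)

# Standardize so the L1 penalty treats both features on the same scale
X = StandardScaler().fit_transform(np.column_stack([useful, irrelevant]))
model = Lasso(alpha=0.2).fit(X, y)  # alpha chosen arbitrarily for the demo

for name, coef in zip(["useful", "irrelevant"], model.coef_):
    print(f"{name}: {coef:.3f}")
```

A coefficient of zero means the feature was dropped from the fitted model, which would explain why adding it leaves the cross-validated score unchanged.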

Attempt 6: special handling (binning) of "Reviews" and "Days_Since_Release"

# Bin Reviews (highly correlated with popularity) at the 25th and 75th percentiles
df6.loc[df6['Reviews'] < 33, 'low_pop'] = 1
df6.loc[(df6.Reviews >= 33) & (df6.Reviews < 381), 'mid_pop'] = 1
df6.loc[df6['Reviews'] >= 381, 'high_pop'] = 1
# Bin Days_Since_Release at the 25th and 75th percentiles
df6.loc[df6['Days_Since_Release'] < 418, 'new_game'] = 1
df6.loc[(df6.Days_Since_Release >= 418) & (df6.Days_Since_Release < 1716), 'established_game'] = 1
df6.loc[df6['Days_Since_Release'] >= 1716, 'old_game'] = 1
# Fill the remaining NaNs with 0
df6.fillna(0, inplace=True)
# Drop the original columns to avoid multicollinearity
df6.drop(['Reviews', 'Days_Since_Release'], axis=1, inplace=True)

Each of the two columns is split into three features.

Best Model: Ridge
Score: 0.273 +- 0.044
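
The binning above hard-codes the percentile values. As an alternative sketch, the cut points can be derived from the data with `quantile` and applied with `pd.cut`, so they stay correct if the dataset changes. The sample values below are made up; only the `Reviews` column name and the bin labels come from the article.

```python
import pandas as pd

# Hypothetical review counts
demo = pd.DataFrame({"Reviews": [12, 33, 40, 150, 381, 900, 5000, 25]})

# Derive the cut points from the data instead of hard-coding them
q25, q75 = demo["Reviews"].quantile([0.25, 0.75])

# right=False makes each bin closed on the left, matching the
# `< q25`, `>= q25 and < q75`, `>= q75` rules used in the article
demo["pop_bin"] = pd.cut(
    demo["Reviews"],
    bins=[-float("inf"), q25, q75, float("inf")],
    labels=["low_pop", "mid_pop", "high_pop"],
    right=False,
)

# One 0/1 column per bin, in the order of the labels
dummies = pd.get_dummies(demo["pop_bin"]).astype(int)
print(dummies)
```

Every row falls into exactly one bin, so no `fillna(0)` pass is needed afterwards. The same pattern would apply to `Days_Since_Release`.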
