我正在尝试将baseline模型应用到我的数据集上,但是数据集是不平衡的,只有11%的数据属于positive这一类。我没有抽样就分割了数据,阳性记录的召回率很低。我想在不平衡测试数据的情况下平衡训练数据(0.5 - 0.5阳性)。有人知道怎么做吗?
#splitting train and test data
train,test = train_test_split(coupon,test_size = 0.3,random_state = 100)
##separating dependent and independent variables
cols = [i for i in coupon.columns if i not in target_col]
train_X = train[cols]
train_Y = train[target_col]
test_X = test[cols]
test_Y = test[target_col]
#Function attributes
#dataframe - processed dataframe
#Algorithm - Algorithm used
#training_x - predictor variables dataframe(training)
#testing_x - predictor variables dataframe(testing)
#training_y - target variable(training)
#training_y - target variable(testing)
#cf - ["coefficients","features"](cooefficients for logistic
#regression,features for tree based models)
#threshold_plot - if True returns threshold plot for model
def coupon_use_prediction(algorithm,training_x,testing_x,
training_y,testing_y,cols,cf,threshold_plot) :
#model
algorithm.fit(training_x,training_y)
predictions = algorithm.predict(testing_x)
probabilities = algorithm.predict_proba(testing_x)
#coeffs
if cf == "coefficients" :
coefficients = pd.DataFrame(algorithm.coef_.ravel())
elif cf == "features" :
coefficients = pd.DataFrame(algorithm.feature_importances_)
column_df = pd.DataFrame(cols)
coef_sumry = (pd.merge(coefficients,column_df,left_index= True,
right_index= True, how = "left"))
coef_sumry.columns = ["coefficients","features"]
coef_sumry = coef_sumry.sort_values(by = "coefficients",ascending = False)
print (algorithm)
print ("\n Classification report : \n",classification_report(testing_y,predictions))
print ("Accuracy Score : ",accuracy_score(testing_y,predictions))
问题来源StackOverflow 地址:/questions/59465731/how-to-balance-training-set-in-python
您必须平衡数据的方式:上采样或下采样。
向上采样:重复处理代表性不足的数据。下采样:采样过多的数据。
对于上采样,这非常容易。对于下采样,您可以使用sklearn.utils.resample并提供要获取的样本数量。
请注意,正如@ paritosh-singh所述,更改发行版可能不是唯一的解决方案。有机器学习算法可以:-支持不平衡数据-已经具有内置的加权选项来考虑数据分布
版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。