
How do I balance a training set in Python?

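I am trying to apply a baseline model to my dataset, but the dataset is imbalanced: only 11% of the records belong to the positive class. I split the data without any sampling, and recall on the positive records is very low. I want to balance the training data (0.5 / 0.5 positive to negative) while leaving the test data unbalanced. Does anyone know how to do this?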

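import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

#splitting train and test data
train,test = train_test_split(coupon,test_size = 0.3,random_state = 100)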

##separating dependent and independent variables
cols = [i for i in coupon.columns if i not in target_col]
train_X = train[cols]
train_Y = train[target_col]
test_X = test[cols]
test_Y = test[target_col]

#Function attributes
#dataframe     - processed dataframe
#Algorithm     - Algorithm used 
#training_x    - predictor variables dataframe(training)
#testing_x     - predictor variables dataframe(testing)
#training_y    - target variable(training)
#testing_y     - target variable(testing)
#cf - ["coefficients","features"](coefficients for logistic 
#regression,features for tree based models)

#threshold_plot - if True returns threshold plot for model
def coupon_use_prediction(algorithm,training_x,testing_x,
                          training_y,testing_y,cols,cf,threshold_plot) :

    #model
    algorithm.fit(training_x,training_y)
    predictions   = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)
    #coeffs
    if   cf == "coefficients" :
        coefficients  = pd.DataFrame(algorithm.coef_.ravel())
    elif cf == "features" :
        coefficients  = pd.DataFrame(algorithm.feature_importances_)

    column_df     = pd.DataFrame(cols)
    coef_sumry    = (pd.merge(coefficients,column_df,left_index= True,
                              right_index= True, how = "left"))
    coef_sumry.columns = ["coefficients","features"]
    coef_sumry    = coef_sumry.sort_values(by = "coefficients",ascending = False)

    print (algorithm)
    print ("\n Classification report : \n",classification_report(testing_y,predictions))
    print ("Accuracy   Score : ",accuracy_score(testing_y,predictions))

Question source: StackOverflow, /questions/59465731/how-to-balance-training-set-in-python

kun坤 2019-12-25 16:09:44
1 answer
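  • There are two ways to balance the data: up-sampling or down-sampling.

    Up-sampling: repeat records from the under-represented class. Down-sampling: draw a subset of the over-represented class.

    Up-sampling is straightforward. For down-sampling, you can use sklearn.utils.resample and pass the number of samples you want to keep.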
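
A minimal sketch of both approaches, applied to the training split only so the test split keeps its original class ratio. The helper `balance_training_frame`, the toy frame `toy_train`, and the column name `used_coupon` are hypothetical names chosen for illustration; the sketch also assumes a binary target stored in a single column, whereas the question's `target_col` may be a list.

```python
import pandas as pd
from sklearn.utils import resample


def balance_training_frame(train_df, target, how="down", random_state=100):
    """Return a 50/50 balanced copy of train_df (hypothetical helper).

    Assumes `target` names a single binary column.
    """
    # value_counts() sorts by frequency, most common label first
    counts = train_df[target].value_counts()
    majority_label, minority_label = counts.index[0], counts.index[-1]
    majority = train_df[train_df[target] == majority_label]
    minority = train_df[train_df[target] == minority_label]

    if how == "down":
        # Down-sampling: keep a random subset of the majority class
        majority = resample(majority, replace=False,
                            n_samples=len(minority),
                            random_state=random_state)
    else:
        # Up-sampling: repeat minority rows by sampling with replacement
        minority = resample(minority, replace=True,
                            n_samples=len(majority),
                            random_state=random_state)

    # Shuffle so the two classes are not stored as two contiguous runs
    balanced = pd.concat([majority, minority])
    return balanced.sample(frac=1, random_state=random_state)


# Toy training data (made-up values): 11 positives out of 100 rows
toy_train = pd.DataFrame({"feature": range(100),
                          "used_coupon": [1] * 11 + [0] * 89})

down = balance_training_frame(toy_train, "used_coupon", how="down")
up = balance_training_frame(toy_train, "used_coupon", how="up")
print(down["used_coupon"].value_counts())
print(up["used_coupon"].value_counts())
```

In the question's code, something like this would be applied to `train` before `train_X` and `train_Y` are built, while `test` is left untouched.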

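    Note that, as @paritosh-singh mentioned, changing the distribution may not be the only solution. Some machine learning algorithms either support imbalanced data directly or already have a built-in weighting option that takes the class distribution into account.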
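
As one possible illustration of the built-in weighting option, scikit-learn's `LogisticRegression` accepts `class_weight="balanced"`, which reweights classes inversely to their frequency in the training data. The sketch below uses synthetic data from `make_classification` as a stand-in for the coupon dataset (an assumption, since the real data is not shown) and compares recall on an unbalanced test split with and without weighting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the coupon data: roughly 11% positives (assumption)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.89, 0.11], random_state=100)

# stratify=y keeps the original class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=100)

# Same model twice: once unweighted, once with balanced class weights
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print("recall without weighting:", recall_plain)
print("recall with class_weight='balanced':", recall_weighted)
```

No rows are added or dropped here, so the training set stays as it is and only the loss is reweighted. Whether this or resampling works better depends on the data and the model.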

    2019-12-25 17:02:58