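Chi-Square Binning, KS Binning, Optimal IV Binning, Tree-Based Binning, and Custom Binning - 阿里云开发者社区 (Alibaba Cloud Developer Community)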

Binning

The concept of binning

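What is binning? If you are new to machine learning, you may well wonder why anyone would bin data at all.

Data binning means discretizing continuous data. Discretized features are robust to outliers, faster to compute on, and cheaper to store. They are also easier to vary and iterate on, and models built on them tend to be more stable.

The short answer is that, for some models, binning pays off more directly than hyperparameter tuning does.

For example:

Decision trees can build more complex models of the data, but this depends strongly on how the data is represented. One way to make linear models more powerful on continuous data is feature binning (also called discretization), which splits one continuous feature into several features.

If the data set is very large and high-dimensional, but some features have a nonlinear relationship with the output, binning is a good way to increase modeling power.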
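The effect on a linear model can be illustrated with a small experiment. This is only a sketch under stated assumptions: the data is synthetic (y = sin(x) plus noise), ordinary least squares via NumPy stands in for "a linear model", and all names (`fit_mse`, `edges`, and so on) are illustrative rather than taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)  # nonlinear target (assumed toy data)

def fit_mse(features, target):
    """Fit ordinary least squares and return the training mean squared error."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.mean((features @ coef - target) ** 2))

# Linear model on the raw feature: intercept + slope
raw = np.column_stack([np.ones_like(x), x])

# Same model family on the binned feature: 10 equal-width bins, one-hot encoded
edges = np.linspace(-3, 3, 11)
bin_index = np.digitize(x, edges[1:-1])  # integers 0..9
binned = np.eye(10)[bin_index]

mse_raw = fit_mse(raw, y)
mse_binned = fit_mse(binned, y)
print(mse_raw, mse_binned)
```

On this toy data the binned representation fits the curve much more closely, because each bin gets its own constant instead of all points sharing a single slope.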

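Chi-square binning and code implementation

1. First decide on the final number of bins, that is, how many discrete values the variable should end up with.

2. If the variable has more than 100 samples, first split it into 100 equal-width bins (the implementation below uses quantiles of the variable for this initial split).

3. Compute the chi-square value for every pair of adjacent bins.

4. Merge the two adjacent intervals with the smallest chi-square value, then repeat steps 3 and 4 until the target number of bins is reached. (The implementation below merges the pair with the largest p-value, which is equivalent.)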
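Steps 3 and 4 can be sketched on a tiny table of per-bin class counts. This is a simplified illustration, not the implementation that follows: it works directly on the Pearson chi-square statistic, recomputes every pair after each merge, and uses hypothetical helper names (`pair_chi2`, `chi_merge`) and made-up counts.

```python
import numpy as np

def pair_chi2(counts):
    """Pearson chi-square statistic for a (2, n_classes) table of counts."""
    counts = np.asarray(counts, dtype=float)
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()
    mask = expected > 0  # skip cells whose expected count is zero
    return float((((counts - expected) ** 2)[mask] / expected[mask]).sum())

def chi_merge(counts, n_bins):
    """Repeatedly merge the adjacent pair with the smallest chi-square value."""
    rows = [np.asarray(row, dtype=float) for row in counts]
    while len(rows) > n_bins:
        chi = [pair_chi2([rows[i], rows[i + 1]]) for i in range(len(rows) - 1)]
        i = int(np.argmin(chi))
        rows[i] = rows[i] + rows[i + 1]  # merged bin holds the summed counts
        del rows[i + 1]
    return np.array(rows)

# Each row is one initial bin; columns are counts of label 0 and label 1 (toy numbers)
toy = [[10, 20], [12, 22], [30, 10], [28, 9]]
merged = chi_merge(toy, n_bins=2)
print(merged)
```

Bins with similar class proportions have a small chi-square value and are merged first, so the surviving boundaries are the ones where the label distribution actually changes.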

def tagcount(series,tags):
    """
    Count how many times each label occurs in the series; works for multi-class labels.
    series: a Series that contains only labels
    tags: the list of labels as they actually appear, e.g. [0,1] or [1,2,3]
    """
    countseries = series.value_counts()
    # Labels that do not occur in the series are counted as 0
    return [countseries.get(tag, 0) for tag in tags]
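def ChiMerge3(df, num_split,tags=[1,2,3],pvalue_edge=0.1,biggest=10,smallest=3,sample=None):
    """
    df: contains exactly two columns: the variable to bin, then the label
    num_split: number of intervals used for the initial split; useful when the data set is very large
    tags: list of labels, usually [0,1] for binary classification; use the actual label values
    pvalue_edge: p-value threshold used in the stopping condition
    biggest: maximum number of bins
    smallest: minimum number of bins
    sample: number of rows to sample, for extremely large data sets; binning is then done on the sample. Not needed below about a million rows
    """
    import pandas as pd
    import numpy as np
    import scipy.stats
    variable = df.columns[0]
    flag = df.columns[1]
    # Optionally bin on a random sample of the data
    if sample is not None:
        df = df.sample(n=sample)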
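    # Split the original series into at most num_split initial intervals (by quantile) and count each class per interval; the counts go into a matrix so that the p-values can be computed later.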
    percent = df[variable].quantile([1.0*i/num_split for i in range(num_split+1)],interpolation= "lower").drop_duplicates(keep="last").tolist()
    percent = percent[1:]
    np_regroup = []
    for i in range(len(percent)):
        if i == 0:
            tempdata = tagcount(df[df[variable]<=percent[i]][flag],tags)
            tempdata.insert(0,percent[i])
        elif i == len(percent)-1:
            tempdata = tagcount(df[df[variable]>percent[i-1]][flag],tags)
            tempdata.insert(0,percent[i])
        else:
            tempdata = tagcount(df[(df[variable]>percent[i-1])&(df[variable]<=percent[i])][flag],tags)
            tempdata.insert(0,percent[i])
        np_regroup.append(tempdata)
    np_regroup = pd.DataFrame(np_regroup)
    np_regroup = np.array(np_regroup)
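    # If some class has a count of 0 in both of two adjacent intervals, the chi-square test raises an error, so merge such intervals first as a preprocessing step.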
    i = 0
    while (i <= np_regroup.shape[0] - 2):
        check = 0
        for j in range(len(tags)):
            if np_regroup[i,j+1] ==0 and np_regroup[i+1,j+1]==0:
                check += 1
        # The for loop above checks whether one or more labels are 0 in both intervals; if so, the two intervals are merged below.
        if check>0:
            np_regroup[i,1:] = np_regroup[i,1:] + np_regroup[i+1,1:]
            np_regroup[i, 0] = np_regroup[i + 1, 0]
            np_regroup = np.delete(np_regroup, i + 1, 0)
            i = i - 1
        i = i + 1
    # Compute the chi-square test p-value for each pair of adjacent intervals
    chi_table = np.array([])
    for i in np.arange(np_regroup.shape[0] - 1):
        temparray = np_regroup[i:i+2,1:]
        pvalue = scipy.stats.chi2_contingency(temparray,correction=False)[1]
        chi_table = np.append(chi_table, pvalue)
    temp = max(chi_table)
    # Merge the two adjacent intervals with the largest p-value. Note that the p-values of all adjacent pairs are not recomputed after every merge; only the affected entries are updated.
    while True:
        # Stopping conditions; customize them to your needs
        if (len(chi_table) <= (biggest - 1) and temp <= pvalue_edge):
            break
        if len(chi_table)<smallest:
            break
        num = np.argwhere(chi_table==temp)
        for i in range(num.shape[0]-1,-1,-1):
            # Index of the pair to merge (largest p-value, i.e. smallest chi-square)
            chi_min_index = num[i][0]
            np_regroup[chi_min_index, 1:] = np_regroup[chi_min_index, 1:] + np_regroup[chi_min_index + 1, 1:]
            np_regroup[chi_min_index, 0] = np_regroup[chi_min_index + 1, 0]
            np_regroup = np.delete(np_regroup, chi_min_index + 1, 0)
            # The merged pair was the last two intervals: update one p-value and delete the last one. Drawing a picture makes this easy to see.
            if (chi_min_index == np_regroup.shape[0] - 1):
                temparray = np_regroup[chi_min_index-1:chi_min_index+1,1:]
                chi_table[chi_min_index - 1] = scipy.stats.chi2_contingency(temparray,correction=False)[1]
                chi_table = np.delete(chi_table, chi_min_index, axis=0)
            # The merged pair was the first two intervals: update one p-value and delete the stale one.
            elif (chi_min_index == 0):
                temparray = np_regroup[chi_min_index:chi_min_index+2,1:]
                chi_table[chi_min_index] = scipy.stats.chi2_contingency(temparray,correction=False)[1]
                chi_table = np.delete(chi_table, chi_min_index+1, axis=0)
            # The merged pair was in the middle: the p-values with both neighbours change, so two values are updated.
            else:
                # Recompute the p-value between the merged interval and the previous interval
                temparray = np_regroup[chi_min_index-1:chi_min_index+1,1:]
                chi_table[chi_min_index - 1] = scipy.stats.chi2_contingency(temparray,correction=False)[1]
                # Recompute the p-value between the merged interval and the next interval
                temparray = np_regroup[chi_min_index:chi_min_index+2,1:]
                chi_table[chi_min_index] = scipy.stats.chi2_contingency(temparray,correction=False)[1]
                # Delete the p-value that is now stale
                chi_table = np.delete(chi_table, chi_min_index + 1, axis=0)
        # Update the current largest p-value among adjacent intervals
        temp = max(chi_table)
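    print("*"*40)
    print("Final p-values of adjacent intervals:")
    print(chi_table)
    print("*"*40)
    # Save the result as a DataFrame. Customize as needed; two results are kept here:
    # 1. A table of the split intervals and the count of each label in every interval.
    # 2. The expression to pass to pandas apply, so the data set can be transformed directly:
    #    it goes in the XXX position of series.apply(lambda x: XXX)
    # Assemble the first result: the interval table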
    interval = []
    interval_num = np_regroup.shape[0]
    for i in range(interval_num):
        if i == 0:
            interval.append('x<=%f'%(np_regroup[i,0]))
        elif i == interval_num-1:
            interval.append('x>%f'%(np_regroup[i-1,0]))
        else:
            interval.append('x>%f and x<=%f'%(np_regroup[i-1,0],np_regroup[i,0]))
    result = pd.DataFrame(np_regroup)
    result[0] = interval
    result.columns = ['interval']+tags
    # Build the apply expression, i.e. the second result described above
    premise = "str(0) if "
    length_interval = len(interval)
    for i in range(length_interval):
        if i == length_interval-1:
            premise = premise[:-4]
            break
        premise = premise + interval[i] + " else " + 'str(%d+1)'%i + " if "
    return result,premise

pandas.cut:
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
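Parameters:
x: array-like, must be 1-dimensional; the input to be binned.
bins: int, sequence of scalars, or IntervalIndex. If bins is an int, it defines the number of equal-width bins in the range of x; in that case the range of x is extended by 0.1% on each side so that the minimum and maximum of x are included. If bins is a sequence, it defines the bin edges, which allows non-uniform bin widths; in that case the range of x is not extended.
right: bool. Whether the intervals are open on the left and closed on the right.
labels: labels for the resulting bins; must have the same length as the resulting bins. If False, only integer bin indicators are returned.
retbins: bool. Whether to return the bin edges.
precision: int. Number of decimal places used when storing and displaying the bin labels.
include_lowest: bool. Whether the left endpoint of the first interval is included.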
t/cc_jjj/article/details/78878878
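The effect of a few of these parameters can be checked on a small input. The values and edges below are made up for illustration; only `right`, `labels`, `include_lowest`, and `retbins` from the list above are exercised.

```python
import pandas as pd

values = [1, 5, 10]
edges_in = [0, 5, 10]

# Default right=True: intervals are (0, 5] and (5, 10]
codes_right = pd.cut(values, bins=edges_in, labels=False)

# right=False: intervals are [0, 5) and [5, 10), so 10 falls outside and becomes NaN
codes_left = pd.cut(values, bins=edges_in, right=False, labels=False)

# include_lowest=True makes the first interval closed on the left, so 0 is kept
codes_lowest = pd.cut([0, 5, 10], bins=edges_in, labels=False, include_lowest=True)

# Integer bins with retbins=True: the computed bin edges are returned as well
cats, edges_out = pd.cut(values, bins=3, retbins=True)

print(codes_right, codes_left, codes_lowest, edges_out)
```

With `labels=False` the result is a plain array of bin indices, which is often the most convenient form to feed into a model or an encoder.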

Custom binning code implementation

import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# A set of ages that we want to split into the bins "18 to 25", "26 to 35", "36 to 60" and "over 60"
bins = [18, 25, 35, 60, 100]
# The result is a special Categorical object whose categories are the bin intervals
cats = pd.cut(ages, bins)
print(cats)
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']  # attach a label to each bin
cats1 = pd.cut(ages, bins, labels=group_names)
print(cats1)
# Count the values per interval (the top-level pd.value_counts is deprecated in recent pandas)
aa = pd.Series(cats).value_counts()
print(aa)

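Benefits of variable binning for models

1. It reduces the influence of outliers and makes the model more stable

Binning reduces noise, which makes the model more robust.

2. Missing values can take part in binning as a special value, which reduces the uncertainty introduced by imputation (binning is itself a way to handle missing values)

Binning is usually combined with variable encoding, which greatly improves the interpretability of the variable.

3. It adds nonlinearity to the variable

This improves the model's fitting capacity.

4. It improves the model's predictive performance

We usually assume that the training set and the test set follow the same distribution. Binning discretizes continuous variables, which makes that assumption easier to satisfy.

In other words, it narrows the gap between the model's performance on the training set and on the test set.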
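Points 1 and 2 can be illustrated together. This is a small sketch on made-up data: the `income` values, the bin edges, and the `"missing"` category name are assumptions chosen for the example, not part of the original article.

```python
import numpy as np
import pandas as pd

# Toy feature with two missing values and one extreme outlier
income = pd.Series([1200.0, np.nan, 5300.0, 800.0, np.nan, 1_000_000.0])

# Open-ended last bin: the outlier lands in "high" together with ordinary large values
binned = pd.cut(income, bins=[0, 1000, 5000, np.inf], labels=["low", "mid", "high"])

# Treat missing values as a bin of their own instead of imputing them
binned = binned.cat.add_categories("missing").fillna("missing")
print(binned.tolist())
```

After binning, the value 1,000,000 is indistinguishable from 5,300 as far as the model is concerned, and the missing rows carry their own category that can be encoded like any other bin.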
