Boosting in Ensemble Learning: A Gradient Boosting Implementation



The general Gradient Boosting algorithm

  1. Initialize: \(f_0(x) = \mathop{\arg\min}\limits_\gamma \sum\limits_{i=1}^N L(y_i, \gamma)\)

  2. for m=1 to M:

    (a) Compute the negative gradient (a worked example follows this list): $\tilde{y}_i = -\frac{\partial L(y_i,f_{m-1}(x_i))}{\partial f_{m-1}(x_i)}, \qquad i = 1,2 \cdots N$
    
    (b) Fit a base learner $h_m(x)$ to $\tilde{y}_i$ by minimizing squared error: $w_m = \mathop{\arg\min}\limits_w \sum\limits_{i=1}^{N} \left[\tilde{y}_i - h_m(x_i\,;\,w) \right]^2$
    
    (c) Use a line search to find the step size $\rho_m$ that minimizes the loss: $\rho_m = \mathop{\arg\min}\limits_{\rho} \sum\limits_{i=1}^{N} L(y_i,f_{m-1}(x_i) + \rho h_m(x_i\,;\,w_m))$
    
    (d) $f_m(x) = f_{m-1}(x) + \rho_m h_m(x\,;\,w_m)$
  3. Output \(f_M(x)\)
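
As a quick check of step (a): for squared loss \(L = \frac{1}{2}(y_i - f(x_i))^2\) the initial value \(f_0(x)\) is simply the mean of \(y\) and the negative gradient is the current residual, while for the logistic loss used below it is a smooth function of the margin:

$\tilde{y}_i = y_i - f_{m-1}(x_i) \;\;\text{(squared loss)}, \qquad \tilde{y}_i = \frac{2 y_i}{1 + e^{\,2 y_i f_{m-1}(x_i)}} \;\;\text{(logistic loss, } L = \log(1 + e^{-2 y_i f_{m-1}(x_i)})\text{)}$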

  • The implementation also covers early stopping, regression, classification, and staged prediction (stage_predict, see the full code).

  • Gradient Boosting generally uses an initial value, i.e. \(f_0(x)\) in step 1 above. In the implementation this initial value must not be multiplied by the learning rate, because doing so effectively changes the initial value and can produce unexpected results (unfortunately I have made this mistake myself ~).
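
    In other words, with learning rate \(\nu\) the final model should be accumulated as $f_M(x) = f_0(x) + \nu \sum\limits_{m=1}^{M} \rho_m h_m(x\,;\,w_m)$, with \(f_0(x)\) left unscaled. This is exactly how the predict method below stacks the unscaled initial prediction on top of the scaled base-learner contributions (the implementation skips the explicit line-search step and relies on the fixed learning rate instead).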

import numpy as np
from scipy import stats
from sklearn import datasets
from sklearn.base import clone
from sklearn.metrics import mean_squared_error, zero_one_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# First define the loss functions: squared loss and Huber loss for regression;
# logistic loss and modified Huber loss for classification.
def SquaredLoss_NegGradient(y_pred, y):
    return y - y_pred

def Huberloss_NegGradient(y_pred, y, alpha):
    diff = y - y_pred
    delta = stats.scoreatpercentile(np.abs(diff), alpha * 100)  # alpha-quantile of |residual| as the Huber threshold
    g = np.where(np.abs(diff) > delta, delta * np.sign(diff), diff)  # clip large residuals to delta * sign(residual)
    return g

def logistic(p):
    return 1 / (1 + np.exp(-2 * p))

def LogisticLoss_NegGradient(y_pred, y):
    g = 2 * y / (1 + np.exp(2 * y * y_pred))  # negative gradient of logistic_loss = log(1+exp(-2*y*y_pred))
    return g

def modified_huber(p):
    return (np.clip(p, -1, 1) + 1) / 2

def Modified_Huber_NegGradient(y_pred, y):
    margin = y * y_pred
    g = np.where(margin >= 1, 0, np.where(margin >= -1, y * 2 * (1-margin), 4 * y))
    # modified_huber_loss = np.where(margin >= -1, max(0, (1-margin)^2), -4 * margin)
    return g


class GradientBoosting(object):
    def __init__(self, M, base_learner, learning_rate=1.0, method="regression", tol=None, subsample=None,
                 loss="square", alpha=0.9):
        self.M = M
        self.base_learner = base_learner
        self.learning_rate = learning_rate
        self.method = method
        self.tol = tol
        self.subsample = subsample
        self.loss = loss
        self.alpha = alpha

    def fit(self, X, y):
        # tol is the early-stopping threshold; if early stopping is used, split a validation set off the training set
        if self.tol is not None:
            X, X_val, y, y_val = train_test_split(X, y, random_state=2)
            former_loss = float("inf")
            count = 0
            tol_init = self.tol

        init_learner = self.base_learner
        y_pred = init_learner.fit(X, y).predict(X)   # initial value f_0(x)
        self.base_learner_total = [init_learner]
        for m in range(self.M):

            if self.subsample is not None:  # subsample
                sample = np.random.choice(len(X), int(self.subsample * len(X)), replace=False)
                X_s, y_s, y_pred_s = X[sample], y[sample], y_pred[sample]
            else:
                X_s, y_s, y_pred_s = X, y, y_pred

            # compute the negative gradient
            if self.method == "regression":
                if self.loss == "square":
                    response = SquaredLoss_NegGradient(y_pred_s, y_s)
                elif self.loss == "huber":
                    response = Huberloss_NegGradient(y_pred_s, y_s, self.alpha)
            elif self.method == "classification":
                if self.loss == "logistic":
                    response = LogisticLoss_NegGradient(y_pred_s, y_s)
                elif self.loss == "modified_huber":
                    response = Modified_Huber_NegGradient(y_pred_s, y_s)

            base_learner = clone(self.base_learner)
            y_pred += base_learner.fit(X_s, response).predict(X) * self.learning_rate
            self.base_learner_total.append(base_learner)

            # early stopping: every 10 rounds after round 300, evaluate the staged ensemble on the validation set
            if m % 10 == 0 and m > 300 and self.tol is not None:
                p = np.array([self.base_learner_total[k].predict(X_val) * self.learning_rate
                              for k in range(1, m + 1)])
                p = np.vstack((self.base_learner_total[0].predict(X_val), p))
                stage_pred = np.sum(p, axis=0)
                if self.method == "regression":
                    later_loss = np.sqrt(mean_squared_error(stage_pred, y_val))
                if self.method == "classification":
                    stage_pred = np.where(logistic(stage_pred) >= 0.5, 1, -1)  
                    later_loss = zero_one_loss(stage_pred, y_val)

                if later_loss > (former_loss + self.tol):
                    count += 1
                    self.tol = self.tol / 2  
                    print(self.tol)          
                else:
                    count = 0
                    self.tol = tol_init
                    
                if count == 2:
                    self.M = m - 20
                    print("early stopping in round {}, best round is {}, M = {}".format(m, m - 20, self.M))
                    break
                former_loss = later_loss

        return self

    def predict(self, X):
        pred = np.array([self.base_learner_total[m].predict(X) * self.learning_rate for m in range(1, self.M + 1)])
        pred = np.vstack((self.base_learner_total[0].predict(X), pred))    # initial value + each base learner's contribution
        if self.method == "regression":
            pred_final = np.sum(pred, axis=0)
        elif self.method == "classification":
            if self.loss == "modified_huber":
                p = np.sum(pred, axis=0)
                pred_final = np.where(modified_huber(p) >= 0.5, 1, -1)
            elif self.loss == "logistic":
                p = np.sum(pred, axis=0)
                pred_final = np.where(logistic(p) >= 0.5, 1, -1)
        return pred_final
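
    def stage_predict(self, X):
        # Hypothetical sketch: the author's stage_predict lives in the full code and is not shown here.
        # Yields the cumulative raw prediction after the initial learner and after each boosting round;
        # for classification these are raw scores that would still need the logistic / modified_huber
        # transform and thresholding used in predict above.
        pred = self.base_learner_total[0].predict(X)
        yield pred
        for m in range(1, self.M + 1):
            pred = pred + self.base_learner_total[m].predict(X) * self.learning_rate
            yield pred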


class GBRegression(GradientBoosting):
    def __init__(self, M, base_learner, learning_rate, method="regression", loss="square",tol=None, subsample=None, alpha=0.9):
        super(GBRegression, self).__init__(M=M, base_learner=base_learner, learning_rate=learning_rate, method=method, 
                                            loss=loss, tol=tol, subsample=subsample, alpha=alpha)

class GBClassification(GradientBoosting):
    def __init__(self, M, base_learner, learning_rate, method="classification", loss="logistic", tol=None, subsample=None):
        super(GBClassification, self).__init__(M=M, base_learner=base_learner, learning_rate=learning_rate, method=method,
                                                loss=loss, tol=tol, subsample=subsample)


if __name__ == "__main__":
    # create datasets for testing
    X, y = datasets.make_regression(n_samples=20000, n_features=10, n_informative=4, noise=1.1, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = GBRegression(M=1000, base_learner=DecisionTreeRegressor(max_depth=2, random_state=1), learning_rate=0.1,
                         loss="huber")
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print('RMSE: ', rmse)

    X, y = datasets.make_classification(n_samples=20000, n_features=10, n_informative=4, flip_y=0.1, 
                                    n_clusters_per_class=1, n_classes=2, random_state=1)
    y[y==0] = -1
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model = GBClassification(M=1000, base_learner=DecisionTreeRegressor(max_depth=1, random_state=1), learning_rate=1.0,
                             method="classification", loss="logistic")
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = np.zeros(pred.shape)
    acc[np.where(pred == y_test)] = 1
    accuracy = np.sum(acc) / len(pred)
    print('accuracy logistic score: ', accuracy)

    model = GBClassification(M=1000, base_learner=DecisionTreeRegressor(max_depth=1, random_state=1), learning_rate=1.0,
                             method="classification", loss="modified_huber")
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = np.zeros(pred.shape)
    acc[np.where(pred == y_test)] = 1
    accuracy = np.sum(acc) / len(pred)
    print('accuracy modified_huber score: ', accuracy)

Output:

RMSE:  8.454462867923157
accuracy logistic score:  0.9434
accuracy modified_huber score:  0.9402
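
The runs above use the default tol=None, so early stopping is switched off. A minimal sketch of turning it on together with subsampling, reusing the regression split from the first example (tol=1.0 and subsample=0.8 are illustrative assumptions, not tuned values):

model = GBRegression(M=2000, base_learner=DecisionTreeRegressor(max_depth=2, random_state=1),
                     learning_rate=0.1, loss="huber", tol=1.0, subsample=0.8)
model.fit(X_train, y_train)              # splits off a validation set internally and may stop before 2000 rounds
print("rounds actually used:", model.M)  # reduced if early stopping was triggered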


Regression:

X, y = datasets.make_regression(n_samples=20000, n_features=20, n_informative=10, noise=100, random_state=1)  # dataset

The figure below compares squared loss and Huber loss on the regression problem, along with their respective early stopping points:
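
A rough sketch of how such a comparison can be run with the classes above (tol=1.0 is an illustrative choice, not necessarily the value behind the original figure):

X, y = datasets.make_regression(n_samples=20000, n_features=20, n_informative=10, noise=100, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
for loss in ("square", "huber"):
    model = GBRegression(M=1000, base_learner=DecisionTreeRegressor(max_depth=2, random_state=1),
                         learning_rate=0.1, loss=loss, tol=1.0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(loss, "RMSE:", np.sqrt(mean_squared_error(y_test, pred)))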


Classification:

For classification, the AdaBoost from the previous post is compared with the GBDT from this post, still using the earlier dataset; GBDT is run with both logistic loss and the modified huber loss mentioned at the end of that article:


Next, switch to a noisier dataset and reduce it to two dimensions with PCA for visualization:

X, y = datasets.make_classification(n_samples=20000, n_features=10, n_informative=4, flip_y=0.3, n_clusters_per_class=1, n_classes=2, random_state=1)
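
A minimal sketch of that 2-D view (PCA and the scatter plot are standard scikit-learn / matplotlib calls, added here only for illustration):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X_2d = PCA(n_components=2).fit_transform(X)   # project the 10 features onto the first two principal components
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="coolwarm", alpha=0.5)
plt.title("make_classification with flip_y=0.3, PCA projection")
plt.show()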


This time the modified huber loss performs better than the logistic loss, but both fall short of Real AdaBoost.




