Paper: Translation and Interpretation of "CatBoost: unbiased boosting with categorical features" (Part 2)

Summary: A translation and interpretation of the paper "CatBoost: unbiased boosting with categorical features" (Part 2).

Related work on prediction shift


The shift of the gradient conditional distribution g^t(x_k, y_k) | x_k was previously mentioned in papers on boosting [3, 13] but was not formally defined. Moreover, even the existence of a non-zero shift was not proved theoretically. Based on the out-of-bag estimation [2], Breiman proposed iterated bagging [3], which constructs a bagged weak learner at each iteration on the basis of "out-of-bag" residual estimates. However, as we formally show in Appendix E, such residual estimates are still shifted. Besides, the bagging scheme increases the learning time by a factor of the number of data buckets. Subsampling of the dataset at each iteration, proposed by Friedman [13], addresses the problem much more heuristically and also only alleviates it.


Analysis of prediction shift


We formally analyze the problem of prediction shift in a simple case of a regression task with the quadratic loss function L(y, ŷ) = (y − ŷ)^2. In this case, the negative gradient −g^t(x_k, y_k) in Equation (6) can be substituted by the residual function r^{t−1}(x_k, y_k) := y_k − F^{t−1}(x_k). Assume we have m = 2 features x^1, x^2 that are i.i.d. Bernoulli random variables with p = 1/2, and y = f*(x) = c_1 x^1 + c_2 x^2. Assume we make N = 2 steps of gradient boosting with decision stumps (trees of depth 1) and step size α = 1. We obtain a model F = F^2 = h^1 + h^2. W.l.o.g., we assume that h^1 is based on x^1 and h^2 is based on x^2, which is typical for |c_1| > |c_2| (here we set some asymmetry between x^1 and x^2).


Theorem 1


Theorem 1. 1. If two independent samples D_1 and D_2 of size n are used to estimate h^1 and h^2, respectively, using Equation (6), then E_{D_1,D_2} F^2(x) = f*(x) + O(1/2^n) for any x ∈ {0, 1}^2. 2. If the same dataset D = D_1 = D_2 is used in Equation (6) for both h^1 and h^2, then E_D F^2(x) = f*(x) − (1/(n−1)) c_2 (x^2 − 1/2) + O(1/2^n).

This theorem means that the trained model is an unbiased estimate of the true dependence y = f*(x) when we use independent datasets at each gradient step. On the other hand, if we use the same dataset at each step, we suffer from a bias −(1/(n−1)) c_2 (x^2 − 1/2), which is inversely proportional to the data size n. Also, the value of the bias can depend on the relation f*: in our example, it is proportional to c_2. We track the chain of shifts for the second part of Theorem 1 in a sketch of the proof below, while the full proof of Theorem 1 is available in Appendix A.

Sketch of the proof. Denote by ξ_{st}, s, t ∈ {0, 1}, the number of examples (x_k, y_k) ∈ D with x_k = (s, t). We have h^1(s, t) = c_1 s + c_2 ξ_{s1} / (ξ_{s0} + ξ_{s1}). Its expectation E(h^1(x)) on a test example x equals c_1 x^1 + c_2 / 2. At the same time, the expectation E(h^1(x_k)) on a training example x_k is different and equals (c_1 x^1 + c_2 / 2) − c_2 (2 x^2 − 1) / n + O(2^{−n}). That is, we experience a prediction shift of h^1. As a consequence, the expected value of h^2(x) on a test example x is E(h^2(x)) = c_2 (x^2 − 1/2)(1 − 1/(n−1)) + O(2^{−n}), and E(h^1(x) + h^2(x)) = f*(x) − (1/(n−1)) c_2 (x^2 − 1/2) + O(1/2^n).

Finally, recall that the greedy TS x̂^i can be considered as a simple statistical model predicting the target y, and it suffers from a similar problem, a conditional shift of x̂^i_k | y_k, caused by target leakage, i.e., using y_k to compute x̂^i_k.
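As a sanity check on the second part of Theorem 1, the following Python sketch (numpy only; the coefficients, the data size n = 20 and the number of Monte-Carlo trials are illustrative choices, not values from the paper) fits both stumps on the same dataset and compares the empirical bias of F^2 with the predicted term −c_2(x^2 − 1/2)/(n − 1):

```python
import numpy as np

# Monte-Carlo check of Theorem 1, part 2: two boosting steps with decision stumps,
# both fitted on the SAME dataset D, so the second stump sees shifted residuals.
rng = np.random.default_rng(0)
c1, c2, n, trials = 3.0, 1.0, 20, 100_000          # illustrative values, not from the paper

def stump(feature, target):
    """Depth-1 regression tree on a binary feature: one leaf value per feature value."""
    leaf = np.empty(2)
    for v in (0, 1):
        mask = feature == v
        leaf[v] = target[mask].mean() if mask.any() else target.mean()
    return leaf

test_x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
avg_pred = np.zeros(len(test_x))
for _ in range(trials):
    x = rng.integers(0, 2, size=(n, 2))             # x^1, x^2 ~ i.i.d. Bernoulli(1/2)
    y = c1 * x[:, 0] + c2 * x[:, 1]                 # y = f*(x)
    h1 = stump(x[:, 0], y)                          # h^1 splits on x^1
    h2 = stump(x[:, 1], y - h1[x[:, 0]])            # h^2 fits residuals of h^1 on the same D
    avg_pred += h1[test_x[:, 0]] + h2[test_x[:, 1]]
avg_pred /= trials

f_star = c1 * test_x[:, 0] + c2 * test_x[:, 1]
print("empirical bias:  ", np.round(avg_pred - f_star, 4))
print("theoretical bias:", np.round(-c2 * (test_x[:, 1] - 0.5) / (n - 1), 4))
```

Up to Monte-Carlo noise and the O(2^{−n}) term, the two printed rows should agree, which is the shift the theorem describes.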


4.2 Ordered boosting


Here we propose a boosting algorithm which does not suffer from the prediction shift problem described in Section 4.1. Assuming access to an unlimited amount of training data, we can easily construct such an algorithm. At each step of boosting, we sample a new dataset D_t independently and obtain unshifted residuals by applying the current model to new training examples. In practice, however, labeled data is limited. Assume that we learn a model with I trees. To make the residual r^{I−1}(x_k, y_k) unshifted, we need to have F^{I−1} trained without the example x_k. Since we need unbiased residuals for all training examples, no examples may be used for training F^{I−1}, which at first glance makes the training process impossible. However, it is possible to maintain a set of models differing by the examples used for their training. Then, for calculating the residual on an example, we use a model trained without it. In order to construct such a set of models, we can use the ordering principle previously applied to TS in Section 3.2. To illustrate the idea, assume that we take one random permutation σ of the training examples and maintain n different supporting models M_1, …, M_n such that the model M_i is learned using only the first i examples in the permutation. At each step, in order to obtain the residual for the j-th sample, we use the model M_{j−1} (see Figure 1). The resulting Algorithm 1 is called ordered boosting below. Unfortunately, this algorithm is not feasible in most practical tasks due to the need to train n different models, which increases the complexity and memory requirements by n times. In CatBoost, we implemented a modification of this algorithm on the basis of the gradient boosting algorithm with decision trees as base predictors (GBDT), described in Section 5.
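Below is a minimal, quadratic-time sketch of this ordered boosting scheme (in the spirit of Algorithm 1), assuming scikit-learn decision stumps as the weak learner. It keeps all n supporting models explicitly, which is exactly the n-fold cost that the practical implementation in Section 5 avoids:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # assumed weak learner for this sketch

def ordered_boosting_naive(X, y, n_iters=10, lr=0.1, seed=0):
    """Naive Algorithm-1-style ordered boosting.

    preds[i] plays the role of the supporting model M_i evaluated on all examples:
    it is built only from the first i examples of the permutation, so the residual of
    the example at position i is always taken from a model that never saw its target.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    sigma = rng.permutation(n)                    # sigma[pos] = index of the example at position pos
    preds = np.zeros((n + 1, n))                  # preds[i, k] = M_i(x_k); M_0 is identically 0
    for _ in range(n_iters):
        resid = np.empty(n)
        for pos, k in enumerate(sigma):
            resid[k] = y[k] - preds[pos, k]       # residual w.r.t. M_{sigma(k)-1}
        for i in range(1, n + 1):                 # update every supporting model M_i
            idx = sigma[:i]                       # the first i examples in the permutation
            weak = DecisionTreeRegressor(max_depth=1).fit(X[idx], resid[idx])
            preds[i] += lr * weak.predict(X)
    return preds[n]                               # predictions of the final model M_n

# usage (illustrative data): the quadratic number of weak-learner fits is what makes
# this direct form infeasible on real datasets
X = np.random.rand(64, 3)
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * np.random.randn(64)
print(ordered_boosting_naive(X, y)[:5])
```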

Ordered boosting with categorical features. In Sections 3.2 and 4.2 we proposed to use random permutations σ_cat and σ_boost of the training examples for the TS calculation and for ordered boosting, respectively. Combining them in one algorithm, we should take σ_cat = σ_boost to avoid prediction shift. This guarantees that the target y_i is not used for training M_i (neither for the TS calculation, nor for the gradient estimation). See Appendix F for theoretical guarantees. Empirical results confirming the importance of having σ_cat = σ_boost are presented in Appendix G.
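For completeness, here is a sketch of the ordered TS computation that can share its permutation with the boosting step above (the helper name and the smoothing weight a are assumptions for illustration; the statistic follows the smoothed-mean form used for TS in Section 3.2):

```python
import numpy as np

def ordered_ts(cat_column, y, sigma, prior, a=1.0):
    """Ordered TS sketch: the example at position p in the permutation `sigma` is encoded
    using only the targets of examples at positions < p (its "history"), smoothed towards
    `prior` with weight `a`. Passing the same sigma here and to the ordered-boosting
    routine corresponds to the choice sigma_cat = sigma_boost."""
    n = len(y)
    ts = np.empty(n, dtype=float)
    count, target_sum = {}, {}
    for pos in range(n):
        k = sigma[pos]
        c = cat_column[k]
        cnt = count.get(c, 0)
        s = target_sum.get(c, 0.0)
        ts[k] = (s + a * prior) / (cnt + a)       # history-only statistic: y[k] is not used
        count[c] = cnt + 1
        target_sum[c] = s + y[k]
    return ts

# usage (illustrative): one shared permutation for both the TS and the residual ordering
rng = np.random.default_rng(0)
cats = np.array(["a", "b", "a", "a", "b", "c"])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
sigma = rng.permutation(len(y))
print(np.round(ordered_ts(cats, y, sigma, prior=y.mean()), 3))
```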



5 Practical implementation of ordered boosting


CatBoost has two boosting modes, Ordered and Plain. The latter mode is the standard GBDT algorithm with inbuilt ordered TS. The former mode is an efficient modification of Algorithm 1. A formal description of the algorithm is included in Appendix B. In this section, we overview the most important implementation details.

At the start, CatBoost generates s + 1 independent random permutations of the training dataset. The permutations σ_1, …, σ_s are used for evaluation of the splits that define tree structures (i.e., the internal nodes), while σ_0 serves for choosing the leaf values b_j of the obtained trees (see Equation (3)). For examples with a short history in a given permutation, both the TS and the predictions used by ordered boosting (M_{σ(i)−1}(x_i) in Algorithm 1) have a high variance. Therefore, using only one permutation may increase the variance of the final model predictions, while several permutations allow us to reduce this effect in a way we describe below. The advantage of several permutations is confirmed by our experiments in Section 6.


Building a tree


In CatBoost, base predictors are oblivious decision trees [9, 14], also called decision tables [23]. The term oblivious means that the same splitting criterion is used across an entire level of the tree. Such trees are balanced, less prone to overfitting, and allow speeding up execution significantly at testing time. The procedure of building a tree in CatBoost is described in Algorithm 2.
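As a minimal illustration of why oblivious trees can be applied so quickly, the sketch below (with illustrative splits and leaf values, not CatBoost's API) evaluates a depth-d decision table by packing the d shared split outcomes into a leaf index:

```python
import numpy as np

def oblivious_tree_predict(X, splits, leaf_values):
    """Evaluate an oblivious tree (decision table): every level applies the same
    (feature, threshold) split, so a depth-d tree is d splits plus 2**d leaf values,
    and the leaf index of an example is the integer formed by its d split outcomes."""
    leaf_idx = np.zeros(X.shape[0], dtype=int)
    for level, (feature, threshold) in enumerate(splits):
        bit = (X[:, feature] > threshold).astype(int)
        leaf_idx |= bit << level
    return leaf_values[leaf_idx]

# usage with made-up numbers: a depth-2 table over features 0 and 3
X = np.random.rand(5, 4)
splits = [(0, 0.5), (3, 0.2)]
leaf_values = np.array([0.1, -0.3, 0.7, 0.0])     # 2**2 = 4 leaves
print(oblivious_tree_predict(X, splits, leaf_values))
```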

In the Ordered boosting mode, during the learning process, we maintain the supporting models M_{r,j}, where M_{r,j}(i) is the current prediction for the i-th example based on the first j examples in the permutation σ_r. At each iteration t of the algorithm, we sample a random permutation σ_r from {σ_1, …, σ_s} and construct a tree T_t on the basis of it. First, for categorical features, all TS are computed according to this permutation. Second, the permutation affects the tree learning procedure.


Namely, based on M_{r,j}(i), we compute the corresponding gradients grad_{r,j}(i) = ∂L(y_i, s)/∂s evaluated at s = M_{r,j}(i). Then, while constructing a tree, we approximate the gradient G in terms of the cosine similarity cos(·, ·), where, for each example i, we take the gradient grad_{r,σ_r(i)−1}(i) (it is based only on the previous examples in σ_r). At the candidate splits evaluation step, the leaf value Δ(i) for example i is obtained individually by averaging the gradients grad_{r,σ_r(i)−1} of the preceding examples p lying in the same leaf leaf_r(i) the example i belongs to. Note that leaf_r(i) depends on the chosen permutation σ_r, because σ_r can influence the values of the ordered TS for example i. When the tree structure T_t (i.e., the sequence of splitting attributes) is built, we use it to boost all the models M_{r′,j}. Let us stress that one common tree structure T_t is used for all the models, but this tree is added to different M_{r′,j} with different sets of leaf values depending on r′ and j, as described in Algorithm 2.

The Plain boosting mode works similarly to a standard GBDT procedure, but, if categorical features are present, it maintains s supporting models M_r corresponding to TS based on σ_1, …, σ_s.
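The following sketch illustrates, under simplifying assumptions (plain Python dictionaries instead of CatBoost's internal structures, and hypothetical argument names), how a candidate split could be scored in Ordered mode: Δ(i) is the running average of the gradients of the preceding examples in the permutation that fall into the same candidate leaf, and the candidate is scored by the cosine similarity between Δ and the gradient vector G:

```python
import numpy as np

def ordered_split_score(grad, leaf_idx, positions):
    """Score one candidate split: Delta[i] is the average gradient of the examples that
    precede i in the permutation AND fall into the same candidate leaf; the score is
    cos(Delta, grad).

    grad[i]      -- gradient of example i from a supporting model that excludes i
    leaf_idx[i]  -- leaf of example i under the candidate split
    positions[i] -- position of example i in the permutation sigma_r
    """
    n = len(grad)
    delta = np.zeros(n)
    running_sum, running_cnt = {}, {}
    for i in np.argsort(positions):               # walk the examples in permutation order
        leaf = leaf_idx[i]
        cnt = running_cnt.get(leaf, 0)
        delta[i] = running_sum.get(leaf, 0.0) / cnt if cnt else 0.0
        running_cnt[leaf] = cnt + 1
        running_sum[leaf] = running_sum.get(leaf, 0.0) + grad[i]
    return float(delta @ grad / (np.linalg.norm(delta) * np.linalg.norm(grad) + 1e-12))

# usage (toy values): the candidate with the largest score would be chosen
grad = np.array([0.4, -0.2, 0.1, 0.5, -0.3])
leaf_idx = np.array([0, 1, 0, 1, 1])
positions = np.array([2, 0, 4, 1, 3])
print(ordered_split_score(grad, leaf_idx, positions))
```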


Choosing leaf values


Given all the trees constructed, the leaf values of the final model F are calculated by the standard gradient boosting procedure, equally for both modes. Training examples i are matched to leaves leaf_0(i), i.e., we use the permutation σ_0 to calculate TS here. When the final model F is applied to a new example at testing time, we use TS calculated on the whole training data according to Section 3.2.


Complexity


In our practical implementation, we use one important trick which significantly reduces the computational complexity of the algorithm. Namely, in the Ordered mode, instead of O(s n^2) values M_{r,j}(i), we store and update only the values M′_{r,j}(i) := M_{r,2^j}(i) for j = 1, …, ⌈log_2 n⌉ and all i with σ_r(i) ≤ 2^{j+1}, which reduces the number of maintained supporting predictions to O(s n). See Appendix B for the pseudocode of this modification of Algorithm 2.

In Table 1, we present the computational complexity of the different components of both CatBoost modes per iteration (see Appendix C.1 for the proof). Here N_{TS,t} is the number of TS to be calculated at iteration t, and C is the set of candidate splits to be considered at the given iteration. It follows that our implementation of ordered boosting with decision trees has the same asymptotic complexity as standard GBDT with ordered TS. In comparison with other types of TS (Section 3.2), ordered TS slow down the procedures CalcGradient, updating the supporting models M, and the computation of TS by a factor of s.
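A quick back-of-the-envelope check of the savings from the storage trick above (illustrative numbers only): the number of supporting predictions kept per permutation is the sum over j of min(2^{j+1}, n), which is O(n), versus O(n^2) for the naive scheme:

```python
import math

def stored_predictions(n):
    """Predictions kept per permutation with the trick: for j = 1..ceil(log2 n) store
    M'_{j}(i) := M_{2^j}(i) only for the examples with sigma(i) <= 2^(j+1)."""
    return sum(min(2 ** (j + 1), n) for j in range(1, math.ceil(math.log2(n)) + 1))

for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9}: naive O(n^2) ~ {n * n:>14}, with the trick ~ {stored_predictions(n):>8}")
```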


Feature combinations


Another important detail of CatBoost is using combinations of categorical features as additional categorical features which capture high-order dependencies, like the joint information of user ID and ad topic in the task of ad click prediction. The number of possible combinations grows exponentially with the number of categorical features in the dataset, and it is infeasible to process all of them. CatBoost constructs combinations in a greedy way. Namely, for each split of a tree, CatBoost combines (concatenates) all categorical features (and their combinations) already used for previous splits in the current tree with all categorical features in the dataset. Combinations are converted to TS on the fly.
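A sketch of this greedy combination step is given below (hypothetical function and column names; CatBoost builds these combinations internally rather than as materialized columns):

```python
import numpy as np

def greedy_combinations(columns, used_in_tree, base_cat_names):
    """Sketch of the greedy combination step: every categorical feature (or combination)
    already used for previous splits of the current tree is concatenated with every
    categorical feature of the dataset; the resulting candidate columns would then be
    encoded with ordered TS on the fly.

    columns        -- dict: name -> np.ndarray of category strings (base features and
                      previously built combinations)
    used_in_tree   -- names from `columns` already used for splits of the current tree
    base_cat_names -- names of the original categorical features
    """
    candidates = {}
    for used in used_in_tree:
        for base in base_cat_names:
            if base in used.split("+"):           # skip trivial self-combinations
                continue
            name = f"{used}+{base}"
            if name in columns or name in candidates:
                continue
            # a combined category value is simply the pair of category values
            candidates[name] = np.char.add(np.char.add(columns[used], "|"), columns[base])
    return candidates

# usage (illustrative): after splitting on "user_id", the candidate "user_id+ad_topic"
# can capture the joint effect of user and ad topic mentioned above
cols = {
    "user_id": np.array(["u1", "u2", "u1"]),
    "ad_topic": np.array(["cars", "cars", "toys"]),
}
print(greedy_combinations(cols, ["user_id"], ["user_id", "ad_topic"]))
```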


Other important details


Finally, let us discuss two options of the CatBoost algorithm not covered above. The first one is subsampling of the dataset at each iteration of the boosting procedure, as proposed by Friedman [13]. We claimed earlier in Section 4.1 that this approach alone cannot fully avoid the problem of prediction shift. However, since it has proved effective, we implemented it in both modes of CatBoost as a Bayesian bootstrap procedure. Specifically, before training a tree according to Algorithm 2, we assign a weight w_i = a^t_i to each example i, where the a^t_i are generated according to the Bayesian bootstrap procedure (see [28, Section 2]). These weights are used as multipliers for the gradients grad_r(i) and grad_{r,j}(i) when we calculate Δ(i) and the components of the vector Δ − G that define loss(T_c).
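A minimal sketch of such per-example weights, assuming the generic Bayesian bootstrap of [28] with exponential draws; the temperature exponent is an assumed knob for illustration, not necessarily CatBoost's exact formula:

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_bootstrap_weights(n, temperature=1.0):
    """Per-example Bayesian-bootstrap-style weights (Rubin-style, see [28]): exponential
    draws play the role of Dirichlet gaps. The `temperature` exponent is an assumed
    bagging-intensity knob for this sketch."""
    w = rng.exponential(1.0, size=n)
    return w ** temperature

# these weights would multiply the gradients grad_r(i) and grad_{r,j}(i)
# when computing Delta(i) and the components of Delta - G for loss(T_c)
print(np.round(bayesian_bootstrap_weights(10), 3))
```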

The second option deals with the first several examples in a permutation. For examples i with small values σ_r(i), the variance of grad_{r,σ_r(i)−1}(i) can be high. Therefore, we discard Δ(i) for examples from the beginning of the permutation when we calculate loss(T_c) in Algorithm 2. In particular, we eliminate the corresponding components of the vectors G and Δ when calculating the cosine similarity between them.


6 Experiments


Comparison with baselines


We compare our algorithm with the most popular open-source libraries — XGBoost and LightGBM — on several well-known machine learning tasks. The detailed description of the experimental setup, together with the dataset descriptions, is available in Appendix D. The source code of the experiment is available, and the results can be reproduced. For all learning algorithms, we preprocess categorical features using the ordered TS method described in Section 3.2. Parameter tuning and training were performed on 4/5 of the data, and testing was performed on the remaining 1/5. The results measured by logloss and zero-one loss are presented in Table 2 (the absolute values for the baselines are in Appendix G). For CatBoost, we used the Ordered boosting mode in this experiment. One can see that CatBoost outperforms the other algorithms on all the considered datasets. We also measured the statistical significance of the improvements presented in Table 2: except for three datasets (Appetency, Churn and Upselling), the improvements are statistically significant with p-value < 0.01 as measured by the paired one-tailed t-test.

To demonstrate that our implementation of plain boosting is an appropriate baseline for our research, we show that a raw setting of CatBoost provides state-of-the-art quality. In particular, we take a setting of CatBoost which is close to classical GBDT [12] and compare it with the baseline boosting implementations in Appendix G. Experiments show that this raw setting differs from the baselines insignificantly.

We also empirically analyzed the running times of the algorithms on the Epsilon dataset. The details of the comparison can be found in Appendix C.2. To summarize, CatBoost Plain and LightGBM are the fastest ones, followed by Ordered mode, which is about 1.7 times slower.

Ordered and Plain modes


In this section, we compare the two essential boosting modes of CatBoost: Plain and Ordered. First, we compared their performance on all the considered datasets; the results are presented in Table 3. It can be clearly seen that Ordered mode is particularly useful on small datasets. Indeed, the largest benefit from Ordered is observed on the Adult and Internet datasets, which are relatively small (less than 40K training examples), which supports our hypothesis that a higher bias negatively affects the performance. Indeed, according to Theorem 1 and our reasoning in Section 4.1, the bias is expected to be larger for smaller datasets (however, it can also depend on other properties of the dataset, e.g., on the dependency between features and target). In order to further validate this hypothesis, we made the following experiment: we train CatBoost in Ordered and Plain modes on randomly filtered datasets and compare the obtained losses, see Figure 2. As we expected, for smaller datasets the relative performance of Plain mode becomes worse. To save space, here we present the results only for logloss; the figure for zero-one loss is similar.

We also compare Ordered and Plain modes in the above-mentioned raw setting of CatBoost in Appendix G and conclude that the advantage of Ordered mode is not caused by interaction with specific CatBoost options.

Analysis of target statistics



We compare the different TSs introduced in Section 3.2 as options of CatBoost in Ordered boosting mode, keeping all other algorithmic details the same; the results can be found in Table 4. Here, to save space, we present only the relative increase in loss functions for each algorithm compared to CatBoost with ordered TS. Note that the ordered TS used in CatBoost significantly outperform all other approaches. Also, among the baselines, the holdout TS is the best for most of the datasets, since it does not suffer from the conditional shift discussed in Section 3.2 (P1); still, it is worse than CatBoost due to less effective usage of the training data (P2). Leave-one-out is usually better than the greedy TS, but it can be much worse on some datasets, e.g., on Adult. The reason is that the greedy TS suffer from low-frequency categories, while the leave-one-out TS suffer also from high-frequency ones, and on Adult all the features have high frequency.

Finally, let us note that in Table 4 we combine the Ordered mode of CatBoost with different TSs. To generalize these results, we also made a similar experiment by combining different TS with Plain mode, as used in standard gradient boosting. The obtained results and conclusions turned out to be very similar to the ones discussed above.
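To see why the leave-one-out TS still leaks the target for a high-frequency category (the Adult case above), consider the following tiny sketch with a single category; the smoothing constants are arbitrary:

```python
import numpy as np

def leave_one_out_ts(y, a=1.0, prior=0.5):
    """Leave-one-out TS for one category: exclude the example's own target from the mean.
    The encoding is still a strictly decreasing function of y_k, so for a high-frequency
    category a single threshold on it separates the training targets perfectly."""
    return (y.sum() - y + a * prior) / (len(y) - 1 + a)

y = np.array([0, 1, 0, 1, 1, 0, 1, 1], dtype=float)      # one frequent category
print(np.round(leave_one_out_ts(y), 3))   # smaller values exactly where y_k = 1, i.e. target leakage
```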

Feature combinations


The effect of feature combinations discussed in Section 5 is demonstrated in Figure 3 in Appendix G. On average, changing the number c_max of features allowed to be combined from 1 to 2 provides an outstanding improvement of logloss by 1.86% (reaching 11.3%), changing from 1 to 3 yields 2.04%, and a further increase of c_max does not influence the performance significantly.

Number of permutations


The effect of the number s of permutations on the performance of CatBoost is presented in Figure 4 in Appendix G. On average, increasing s slightly decreases logloss, e.g., by 0.19% for s = 3 and by 0.38% for s = 9, compared to s = 1.



7 Conclusion


In this paper, we identify and analyze the problem of prediction shift present in all existing implementations of gradient boosting. We propose a general solution, ordered boosting with ordered TS, which solves the problem. This idea is implemented in CatBoost, which is a new gradient boosting library. Empirical results demonstrate that CatBoost outperforms leading GBDT packages and leads to new state-of-the-art results on common benchmarks.




Acknowledgments


We are very grateful to Mikhail Bilenko for important references and advice that led to the theoretical analysis of this paper, as well as for suggestions on the presentation. We also thank Pavel Serdyukov for many helpful discussions and valuable links, and Nikita Kazeev, Nikita Dmitriev, Stanislav Kirillov and Victor Omelyanenko for help with the experiments.

