Paper: "A Few Useful Things to Know About Machine Learning", Translation and Commentary (Part 3)

Learn Many Models, Not Just One


In the early days of machine learning, everyone had a favorite learner, together with some a priori reasons to believe in its superiority. Most effort went into trying many variations of it and selecting the best one. Then systematic empirical comparisons showed that the best learner varies from application to application, and systems containing many different learners started to appear. Effort now went into trying many variations of many learners, and still selecting just the best one. But then researchers noticed that if, instead of selecting the best variation found, we combine many variations, the results are better (often much better) and at little extra effort for the user.

Creating such model ensembles is now standard [1]. In the simplest technique, called bagging, we simply generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias. In boosting, training examples have weights, and these are varied so that each new classifier focuses on the examples the previous ones tended to get wrong. In stacking, the outputs of individual classifiers become the inputs of a "higher-level" learner that figures out how best to combine them.
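The bagging recipe described above (resample, learn a classifier on each sample, vote) can be sketched in a few lines. This is a minimal illustration, not the paper's code: the one-dimensional toy data, the threshold "stump" base learner, and the names `fit_stump`, `bagging` and `vote` are all assumptions made for the example.

```python
import random

def fit_stump(xs, ys):
    """Pick the threshold t with the fewest training errors for the
    rule 'predict 1 if x >= t'. Candidate thresholds are the sample values."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x >= t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def vote(thresholds, x):
    """Combine the stumps by simple majority vote."""
    votes = sum(x >= t for t in thresholds)
    return 1 if 2 * votes > len(thresholds) else 0

# Toy data (assumed for illustration): the label is 1 exactly when x >= 0.5.
xs = [i / 100 for i in range(100)]
ys = [1 if x >= 0.5 else 0 for x in xs]

models = bagging(xs, ys)
accuracy = sum(vote(models, x) == y for x, y in zip(xs, ys)) / len(xs)
print(f"{len(models)} stumps, voted training accuracy = {accuracy:.2f}")
```

Each stump sees a slightly different training set, so the individual thresholds scatter around the true boundary; voting averages that scatter out, which is the variance reduction the text refers to.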


Many other techniques exist, and the trend is toward larger and larger ensembles. In the Netflix prize, teams from all over the world competed to build the best video recommender system (http://netflixprize.com). As the competition progressed, teams found they obtained the best results by combining their learners with other teams', and merged into larger and larger teams. The winner and runner-up were both stacked ensembles of over 100 learners, and combining the two ensembles further improved the results. Doubtless we will see even larger ones in the future.

Model ensembles should not be confused with Bayesian model averaging (BMA), the theoretically optimal approach to learning [4]. In BMA, predictions on new examples are made by averaging the individual predictions of all classifiers in the hypothesis space, weighted by how well the classifiers explain the training data and how much we believe in them a priori. Despite their superficial similarities, ensembles and BMA are very different. Ensembles change the hypothesis space (for example, from single decision trees to linear combinations of them), and can take a wide variety of forms. BMA assigns weights to the hypotheses in the original space according to a fixed formula. BMA weights are extremely different from those produced by (say) bagging or boosting: the latter are fairly even, while the former are extremely skewed, to the point where the single highest-weight classifier usually dominates, making BMA effectively equivalent to just selecting it [8]. A practical consequence of this is that, while model ensembles are a key part of the machine learning toolkit, BMA is seldom worth the trouble.
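To see why BMA weights end up so skewed, it helps to compute them for a handful of classifiers. The sketch below assumes a simple label-noise likelihood (each training label is wrong with probability `noise`, independently) and a uniform prior; the error counts and the function name `bma_weights` are made up for illustration and are not taken from the paper.

```python
import math

def bma_weights(n_errors, n_examples, noise=0.1, priors=None):
    """Posterior weight of each classifier: prior * likelihood, normalized.

    Assumed likelihood: a classifier that makes e training errors out of n
    has likelihood (1 - noise)^(n - e) * noise^e.
    """
    k = len(n_errors)
    priors = priors or [1.0 / k] * k
    log_post = [
        math.log(p) + (n_examples - e) * math.log(1 - noise) + e * math.log(noise)
        for p, e in zip(priors, n_errors)
    ]
    m = max(log_post)  # subtract the maximum before exponentiating, for stability
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Four hypothetical classifiers with 10, 14, 20 and 30 errors on 200 examples.
weights = bma_weights([10, 14, 20, 30], n_examples=200)
print([f"{w:.6f}" for w in weights])
```

Under these assumptions, a gap of only four training errors between the best and second-best classifier already translates into a weight ratio of 9^4 to 1, so the best classifier carries almost all of the weight. A bagged ensemble, by contrast, gives every member one equal vote.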




Simplicity Does Not Imply Accuracy



Occam's razor famously states that entities should not be multiplied beyond necessity. In machine learning, this is often taken to mean that, given two classifiers with the same training error, the simpler of the two will likely have the lower test error. Purported proofs of this claim appear regularly in the literature, but in fact there are many counterexamples to it, and the "no free lunch" theorems imply it cannot be true.

We saw one counterexample previously: model ensembles. The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero. Another counterexample is support vector machines, which can effectively have an infinite number of parameters without overfitting. Conversely, the function sign(sin(ax)) can discriminate an arbitrarily large, arbitrarily labeled set of points on the x axis, even though it has only one parameter [23]. Thus, contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.
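The sign(sin(ax)) claim can be checked numerically. The sketch below uses a commonly cited construction (an assumption here, since the text does not spell one out): place the points at x_i = 10^-i and set a = pi * (1 + sum over i of (1 - y_i)/2 * 10^i), where y_i in {-1, +1} is the desired label of x_i. The names `fit_a` and `predict` are illustrative.

```python
import itertools
import math

def fit_a(labels):
    """Choose the single parameter a so that sign(sin(a * 10**-i)) equals labels[i-1]."""
    return math.pi * (1 + sum((1 - y) // 2 * 10 ** i
                              for i, y in enumerate(labels, start=1)))

def predict(a, x):
    """The one-parameter classifier sign(sin(a * x))."""
    return 1 if math.sin(a * x) > 0 else -1

n = 6  # kept small so floating-point error stays far below the decision margin
points = [10.0 ** -i for i in range(1, n + 1)]

# Try every one of the 2**n possible labelings of the n points.
all_fit = all(
    [predict(fit_a(labels), x) for x in points] == list(labels)
    for labels in itertools.product([-1, 1], repeat=n)
)
print("every labeling reproduced:", all_fit)
```

With one tunable number the classifier reproduces all 64 labelings of these six points, so counting parameters says little about how flexible a model really is.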

A more sophisticated view instead equates complexity with the size of the hypothesis space, on the basis that smaller spaces allow hypotheses to be represented by shorter codes. Bounds like the one in the section on theoretical guarantees might then be viewed as implying that shorter hypotheses generalize better. This can be further refined by assigning shorter codes to the hypotheses in the space we have some a priori preference for. But viewing this as "proof" of a trade-off between accuracy and simplicity is circular reasoning: we made the hypotheses we prefer simpler by design, and if they are accurate it is because our preferences are accurate, not because the hypotheses are "simple" in the representation we chose.

A further complication arises from the fact that few learners search their hypothesis space exhaustively. A learner with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses from a smaller space. As Pearl [18] points out, the size of the hypothesis space is only a rough guide to what really matters for relating training and test error: the procedure by which a hypothesis is chosen.

Domingos [7] surveys the main arguments and evidence on the issue of Occam's razor in machine learning. The conclusion is that simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy. This is probably what Occam meant in the first place.



Representable Does Not Imply Learnable


Essentially all representations used in variable-size learners have associated theorems of the form "Every function can be represented, or approximated arbitrarily closely, using this representation." Reassured by this, fans of the representation often proceed to ignore all others. However, just because a function can be represented does not mean it can be learned. For example, standard decision tree learners cannot learn trees with more leaves than there are training examples. In continuous spaces, representing even simple functions using a fixed set of primitives often requires an infinite number of components. Further, if the hypothesis space has many local optima of the evaluation function, as is often the case, the learner may not find the true function even if it is representable. Given finite data, time and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets are different for learners with different representations. Therefore the key question is not "Can it be represented?", to which the answer is often trivial, but "Can it be learned?" And it pays to try different learners (and possibly combine them).

Some representations are exponentially more compact than others for some functions. As a result, they may also require exponentially less data to learn those functions. Many learners work by forming linear combinations of simple basis functions. For example, support vector machines form combinations of kernels centered at some of the training examples (the support vectors). Representing parity of n bits in this way requires 2^n basis functions. But using a representation with more layers (that is, more steps between input and output), parity can be encoded in a linear-size classifier. Finding methods to learn these deeper representations is one of the major research frontiers in machine learning [2].
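The gap between flat and layered representations of parity can be made concrete with Boolean formulas. Note the swap: this sketch does not reproduce the kernel-machine result quoted above; it uses a flat OR-of-ANDs table as a stand-in for "one basis function per input pattern" and a chain of XORs as the deeper representation. The function names are illustrative.

```python
from functools import reduce
from itertools import product
from operator import xor

def parity_deep(bits):
    """Layered representation: a chain of n - 1 XOR steps."""
    return reduce(xor, bits)

def odd_minterms(n):
    """Flat representation: one term for every input pattern with odd parity."""
    return [bits for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1]

def parity_flat(bits, minterms):
    """Output 1 if the input matches any stored term exactly (an OR of ANDs)."""
    return int(any(tuple(bits) == term for term in minterms))

for n in range(2, 9):
    terms = odd_minterms(n)
    agree = all(parity_flat(b, terms) == parity_deep(b)
                for b in product([0, 1], repeat=n))
    print(f"n={n}: flat terms={len(terms)}, deep XOR steps={n - 1}, agree={agree}")
```

The flat table doubles in size with every added bit (2^(n-1) terms), while the chained version grows by one step per bit, which is the kind of exponential-versus-linear contrast the paragraph describes.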



Correlation Does Not Imply Causation


The point that correlation does not imply causation is made so often that it is perhaps not worth belaboring. But, even though learners of the kind we have been discussing can only learn correlations, their results are often treated as representing causal relations. Isn't this wrong? If so, then why do people do it?

More often than not, the goal of learning predictive models is to use them as guides to action. If we find that beer and diapers are often bought together at the supermarket, then perhaps putting beer next to the diaper section will increase sales. (This is a famous example in the world of data mining.) But short of actually doing the experiment it is difficult to tell. Machine learning is usually applied to observational data, where the predictive variables are not under the control of the learner, as opposed to experimental data, where they are. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted [19]. On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation (for example, trying to understand what the causal chain might be).


Many researchers believe that causality is only a convenient fiction. For example, there is no notion of causality in physical laws. Whether or not causality really exists is a deep philosophical question with no definitive answer in sight, but there are two practical points for machine learners. First, whether or not we call them "causal," we would like to predict the effects of our actions, not just correlations between observable variables. Second, if you can obtain experimental data (for example by randomly assigning visitors to different versions of a Web site), then by all means do so [14].
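Random assignment of visitors, as suggested above, takes very little code. The following is a minimal sketch under stated assumptions: the variant names "A" and "B", the visitor identifiers, and the functions `assign_variants` and `estimated_effect` are hypothetical, and the four outcomes at the end are hand-made toy values, not real measurements.

```python
import random

def assign_variants(visitor_ids, variants=("A", "B"), seed=0):
    """Assign each visitor to a site version at random, independent of
    anything we know about the visitor."""
    rng = random.Random(seed)
    return {visitor: rng.choice(variants) for visitor in visitor_ids}

def conversion_rate(outcomes):
    """Fraction of visitors in a group who converted (outcomes are 0 or 1)."""
    return sum(outcomes) / len(outcomes)

def estimated_effect(assignment, converted):
    """Difference in conversion rate between version B and version A."""
    groups = {}
    for visitor, variant in assignment.items():
        groups.setdefault(variant, []).append(converted[visitor])
    return conversion_rate(groups["B"]) - conversion_rate(groups["A"])

# Random assignment for 1000 hypothetical visitor ids.
assignment = assign_variants(range(1000))

# Toy outcomes, chosen by hand only to show the arithmetic.
toy_assignment = {"u1": "A", "u2": "A", "u3": "B", "u4": "B"}
toy_converted = {"u1": 1, "u2": 0, "u3": 1, "u4": 1}
effect = estimated_effect(toy_assignment, toy_converted)
print(f"estimated effect of B over A: {effect:+.2f}")
```

Because the version each visitor sees is decided by a coin flip rather than by the visitor, a difference between the two groups can be attributed to the site version instead of to some hidden common cause. A real experiment would also need enough visitors and a significance test, which this sketch leaves out.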


Conclusion


Like any discipline, machine learning has a lot of "folk wisdom" that can be difficult to come by, but is crucial for success. This article summarized some of the most salient items. Of course, it is only a complement to the more conventional study of machine learning. Check out http://www.cs.washington.edu/homes/pedrod/class for a complete online machine learning course that combines formal and informal aspects. There is also a treasure trove of machine learning lectures at http://www.videolectures.net. A good open source machine learning toolkit is Weka [24].

Happy learning!
