(Repost) Notes on Representation Learning

This blog post is from: https://opendatascience.com/blog/notes-on-representation-learning-1/
By Zac Kriegman, Senior D...
 

Notes on Representation Learning

 


 

TL;DR: Representation learning can eliminate the need for large labeled data sets to train deep neural networks, opening up new domains to machine learning and transforming the practice of Data Science.

Check out “Notes on Representation Learning” in these three parts.

  1. Notes on Representation Learning
  2. Notes on Representation Learning Continued
  3. Representation Learning Bonus Material

Deep Learning and Labeled Datasets

The greatest strength of Deep Learning (DL) is also one of its biggest weaknesses. DL models frequently have many millions of parameters. The extreme number of parameters—compared to other sorts of machine learning models—gives DL models tremendous flexibility to learn arbitrarily complex functions that simpler models cannot learn. But this flexibility makes it very easy to “overfit” on a training set (essentially, memorize specific examples instead of learning underlying patterns that allow generalization to examples not in the training set).

The conceptually simplest way to prevent overfitting is to train on very large datasets.  If the dataset is big in relation to the number of parameters, then the network will not have enough capacity to memorize examples and will be “forced” to instead learn underlying patterns when optimizing a loss function.  But creating large, labeled datasets for every task we want to perform is cost-prohibitive (and may even be impossible if the goal is general-purpose intelligent agents).

This need for large training sets is often the biggest obstacle to applying DL to real-world problems. On small datasets, other types of models can outperform DL to the extent that the constraints of those models match the task at hand. For instance, if there is a simple linear relationship in the data, a linear regression can greatly outperform a DL model trained on a small dataset because the linear constraint of the model corresponds to the data.

Figure 1. Neural nets have a tendency to overfit when datasets are too small.  Here the true relationship between the height and weight of an animal and whether it is a dog or a cat is essentially linear.  A linear classifier assumes this relationship and uses the data merely to determine the slope and intercept.  A large neural network requires much more data to learn a straight-line partition.  With a dataset that is small relative to the size of the neural network, it will overfit on unusual examples, reducing predictive performance. (Source: https://kevinbinz.files.wordpress.com/2014/08/ml-svm-after-comparison.png)

 

That correspondence allows the model to learn from a small dataset much more efficiently than a DL model, because a DL model needs to learn the linear relationship whereas a linear regression simply assumes it. Simple linear classifiers are sufficient for a simple problem like the one illustrated above; more complex problems, however, require models capable of modeling complex relationships within the data.  Much of the work in applying machine learning involves choosing models with constraints and power that match the dataset.  While DL has dramatically outperformed all other models on many tasks, to a large extent it has only done so for complex problems where big labeled datasets are available for training.
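The trade-off described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ground-truth line, the noise level, and the ten training points are all made up, and a polynomial that passes through every training point stands in for a high-capacity model that memorizes its training set (it is not a neural network).

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def interpolate(xs, ys, x):
    """Lagrange polynomial through every training point (pure memorization)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def true_fn(x):
    # Hypothetical ground truth: a simple linear relationship.
    return 2.0 * x + 1.0

rng = random.Random(0)
train_x = [float(i) for i in range(10)]                     # only 10 labeled examples
train_y = [true_fn(x) + rng.gauss(0.0, 1.0) for x in train_x]
test_x = [i / 10 for i in range(91)]                        # dense grid inside [0, 9]
test_y = [true_fn(x) for x in test_x]

a, b = fit_line(train_x, train_y)
linear_train = mse([a * x + b for x in train_x], train_y)
linear_test = mse([a * x + b for x in test_x], test_y)
flex_train = mse([interpolate(train_x, train_y, x) for x in train_x], train_y)
flex_test = mse([interpolate(train_x, train_y, x) for x in test_x], test_y)

print("linear   train/test MSE:", linear_train, linear_test)
print("flexible train/test MSE:", flex_train, flex_test)
```

The flexible model reproduces the noisy training labels exactly, yet it tracks the underlying line less well between those points; the linear model accepts some training error because it assumes the right shape from the start.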

 

Representation Learning

This blog post describes how the need for large, labeled datasets to train DL models is coming to an end. Over the last year there have been many research results demonstrating how DL models can learn much more efficiently than other models, outperforming alternatives even with very small labeled training sets. Indeed, in some remarkable cases, described below, DL models can learn to perform complex tasks with only a single labeled example (“one-shot learning”) or even without any labeled data at all (“zero-shot learning”). Over the next few years, these research results will be rolled out to production systems, and further innovations will continue to improve data efficiency even more.

 

The key to this progress is what DL researchers call “representation learning”, a topic considered so important that prominent researchers named the premier DL conference the International Conference on Learning Representations. Part of the enthusiasm for learning representations is that rather than training DL models on labeled data specific to a target task, you can train them on labeled data for a different problem, or more importantly, on unlabeled data. In the process of training on unlabeled data, the model builds up a reusable internal representation of the data. For instance, in an image classification example (further described below), a network first learns to generate bedroom scenes. To do this convincingly it must develop an internal representation of the world: its 3-dimensional structure, visual perspective, interior design, typical bedroom furniture, etc. In other words, using unsupervised learning (on unlabeled data) the model builds an understanding of how the world of bedrooms actually works in order to produce pictures of bedrooms. Once a network has an internal representation like this it can learn to recognize objects in images much more easily.  Learning to recognize a “bed” could become almost as simple as learning to associate the word “bed” with an object that the network already knows a lot about: its 3-dimensional shape, colors, location in rooms, typical surrounding furniture, etc.  As a result, instead of needing hundreds or thousands of labeled examples of objects, the model could learn from just a handful of examples.

Figure 2.  Bedroom scene generated by a DL model.  No information about bedrooms, bedroom furniture, lighting, visual perspective, etc. was programmed into the network but it learns enough about those things to produce realistic looking images and plausible bedroom arrangements purely by training on bedroom images.  (Source: https://arxiv.org/pdf/1511.06434v2.pdf)
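The idea of learning structure from unlabeled data and then attaching labels with only a handful of examples can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the data is synthetic, the "cat" and "dog" labels are hypothetical, and a tiny 2-means clustering step stands in for the deep unsupervised pre-training discussed above (it is far simpler than a generative image model).

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def nearest(centroids, p):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda k: dist2(centroids[k], p))

def pretrain(unlabeled, steps=10):
    """Unsupervised step: 2-means clustering, no labels involved."""
    first = unlabeled[0]
    far = max(unlabeled, key=lambda p: dist2(first, p))
    centroids = [first, far]
    for _ in range(steps):
        groups = [[], []]
        for p in unlabeled:
            groups[nearest(centroids, p)].append(p)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

# 200 unlabeled points drawn around two synthetic centers.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 10.0)] * 100
unlabeled = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for cx, cy in centers]
centroids = pretrain(unlabeled)

# Supervised step: a single labeled example per class names each cluster.
labeled = [((0.3, -0.2), "cat"), ((9.6, 10.4), "dog")]
names = {nearest(centroids, p): label for p, label in labeled}

def classify(p):
    return names[nearest(centroids, p)]

print(classify((1.0, 0.5)), classify((9.0, 11.0)))
```

The two labeled points do no more than give names to structure that was already discovered without labels, which is the sense in which a good representation reduces the amount of labeled data needed.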

Breakthroughs in representation learning herald a sea-change in machine learning that will help unlock the insights of the big-data era. Today data scientists work by carefully choosing machine learning models with constraints that match their problem domain and then painstakingly tuning those models to squeeze out every last drop of learning available from small labeled datasets. Over the coming years that workflow will move to selecting DL models pre-trained on enormous unlabeled datasets to build up internal representations, and then training on just a handful of labeled examples to solve the task at hand.  Instead of just choosing the right model, machine learning practitioners will choose a model and a prepackaged representation already trained up on related data. This workflow is already common in image recognition, where deep learning has been dominant for some time, and in certain NLP domains, like parsing, and is spreading to other domains.

 

As we continue to transition to this new paradigm, the number of problems we can solve with machine learning will explode.  Right now, we are bumping up against the limits of what simple, highly constrained models with no learned internal representation of the world can accomplish. We can’t squeeze more blood from that stone; big jumps in capability will instead come from models that have some understanding of the world and that can thus interpret data within a larger, more meaningful context. The way forward is not magic new machine learning models that can squeeze more accuracy out of labeled datasets without any understanding of the world those datasets come from, but rather pre-trained models that bring an understanding of the world in which they are operating.

 

Examples of Recent Representation Learning Progress

Here I describe some of the remarkable advances being made in representation learning and how they are increasing the data efficiency of DL models.  In all of these examples, DL models were able to learn with much less labeled data than simpler alternatives require.  Though this is a small sample, I’ve tried to select examples across different problem domains (natural language processing, image classification, and intelligent agents) and learning types (supervised, semi-supervised, unsupervised, and reinforcement) to illustrate the variety of approaches which are seeing impressive success.

 

Transfer Learning with Progressive Neural Networks

Progressive Neural Networks (“PNNs”) are DL models specially modified to be able to (1) learn multiple tasks from different datasets in sequence without forgetting tasks learned earlier in the sequence and (2) reuse the representations learned from earlier tasks to accelerate the learning of subsequent tasks. Reusing representations like this from one task to another in order to accelerate subsequent learning is called “transfer learning”, a recurring theme in the examples below.  Progressive Neural Networks employ transfer learning to improve data efficiency when learning new tasks—i.e. new tasks are learned with much less labeled data.

 

To understand the value of transfer learning, you could imagine, for example, a convolutional network learning low level features like edges that are aggregated up to parts of faces like ears and mouths, and ultimately to whole faces.  Later, you may retrain that network on a different visual recognition task, perhaps recognition of cars. In that case, the network may be able to mostly reuse the low-level edge features while overwriting the higher-level features to aggregate edges into car parts instead of into face parts.

Figure 3. (Source: http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf)

Retraining a normal convolutional network like this, and, in the process, overwriting some previously learned features, is called fine-tuning.  While transfer learning by fine-tuning has seen very successful application, it has some important drawbacks. Most importantly, we may want to transfer knowledge from multiple tasks to a new task. However, during fine-tuning, the ability to do the first task can be catastrophically forgotten when learning the second one. Imagine that after training a model on faces and then cars, as in the example above, you subsequently want to train the model to recognize people while in their cars (perhaps for a traffic enforcement application). Many of the features that aggregate edges into human facial features like eyes, ears, and mouths could have been reused, but unfortunately they were destroyed when learning car features. This means that when learning a third task the network may not be able to draw on useful features learned in the first task. The goal behind PNNs is to have a network that can continue learning from diverse datasets, continually expanding its knowledge as it goes.

 

If you already understand simple Feedforward Neural Networks, PNNs are not hard to conceptualize. Each task the network learns is allocated a “column” of the network, which is a full multi-layered feedforward network. After learning a task, the associated column is frozen so that it cannot be affected by training on future tasks, and a new column is added for the next task. Each layer in the new column gets input not only from lower layers within the new column, but also from lower layers within the frozen columns previously trained on other tasks. This allows the network to take advantage of features it has learned for other tasks, and repurpose them for new tasks without losing knowledge about the previous tasks.

 

Figure 4.  (Source: https://arxiv.org/pdf/1606.04671v3.pdf)
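The column structure described above can be sketched as a forward pass. This is a simplified illustration, not the authors' implementation: the layer sizes and random weights are hypothetical, no training is performed, and the lateral connections are plain linear maps, so any further details of the paper's architecture are not modeled.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def forward(x, own, laterals=(), earlier=()):
    """Run one column and return the activations of each of its layers.

    own[i]         -- this column's weight matrix for layer i
    laterals[k][i] -- weights from layer i-1 of earlier column k into this
                      column's layer i (entry 0 is None: all columns share x)
    earlier[k]     -- activations already computed by frozen column k
    """
    activations, h = [], x
    for i, W in enumerate(own):
        z = matvec(W, h)
        if i > 0:
            for U, prev in zip(laterals, earlier):
                z = [a + b for a, b in zip(z, matvec(U[i], prev[i - 1]))]
        h = relu(z)  # ReLU on every layer, including the last, to keep the sketch short
        activations.append(h)
    return activations

rng = random.Random(0)
sizes = [3, 4, 4, 2]  # input, two hidden layers, output (hypothetical sizes)

def new_column():
    return [rand_matrix(rng, sizes[i + 1], sizes[i]) for i in range(3)]

column1 = new_column()  # stands for the column trained on task 1, then frozen
column2 = new_column()  # fresh column added for task 2
lateral = [None] + [rand_matrix(rng, sizes[i + 1], sizes[i]) for i in (1, 2)]

x = [0.5, -1.0, 2.0]
acts1 = forward(x, column1)                                    # frozen features
acts2 = forward(x, column2, laterals=[lateral], earlier=[acts1])
print(acts2[-1])
```

When training on the second task, only `column2` and `lateral` would be updated. Because `column1` is never modified, its behavior on the first task cannot be overwritten, while its features remain available to the new column.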

 

The result is a network architecture that can often learn new skills with much less training data than a network learning from scratch, or even than a network pre-trained on one previous task and then fine-tuned.  The authors of the PNN paper demonstrated dramatic improvements in the data efficiency of their AI agents:

Figure 5.  Tests of Progressive Neural Networks on variations of the Atari game Pong, illustrating how they learn more efficiently compared to two baselines: Base1, a single column trained on the target task, and Base3, a single column pre-trained and then fine-tuned on the target task. (Source: https://arxiv.org/pdf/1606.04671v3.pdf)

Zero Shot Natural Language Translation

Recently there have also been some great examples of transfer learning in the NLP space.  A couple of months ago Google announced that it is rolling out DL models for machine translation, called Google Neural Machine Translation (GNMT), to replace the phrase-based models that used to be state-of-the-art. GNMT models use a pair of recurrent neural networks: (1) an encoder that reads in words one at a time and produces a series of vectors representing all words read to that point, and (2) a decoder which reads the encoded vectors and outputs the translation (with an attention mechanism allowing the decoder to focus on the most important encoded vectors for each word it outputs).  This method resulted in dramatic translation improvements for all language pairs, in some cases approaching human level.

Figure 6. According to their paper GNMT “reduced translation errors by an average of 60% compared to Google’s phrase-based production system.” (Source: https://research.googleblog.com/2016/09/a-neural-network-for-machine.html)
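The attention step mentioned in the description of the decoder can be sketched as follows. This is a generic illustration, not GNMT's actual code: the vectors are made up, and the scoring function (a plain dot product) is an assumption chosen for brevity and is not claimed to be the one GNMT uses.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_vectors):
    """Weight each encoder vector by its relevance to the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, vec))
              for vec in encoder_vectors]
    weights = softmax(scores)
    dim = len(encoder_vectors[0])
    context = [sum(w * vec[k] for w, vec in zip(weights, encoder_vectors))
               for k in range(dim)]
    return weights, context

# Hypothetical encoder outputs for a three-word source sentence.
encoder_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 4.0]                # hypothetical decoder state

weights, context = attend(decoder_state, encoder_vectors)
print(weights, context)
```

The weights sum to one, and the encoder vector most aligned with the decoder state receives the largest share. The decoder would then use the weighted average (`context`) when producing its next output word.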

A few weeks ago, Google researchers published an impressive paper describing how they made a trivial modification of their GNMT architecture that allowed them to use a single network to translate all language pairs, instead of training a separate network for each language pair.  To accomplish this, they simply prepended to each input sentence a token indicating the desired target language, and then trained on multiple language pairs at once.  This token provides the additional information that the decoder network needs in order to output the appropriate language.  Not only were they able to train a single network to translate between many different languages, but they used the same size network as they would normally use for a single language pair, thereby dramatically reducing the number of parameters used for the entire collection of languages.
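As a rough sketch of the data preparation this implies, the snippet below prepends a target-language token to each source sentence. The `<2es>`-style token format follows the examples in the multilingual GNMT paper; the helper name `add_target_token` and the two toy sentence pairs are hypothetical.

```python
def add_target_token(source_sentence, target_lang):
    """Prefix the source sentence with a token naming the target language."""
    return f"<2{target_lang}> {source_sentence}"

# Toy training data: (source language, target language, source, target).
training_pairs = [
    ("pt", "en", "Olá, como vai você?", "Hello, how are you?"),
    ("en", "es", "Hello, how are you?", "Hola, ¿cómo estás?"),
]

# One mixed training set for a single model, instead of one model per pair.
training_examples = [
    (add_target_token(src, tgt_lang), tgt)
    for _, tgt_lang, src, tgt in training_pairs
]
seen_pairs = {(src_lang, tgt_lang) for src_lang, tgt_lang, _, _ in training_pairs}

# A request for a pair that never appears in training (Portuguese to Spanish).
zero_shot_input = add_target_token("Olá, como vai você?", "es")
print(training_examples[0][0])
print(zero_shot_input, ("pt", "es") in seen_pairs)
```

Nothing in the input names the source language, so a request for an unseen pair has exactly the same form as the training examples. That is what makes it possible to ask the trained network for such a translation at all.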

 

The really interesting part is that after training on many language pairs, the network was able to translate between language pairs it had never seen or been trained on. In other words, it achieved “zero-shot” translation.  The implication is that after training on a number of language pairs, the network develops its own “universal interlingua representation” of the meaning of source sentences independent of the source language.  Once it has this representation of the meaning of the sentence it can translate it to any target language it knows about, regardless of whether it has ever seen the source-target combination.

 

To verify that the neural network actually creates this interlingua representation, the authors used t-SNE to plot a 2-dimensional representation of the intermediate vectors connecting the encoding and decoding networks.  Below, in figure (7a) each color represents the intermediate vectors produced when translating semantically identical sentences in English, Korean and Japanese.  (Each vector is a dot, and vectors produced in a series as part of translating a single sentence from one language are connected by a line.)  The fact that similarly colored (and thus semantically identical) sentences are clustered near each other illustrates that the neural network has understood them to have similar meanings and therefore produces similar intermediate vectors (in its interlingua representation).  Figure (7b) zooms in on one example, and figure (7c) re-colors that example to distinguish between the semantically identical sentences in the three different languages.

Figure 7.  (Source: https://arxiv.org/pdf/1611.04558v1.pdf)
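The t-SNE plot itself is not reproduced here. As a simpler stand-in, the clustering claim can be stated numerically: vectors for sentences with the same meaning should be more similar to each other than to vectors for sentences with a different meaning. The vectors below are synthetic and built to show the check, not taken from any model.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Synthetic "intermediate vectors": two meanings, each expressed in three languages.
vectors = {
    ("meaning_a", "en"): [1.0, 0.1, 0.0],
    ("meaning_a", "ko"): [0.9, 0.2, 0.1],
    ("meaning_a", "ja"): [1.0, 0.0, 0.1],
    ("meaning_b", "en"): [0.0, 1.0, 0.2],
    ("meaning_b", "ko"): [0.1, 0.9, 0.1],
    ("meaning_b", "ja"): [0.0, 1.0, 0.0],
}

within, across = [], []
for (key1, vec1), (key2, vec2) in combinations(vectors.items(), 2):
    # Same meaning in different languages goes in `within`, everything else in `across`.
    (within if key1[0] == key2[0] else across).append(cosine(vec1, vec2))

print("lowest same-meaning similarity:      ", min(within))
print("highest different-meaning similarity:", max(across))
```

If a real model's intermediate vectors behaved like this, with every same-meaning pair more similar than any different-meaning pair regardless of language, that would be consistent with the language-independent representation the authors describe.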

 

The takeaway here is that the network developed an internal representation of the problem domain—of the meaning (semantics) of the sentences represented independently of the particular vocabulary or grammar of a language.  That representation turned out to be so rich that it enabled the network to translate between language pairs with no labeled training data.  The network transferred its learning from language pairs it had seen to pairs that it had never seen before.

 

The second in the series is available here, and some bonus material here.


©ODSC2017

 

 