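Understanding the Deep Reinforcement Learning Algorithm A3C (Actor-Critic Algorithm) in One Article

2017-12-25 16:29:19

I have always felt that I only half understood the A3C algorithm, so I am sorting it out and recording it here, also as a reference for anyone who wants to learn it.

To understand this algorithm clearly, you need a fairly solid grasp of DRL algorithms. I recommend getting familiar with Deep Q-learning and Policy Gradient first.

DRL algorithms can be roughly divided into two categories, value-based and policy-based, whose classic algorithms are Q-learning and the Policy Gradient method, respectively.

A3C, the subject of this article, combines a policy with a value function. The pros and cons of policy-based methods can be summarized as follows:

Advantages:
1. Better convergence properties
2. Effective in high-dimensional or continuous action spaces
3. Can learn stochastic policies

Disadvantages:
1. Typically converge to a local rather than global optimum
2. Evaluating a policy is typically inefficient and has high variance

Let us start with some background.

In the basic RL setting there are an agent, an environment, actions, states, and rewards. The agent interacts with the environment and thereby produces a trajectory: executing an action changes the state of the environment, s -> s', and the environment then gives the agent a reward (positive or negative) for the chosen action. Through repeated interaction the agent accumulates more and more experience and uses it to update its policy, which closes the loop. For simplicity we only consider a deterministic environment, i.e. choosing action a in state s always leads to the same state s'.

For clarity, we first define some notation:
1. The stochastic policy $\pi(s)$ determines the agent's action. Its output is not a single action but a probability distribution over actions, which sums to 1.
2. $\pi(a|s)$ is the probability of choosing action a in state s.

The policy $\pi$ we want to learn is therefore a function of the state s that returns the probabilities of all actions.

The agent's goal is to maximize the reward it can obtain, and we express this with the expected reward. Under a probability distribution P, the expectation of a value X is:

$$E_P[X] = \sum_i P_i X_i$$

where $X_i$ are all the possible values of X and $P_i$ is the probability of each value. The expectation is thus the average of the values $X_i$ weighted by $P_i$.

One important point here: if we had a pool of values X, the ratio of which was given by P, and we randomly picked a number of these, we would expect the mean of them to be $E_P[X]$. And the mean would get closer to $E_P[X]$ as the number of samples rises.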
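To make the last point concrete, here is a minimal sketch in plain Python that compares the exact expectation with the mean of random draws. The values, probabilities, sample size and seed are made up for illustration and are not from the original post.

```python
import random


def expectation(values, probs):
    """Exact expectation: sum over i of P_i * X_i."""
    return sum(p * x for x, p in zip(values, probs))


def sample_mean(values, probs, n, seed=0):
    """Mean of n values drawn at random according to probs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    draws = rng.choices(values, weights=probs, k=n)
    return sum(draws) / n


# Hypothetical distribution, chosen only for this example.
values = [1.0, 2.0, 3.0]
probs = [0.2, 0.5, 0.3]

exact = expectation(values, probs)
approx = sample_mean(values, probs, n=100_000)
print("exact expectation:", exact)
print("sample mean      :", approx)
```

With more samples the sample mean drifts closer to the exact expectation, which is the property the policy gradient estimate relies on later.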
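Next we define the value function V(s) of a policy $\pi$ as the expected discounted return. It can be written recursively:

$$V(s) = E_{\pi(s)}[\, r + \gamma V(s') \,]$$

This says that the return obtainable from the current state s is the reward r received during the transition plus the discounted return obtainable from the next state s'.

There is also the action-value function Q(s, a), which is closely related to the value function (recall that the environment is deterministic):

$$Q(s, a) = r + \gamma V(s')$$

We can now define a new function A(s, a), called the advantage function:

$$A(s, a) = Q(s, a) - V(s)$$

It expresses how good it is to choose action a in state s. If action a is better than average, the advantage is positive; otherwise it is negative.

Policy Gradient:

When we build a DQN agent, we use a neural network to estimate the function Q(s, a). Here we take a different approach: since the policy $\pi$ is a function of the state $s$, we can estimate the policy directly from the state.

The input of our network is the state s and the output is an action probability distribution $\pi_\theta$.

[Figure: the policy network, mapping the state s to $\pi_\theta$]

In practice we can either sample an action from this distribution or simply pick the action with the highest probability.

To obtain a better policy, however, we have to update it. How do we optimize it? We need a metric that tells us how good a policy is.

We define a function $J(\pi)$ as the discounted reward a policy can obtain, averaged over all possible starting states s0 (here $\rho^{s_0}$ denotes the distribution of starting states):

$$J(\pi) = E_{\rho^{s_0}}[V(s_0)]$$

This function expresses well how good a policy is. The problem is that it is hard to estimate. The good news is: we don't have to.

All we really care about is how to improve it. If we know the gradient of this function, that becomes trivial.

There is a convenient way to compute this gradient (here $\rho^\pi$ denotes the distribution of states visited under $\pi$):

$$\nabla_\theta J(\pi) = E_{s \sim \rho^\pi,\ a \sim \pi(s)}[\, A(s, a) \cdot \nabla_\theta \log \pi(a|s) \,]$$

The jump from the objective function to this gradient is rather abrupt. Let us skip the derivation for now and take the result as given; I will give a more detailed derivation later.

For reference, see the original Policy Gradient paper, Policy Gradient Methods for Reinforcement Learning with Function Approximation, or David Silver's YouTube lecture: https://www.youtube.com/watch?v=KHZVXao4qXs

In short, there are two terms inside the expectation:

The first term is the advantage function, i.e. the advantage of choosing that action. It is negative when the action is worse than average and positive when it is better than average. It is a scalar.

The second term tells us the direction in which $\log \pi(a|s)$ increases.

Multiplying the two, we find that the likelihood of actions that are better than average is increased, and the likelihood of actions worse than average is decreased.

Fortunately, running an episode with a policy π yields samples distributed exactly as we need. States encountered and actions taken are indeed an unbiased sample from the $\rho^\pi$ and $\pi(s)$ distributions. That's great news. We can simply let our agent run in the environment and record the (s, a, r, s') samples. When we gather enough of them, we use the formula above to find a good approximation of the gradient $\nabla_\theta J(\pi)$.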
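As a rough illustration of the sample-based gradient estimate, here is a minimal sketch in plain Python. It makes strong simplifying assumptions that are mine, not the post's: there is a single state, the policy is a softmax over a vector of logits, and the parameters $\theta$ are those logits themselves. Under these assumptions the gradient of $\log \pi(a|s)$ with respect to the logits is onehot(a) minus $\pi(s)$. The function names and the sample values are hypothetical.

```python
import math


def softmax(logits):
    """Turn logits into a probability distribution over actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def policy_gradient_estimate(logits, samples):
    """Average of A(s, a) * grad log pi(a|s) over recorded samples.

    samples: list of (action, advantage) pairs (single-state assumption).
    """
    grad = [0.0] * len(logits)
    for action, advantage in samples:
        g = grad_log_softmax(logits, action)
        for k in range(len(grad)):
            grad[k] += advantage * g[k] / len(samples)
    return grad


# Hypothetical samples: action 0 was better than average, action 1 worse.
logits = [0.0, 0.0]
samples = [(0, 1.0), (1, -1.0)]
grad = policy_gradient_estimate(logits, samples)
print(grad)
```

A gradient ascent step along `grad` raises the logit of the better-than-average action and lowers the other one, which matches the interpretation of the two terms given above.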
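We can then use any of the existing gradient-based optimization techniques to improve our policy.

Actor-Critic:

The first thing we need to compute is the advantage function A(s, a). Expanding it:

$$A(s, a) = Q(s, a) - V(s) = r + \gamma V(s') - V(s)$$

A sample from a single run gives us an unbiased estimate of Q(s, a). So all we additionally need in order to compute A(s, a) is V(s).

This value function is easy to approximate with a neural network, just as the action-value function is estimated in DQN. It is in fact simpler, because each state has only one value.

We can predict the value function and the policy jointly with a single network.

[Figure: the final network architecture]

Here we have two things to optimize, the actor and the critic:
actor: optimizes the policy so that it performs better and better;
critic: tries to estimate the value function more and more accurately.

These come from the Policy Gradient Theorem, in which $\pi(a|s)$ plays the role of the actor and $A(s, a)$ the role of the critic:

$$\nabla_\theta J(\pi) = E_{s \sim \rho^\pi,\ a \sim \pi(s)}[\, A(s, a) \cdot \nabla_\theta \log \pi(a|s) \,]$$

Put simply: the actor performs an action, and the critic evaluates it, saying whether the choice was good or bad.

Parallel agents:

If samples are collected by a single agent only, they are very likely to be highly correlated, which causes trouble for a machine learning model: such models are trained under the assumption that samples are independent and identically distributed, not highly correlated like this. In DQN, experience replay was introduced to overcome this difficulty. But that makes the learning offline, because you first sample, then store the samples, and only then update your parameters.

So the question is: can we learn online and still break this strong correlation?

We can run several agents in parallel, each with its own copy of the environment, and use their samples as they arrive.
1. Different agents will likely experience different states and transitions, thus avoiding the correlation.
2. Another benefit is that this approach needs much less memory, because we don't need to store the samples.

There is one more important concept: the N-step return.

Usually, when we compute Q(s, a), V(s) or A(s, a), we only use the 1-step return:

$$V(s_0) \rightarrow r_0 + \gamma V(s_1)$$

In this case we use the immediate reward from the sample (s0, a0, r0, s1), and the function's own prediction for the next state supplies the rest of the estimate. But we can use more steps to obtain another estimate:

$$V(s_0) \rightarrow r_0 + \gamma r_1 + \gamma^2 V(s_2)$$

or the n-step return:

$$V(s_0) \rightarrow r_0 + \gamma r_1 + \dots + \gamma^{n-1} r_{n-1} + \gamma^n V(s_n)$$

The n-step return has an advantage that changes in the approximated function get propagated much more quickly. Let's say that the agent experienced a transition with unexpected reward. In the 1-step return scenario, the value function would only change slowly, one step backwards with each iteration. In the n-step return, however, the change is propagated n steps backwards each iteration, thus much quicker.

N-step return has its drawbacks.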
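It has higher variance because the value depends on a chain of actions which can lead into many different states. This might endanger the convergence.

This is the Asynchronous Advantage Actor-Critic algorithm (A3C).

That was the algorithmic side of A3C. Now let us look at it from the coding side.

For an implementation based on Python + Keras + gym, see this GitHub link: https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py

[Figure: overview of the overall flow]

The most important part is the definition of the loss function:

$$L = L_\pi + c_v L_v + c_{reg} L_{reg}$$

where $L_\pi$ is the loss of the policy, $L_v$ is the value error and $L_{reg}$ is a regularization term. These parts are multiplied by constants $c_v$ and $c_{reg}$, which determine what part we stress more.

The three parts are described in turn below.

1. Policy Loss:

We define the objective function $J(\pi)$ as:

$$J(\pi) = E_{\rho^{s_0}}[V(s_0)]$$

This is the total reward an agent can achieve under policy $\pi$, averaged over all starting states.

By the Policy Gradient Theorem, the gradient of this function is:

$$\nabla_\theta J(\pi) = E_{s \sim \rho^\pi,\ a \sim \pi(s)}[\, A(s, a) \cdot \nabla_\theta \log \pi(a|s) \,]$$

We want to maximize this function, so the corresponding loss is its negative:

$$L_\pi = -J(\pi)$$

Treating A(s, a) as a constant, we can rewrite the gradient in the following form:

$$\nabla_\theta J(\pi) = \nabla_\theta E_{s \sim \rho^\pi,\ a \sim \pi(s)}[\, A(s, a) \cdot \log \pi(a|s) \,]$$

We then average over all samples in the minibatch to approximate this expectation. The final loss can be written as:

$$L_\pi = -\frac{1}{n} \sum_{i=1}^{n} A(s_i, a_i) \cdot \log \pi(a_i|s_i)$$

2. Value Loss:

The true value function V(s) should satisfy the Bellman equation:

$$V(s_0) = r_0 + \gamma r_1 + \dots + \gamma^{n-1} r_{n-1} + \gamma^n V(s_n)$$

Our estimated V(s) should converge to it, so from the equation above we can compute the error:

$$e = (r_0 + \gamma r_1 + \dots + \gamma^{n-1} r_{n-1} + \gamma^n V(s_n)) - V(s_0)$$

This may seem unclear, and I was confused at first as well: where does the ground truth come from? In fact, the target is built from the sampled transitions, and the error measures the gap between the two value estimates, the bootstrapped n-step target and the current prediction V(s_0).

So we define $L_v$ as the mean squared error over all samples:

$$L_v = \frac{1}{n} \sum_{i=1}^{n} e_i^2$$

3. Regularization with Policy Entropy:

Why add this term? While the agent interacts with the environment we want to balance exploration and exploitation: we want to try other actions with some probability, so that the collected samples are not too concentrated. The entropy term is introduced to make the output distribution more balanced. For example:

the fully deterministic policy [1, 0, 0, 0] has entropy 0;
the totally uniform policy [0.25, 0.25, 0.25, 0.25] has the largest entropy possible for a distribution over four values.

To make the output distribution more balanced we want to maximize this entropy, which means minimizing the negative entropy.

All in all, we can use an existing deep learning framework to minimize this total loss and thereby optimize the network parameters.

Reference:
1. https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py
2. https://jaromiru.com/2017/03/26/lets-make-an-a3c-implementation/
3. https://www.youtube.com/watch?v=KHZVXao4qXs
4.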
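https://github.com/ikostrikov/pytorch-a3c

======================================================
Policy Gradient Method: computing the gradient of the objective function
======================================================

Reference paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation (NIPS 2000, MIT Press)

Many earlier algorithms were based on value functions. Although they made great progress, this approach has two limitations:

First, these methods aim to find a deterministic policy, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities.

Second, an arbitrarily small change in the estimated value of an action can determine whether or not that action is selected. Such discontinuous changes have been widely identified as a key obstacle to establishing convergence guarantees.

Policy gradient methods look at the problem from a different angle. Our goal is simply to learn a policy that maps states to actions, so do we really have to learn a value function first? We can feed a state into a neural network and have it output a distribution over actions, treating the network parameters as the adjustable parameters of the policy. Let $\rho$ denote the actual performance of the policy, i.e. the averaged reward per step. We can take the partial derivative of $\rho$ with respect to the parameters and update them accordingly, which is all we need to learn:

$$\Delta\theta \approx \alpha \frac{\partial \rho}{\partial \theta}$$

If the update above can be achieved, $\theta$ can usually be guaranteed to converge to a locally optimal policy. The paper provides an unbiased estimate of this gradient, obtained from experience with the help of an approximate value function that satisfies certain properties.

1. Policy Gradient Theorem

We consider the standard reinforcement learning framework, in which an agent interacts with an environment that satisfies the Markov property.

The state, action and reward at each time step $t \in \{0, 1, 2, \dots\}$ are denoted $s_t$, $a_t$, $r_t$. The dynamics of the environment are characterized by the state transition probabilities.

The notation for each concept and its meaning can be read off from the above.

====>> To be continued ...

======================================================
Pytorch for A3C
======================================================

This part continues with the PyTorch framework and looks at a concrete implementation at the code level. The code used here comes from: https://github.com/ikostrikov/pytorch-a3c

[Figure: layout of the code]

Let us look at a few core pieces of code:

main.py ====>> the important parameter settings and their initialization
train.py
model.py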
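The pieces of the A3C article above (the n-step return, the three loss terms and the policy entropy) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code of either repository linked above: the function names, the sample layout (a dict per sample) and the default constants `c_v=0.5` and `c_reg=0.01` are hypothetical, and a real implementation would compute these quantities on tensors inside a deep learning framework so that gradients can flow through $\log \pi$ and V.

```python
import math


def n_step_return(rewards, bootstrap_value, gamma):
    """r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * V(s_n)."""
    ret = bootstrap_value
    for r in reversed(rewards):  # fold from the last reward backwards
        ret = r + gamma * ret
    return ret


def entropy(probs):
    """Entropy of an action distribution; terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a3c_loss(samples, c_v=0.5, c_reg=0.01):
    """Total loss L = L_pi + c_v * L_v + c_reg * L_reg, averaged over samples.

    Each sample is a dict (hypothetical layout) with keys:
      'probs'  : pi(s), list of action probabilities
      'action' : index of the action taken
      'value'  : predicted V(s)
      'target' : n-step return used as the value target
    """
    n = len(samples)
    policy_loss = value_loss = reg_loss = 0.0
    for smp in samples:
        # Advantage estimate; treated as a constant in the policy loss.
        advantage = smp["target"] - smp["value"]
        policy_loss += -advantage * math.log(smp["probs"][smp["action"]])
        value_loss += advantage ** 2           # squared value error
        reg_loss += -entropy(smp["probs"])     # negative entropy
    return (policy_loss + c_v * value_loss + c_reg * reg_loss) / n


# Hypothetical usage: three rewards of 1, bootstrapped with V(s_3) = 8.
target = n_step_return([1.0, 1.0, 1.0], bootstrap_value=8.0, gamma=0.5)
batch = [{"probs": [0.5, 0.5], "action": 0, "value": 1.0, "target": 2.0}]
print(target, a3c_loss(batch))
print(entropy([1.0, 0.0, 0.0, 0.0]), entropy([0.25] * 4))
```

The two entropy calls reproduce the example from the article: the deterministic policy has zero entropy, and the uniform policy over four actions has the largest one.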
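18 Key Questions in Deep Reinforcement Learning

from: https://zhuanlan.zhihu.com/p/32153603

What are the problems of deep reinforcement learning? Where is it heading? Where can breakthroughs be made?

Over the past two days I read two heavyweight papers, A Brief Survey of Deep Reinforcement Learning and Deep Reinforcement Learning: An Overview. The authors cite more than 200 references to lay out the future directions of reinforcement learning. The originals summarize the common scientific questions in deep reinforcement learning and list the current solutions and related surveys; here I organize them and extract the relevant papers.

I have selected 18 key questions, covering topics such as search spaces, exploration and exploitation, policy evaluation, memory usage, network design, and reward signals. The article picks out 73 papers (27 from 2017 and 21 from 2016). For ease of reading, the original titles are placed at the end of the article and can be looked up by citation key.

TODO list: the content of the article is not rich enough yet, but the paper list is complete. Over the coming days I will collect the links to all the papers, download them, package them and upload them to Baidu Cloud, hopefully within a day or two. (2017/12/19)

Question 1: Prediction and policy evaluation

However things change, the Temporal Difference method remains the core philosophy of policy evaluation 【Sutton 1988】. The extensions of TD are as famous as TD itself: Q-learning from 1992 and DQN from 2015.

The flaw is that TD learning is prone to overestimation, for the following reason:

The max operator in standard Q-learning and DQN use the same values both to select and to evaluate an action. (van Hasselt)

van Hasselt likes to work on the overestimation problem: he first came up with Double Q-learning 【van Hasselt 2010】 at NIPS, and six years later with the deep learning version, Double DQN 【van Hasselt 2016a】.

Question 2: Control and finding the optimal policy

There are currently three schools of solutions.

[Figure: slide by Prof. Hung-yi Lee of National Taiwan University]

The most traditional approach is value-based: pick the action with the best value. The classic methods are Q-learning 【Watkins and Dayan 1992】 and SARSA 【Sutton and Barto 2017】.

Later, policy-based methods drew attention, starting with the REINFORCE algorithm 【Williams 1992】, followed by Policy Gradient 【Sutton 2000】.

The currently most popular approach, Actor-Critic 【Barto et al 1983】, combines the two. David Silver, Sutton's student and the chief designer of AlphaGo, proposed Deterministic Policy Gradient. On the surface it is PG, but it actually says a lot about AC. This improvement is known as DPG 【Silver 2014】.

[Figure: the mutually reinforcing loop of Actor-Critic]

Question 3: Instability and divergence (when combining off-policy learning, function approximation, and bootstrapping)

As early as 1997, Tsitsiklis proved that if the function approximator is a nonlinear black box such as a neural network, convergence and stability cannot be guaranteed.

The watershed paper on DQN 【Mnih et al 2013】 mentions that although the results look good, there is no theoretical basis for them (the original slyly phrases it the other way round):

This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.

[Figure: DQN conquering Atari games]

The improvements in DQN rely mainly on two tricks:
Experience replay 【Lin 1993】 (perfect i.i.d. samples are out of reach, but the correlation between data points should still be reduced as much as possible)
Target Network 【Mnih 2015】 (the Estimated Network and the Target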
Network must not update their parameters at the same time; a separate Target Network should be kept to ensure stability)

Since the network Q being updated is also used in calculating the target value, the Q update is prone to divergence. (This is why we use a Target Network.)

The following papers are all on DQN-related topics:
An upgraded experience replay: Prioritized Experience Replay 【Schaul 2016】
Better exploration strategies 【Osband 2016】
Speeding up DQN 【He 2017a】
Reducing variance and instability by averaging, Averaged-DQN 【Anschel 2017】

Beyond the scope of DQN:
Dueling DQN 【Wang 2016c】 (ICML 2016 best paper). Tip: before reading this paper, make sure you know DQN, Double DQN and Prioritized Experience Replay.
The asynchronous algorithm A3C 【Mnih 2016】
TRPO (Trust Region Policy Optimization) 【Schulman 2015】
Distributed Proximal Policy Optimization 【Heess 2017】
Combining policy gradient and Q-learning 【O'Donoghue 2017、Nachum 2017、 Gu 2017、Schulman 2017】
GTD 【Sutton 2009a、Sutton 2009b、Mahmood 2014】
Emphatic-TD 【Sutton 2016】

Question 4: Training perception and control jointly end-to-end

The existing solution is Guided Policy Search 【Levine et al 2016a】

Question 5: Data/sample efficiency

Existing solutions:
Q-learning and Actor-Critic
actor-critic with experience replay 【Wang et al 2017b】
PGQ, policy gradient and Q-learning 【O'Donoghue et al 2017】
Q-Prop, policy gradient with off-policy critic 【Gu et al 2017】
return-based off-policy control, Retrace 【Munos et al 2016】, Reactor 【Gruslys et al 2017】
learning to learn 【Duan et al 2017、Wang et al 2016a、Lake et al 2015】

Question 6: Reward function not available

Existing solutions mostly revolve around imitation learning:
Andrew Ng's inverse reinforcement learning 【Ng and Russell 2000】
learn from demonstration 【Hester et al 2017】
imitation learning with GANs 【Ho and Ermon 2016、Stadie et al 2017】 (its TensorFlow implementation is in imitation)
train dialogue policy jointly with reward model 【Su et al 2016b】

Question 7: The exploration-exploitation tradeoff (the most classic question)

Existing solutions:
unify count-based exploration and intrinsic motivation 【Bellemare et al 2017】
under-appreciated reward exploration 【Nachum et al 2017】
deep exploration via bootstrapped DQN 【Osband et al 2016】
variational information maximizing exploration 【Houthooft et al 2016】

Question 8: Model-based learning

Existing solutions:
The classic recommended in Sutton's textbook: Dyna-Q 【Sutton 1990】
Combining model-free and model-based methods 【Chebotar et al 2017】

Question 9: Model-free
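planning

There are two fairly recent solutions:
Value Iteration Networks 【Tamar et al 2016】, the paper that won the NIPS 2016 best paper award. There is already a dedicated explanatory article on Zhihu, Value iteration Network, and the TensorFlow implementation of VIN is in tensorflow-value-iteration-networks.
The Predictron method published by David Silver of DeepMind 【Silver et al 2016b】, whose TensorFlow implementation is predictron.

Question 10: Borrowing from other fields (focus on salient parts)

@贾扬清 (Yangqing Jia) once said:

One year after enrolling, PhD students in artificial intelligence at Berkeley take a qualifying exam that covers these topics: reinforcement learning and robotics; statistics and probabilistic graphical models; computer vision and image processing; speech and natural language processing; kernel methods and their theory; search, CSP, logic, planning, and so on. If you really want to work on artificial intelligence, I suggest getting to know all of them. You do not have to master every one, but you should at least reach the level where you can chat with people in front of a poster at a conference without making mistakes.

A good idea, therefore, is to draw inspiration from computer vision and natural language processing. For example, the unsupervised auxiliary learning methods mentioned below borrow many operations from RNN+LSTM.

A few pointers on the CV and NLP side: object detection 【Mnih 2014】, machine translation 【Bahdanau 2015】, image captioning 【Xu 2015】, replacing CNNs and RNNs with attention 【Vaswani 2017】, and so on.

Question 11: Data storage over long time, separating from computation

The best-known solution is the Differentiable Neural Computer, which made a splash in Nature 【Graves et al 2016】

Question 12: Benefiting from non-reward training signals in environments

Existing solutions revolve around unsupervised learning:
Horde 【Sutton et al 2011】
An outstanding piece of work: unsupervised reinforcement and auxiliary learning 【Jaderberg et al 2017】
learn to navigate with unsupervised auxiliary learning 【Mirowski et al 2017】
The famous GANs 【Goodfellow et al 2014】

Question 13: Learning knowledge from different domains

Existing solutions all revolve around transfer learning 【Taylor and Stone, 2009、Pan and Yang 2010、Weiss et al 2016】, learn invariant features to transfer skills 【Gupta et al 2017】

Question 14: Benefiting from both labelled and unlabelled data

Existing solutions all revolve around semi-supervised learning 【Zhu and Goldberg 2009】
learn with MDPs both with and without reward functions 【Finn et al 2017】
learn with expert's trajectories and those that may not be from experts 【Audiffren et al 2015】

Question 15: Learning, planning, and representing knowledge with spatio-temporal abstraction at multiple levels

Existing solutions: hierarchical reinforcement learning 【Barto and Mahadevan 2003】
strategic attentive writer to learn macro-actions 【Vezhnevets et al 2016】
integrate temporal abstraction with intrinsic motivation 【Kulkarni et al 2016】
stochastic neural networks for hierarchical RL 【Florensa et al 2017】
lifelong learning with hierarchical RL 【Tessler et al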
2017】

Question 16: Adapting rapidly to new tasks

Existing solutions are mostly learning to learn:
learn a flexible RNN model to handle a family of RL tasks 【Duan et al 2017、Wang et al 2016a】
one/few/zero-shot learning 【Duan et al 2017、Johnson et al 2016、 Kaiser et al 2017b、Koch et al 2015、Lake et al 2015、Li and Malik 2017、Ravi and Larochelle, 2017、Vinyals et al 2016】

Question 17: Gigantic search space

The existing solution is still Monte Carlo search; for details see the implementation of the first AlphaGo 【Silver et al 2016a】

Question 18: Neural network architecture design

Existing methods for network architecture search 【Baker et al 2017、Zoph and Le 2017】, of which Zoph's work carries a great deal of weight.
New architectures include 【Kaiser et al 2017a、Silver et al 2016b、Tamar et al 2016、Vaswani et al 2017、Wang et al 2016c】

References

Anschel, O., Baram, N., and Shimkin, N. (2017). Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In the International Conference on Machine Learning (ICML).
Audiffren, J., Valko, M., Lazaric, A., and Ghavamzadeh, M. (2015). Maximum entropy semi-supervised inverse reinforcement learning. In the International Joint Conference on Artificial Intelligence (IJCAI).
Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2017). An actor-critic algorithm for sequence prediction. In the International Conference on Learning Representations (ICLR).
Baker, B., Gupta, O., Naik, N., and Raskar, R. (2017). Designing neural network architectures using reinforcement learning. In the International Conference on Learning Representations (ICLR).
Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835–846.
Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. (2017). The Cramer Distance as a Solution to Biased Wasserstein Gradients. ArXiv e-prints.
Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. (2017). Combining model-based and model-free updates for trajectory-centric reinforcement learning. In the International Conference on Machine Learning (ICML).
Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. (2017). One-Shot Imitation Learning. ArXiv e-prints.
Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016a). A connection between GANs, inverse reinforcement learning, and energy-based models. In NIPS 2016 Workshop on Adversarial Training.
Florensa, C., Duan, Y., and Abbeel, P. (2017). Stochastic neural networks for hierarchical reinforcement learning. In the International Conference on Learning Representations (ICLR).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In the Annual Conference on Neural Information Processing Systems (NIPS), pages 2672–2680.
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., Badia, A. P., Hermann, K. M., Zwols, Y., Ostrovski, G., Cain, A., King, H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., and Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476.
Gruslys, A., Gheshlaghi Azar, M., Bellemare, M. G., and Munos, R. (2017). The Reactor: A Sample-Efficient Actor-Critic Architecture. ArXiv e-prints.
Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2017). Q-Prop: Sample-efficient policy gradient with an off-policy critic. In the International Conference on Learning Representations (ICLR).
Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. (2017). Learning invariant feature spaces to transfer skills with reinforcement learning. In the International Conference on Learning Representations (ICLR).
He, F.
S., Liu, Y., Schwing, A. G., and Peng, J. (2017a). Learning to play in a day: Faster deep reinforcement learning by optimality tightening. In the International Conference on Learning Representations (ICLR).
Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M., and Silver, D. (2017). Emergence of Locomotion Behaviours in Rich Environments. ArXiv e-prints.
Hester, T. and Stone, P. (2017). Intrinsically motivated model learning for developing curious robots. Artificial Intelligence, 247:170–86.
Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In the Annual Conference on Neural Information Processing Systems (NIPS).
Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. (2016). Vime: Variational information maximizing exploration. In the Annual Conference on Neural Information Processing Systems (NIPS).
Jaderberg, M., Mnih, V., Czarnecki, W., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In the International Conference on Learning Representations (ICLR).
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. (2016). Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. ArXiv e-prints.
Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., and Uszkoreit, J. (2017a). One Model To Learn Them All. ArXiv e-prints.
Kaiser, Ł., Nachum, O., Roy, A., and Bengio, S. (2017b). Learning to Remember Rare Events. In the International Conference on Learning Representations (ICLR).
Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In the International Conference on Machine Learning (ICML).
Kulkarni, T. D., Narasimhan, K. R., Saeedi, A., and Tenenbaum, J. B. (2016).
Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In the Annual Conference on Neural Information Processing Systems (NIPS).
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016a). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17:1–40.
Li, K. and Malik, J. (2017). Learning to optimize. In the International Conference on Learning Representations (ICLR).
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., & Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. Computer Science, 8(6), A187.
Lin, L. J. (1993). Reinforcement learning for robots using neural networks.
Mahmood, A. R., van Hasselt, H., and Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).
Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., Kumaran, D., and Hadsell, R. (2017). Learning to navigate in complex environments. In the International Conference on Learning Representations (ICLR).
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In the Annual Conference on Neural Information Processing Systems (NIPS).
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A.
K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In the International Conference on Machine Learning (ICML).
Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. In the Annual Conference on Neural Information Processing Systems (NIPS).
Nachum, O., Norouzi, M., and Schuurmans, D. (2017). Improving policy gradient by exploring under-appreciated rewards. In the International Conference on Learning Representations (ICLR).
Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. ArXiv e-prints.
Ng, A. and Russell, S. (2000). Algorithms for inverse reinforcement learning. In the International Conference on Machine Learning (ICML).
O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2017). PGQ: Combining policy gradient and q-learning. In the International Conference on Learning Representations (ICLR).
Osband, I., Blundell, C., Pritzel, A., and Roy, B. V. (2016). Deep exploration via bootstrapped DQN. In the Annual Conference on Neural Information Processing Systems (NIPS).
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In the International Conference on Learning Representations (ICLR).
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In the International Conference on Learning Representations (ICLR).
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P.
(2015). Trust region policy optimization. In the International Conference on Machine Learning (ICML).
Schulman, J., Abbeel, P., and Chen, X. (2017). Equivalence Between Policy Gradients and Soft Q-Learning. ArXiv e-prints.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In the International Conference on Machine Learning (ICML), pp. 387–395.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016a). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.
Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. (2016b). The predictron: End-to-end learning and planning. In NIPS 2016 Deep Reinforcement Learning Workshop.
Stadie, B. C., Abbeel, P., and Sutskever, I. (2017). Third person imitation learning. In the International Conference on Learning Representations (ICLR).
Sutton, R. S. and Barto, A. G. (2017). Reinforcement Learning: An Introduction (2nd Edition, in preparation). MIT Press.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In the International Conference on Machine Learning (ICML).
Sutton, R. S., Szepesvári, C., and Maei, H. R. (2009b). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).
Sutton, R.
S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17:1–29.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In the International Conference on Machine Learning (ICML).
Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In the Annual Conference on Neural Information Processing Systems (NIPS).
Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685.
Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. (2017). A deep hierarchical approach to lifelong learning in minecraft. In the AAAI Conference on Artificial Intelligence (AAAI).
van Hasselt, H. (2010). Double Q-learning. In Advances in Neural Information Processing Systems 23 (NIPS 2010).
van Hasselt, H., Guez, A., and Silver, D. (2016a). Deep reinforcement learning with double Q-learning. In the AAAI Conference on Artificial Intelligence (AAAI).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. ArXiv e-prints.
Vezhnevets, A. S., Mnih, V., Agapiou, J., Osindero, S., Graves, A., Vinyals, O., and Kavukcuoglu, K. (2016). Strategic attentive writer for learning macro-actions.
In the Annual Conference on Neural Information Processing Systems (NIPS).
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In the Annual Conference on Neural Information Processing Systems (NIPS).
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016a). Learning to reinforcement learn. arXiv:1611.05763v1.
Wang, S. I., Liang, P., and Manning, C. D. (2016b). Learning language games through interaction. In the Association for Computational Linguistics annual meeting (ACL).
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016c). Dueling network architectures for deep reinforcement learning. In the International Conference on Machine Learning (ICML).
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.
Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(9).
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In the International Conference on Machine Learning (ICML).
Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-supervised learning. Morgan & Claypool.
Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In the International Conference on Learning Representations (ICLR).
ResNet, AlexNet, VGG, Inception: Understanding various architectures of Convolutional Networks

by KOUSTUBH

this blog from: http://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/

Convolutional neural networks are fantastic for visual recognition tasks. Good ConvNets are beasts with millions of parameters and many hidden layers. In fact, a bad rule of thumb is: ‘higher the number of hidden layers, better the network’. AlexNet, VGG, Inception, ResNet are some of the popular networks. Why do these networks work so well? How are they designed? Why do they have the structures they have? One wonders. The answer to these questions is not trivial and certainly can’t be covered in one blog post. However, in this blog, I shall try to discuss some of these questions. Network architecture design is a complicated process; it takes a while to learn and even longer to experiment with designs of your own. But first, let’s put things in perspective:

Why are ConvNets beating traditional computer vision?

The task of image classification is to correctly assign a given image to one of a set of predefined categories. The traditional pipeline splits this process into two modules: feature extraction and classification.

Feature extraction involves extracting a higher level of information from raw pixel values that can capture the distinction among the categories involved. This feature extraction is done in an unsupervised manner wherein the classes of the image have nothing to do with the information extracted from pixels. Some of the traditional and widely used features are GIST, HOG, SIFT, LBP etc. After the feature is extracted, a classification module is trained with the images and their associated labels. A few examples of this module are SVM, Logistic Regression, Random Forest, decision trees etc.
The problem with this pipeline is that the feature extraction cannot be tweaked according to the classes and images. So if the chosen feature lacks the expressiveness needed to distinguish the categories, the accuracy of the classification model will certainly suffer, no matter which classification strategy you use. A common theme among the state of the art following the traditional pipeline has been to pick multiple feature extractors and club them inventively to get a better feature. But this involves too many heuristics as well as manual labor to tweak parameters according to the domain to reach a decent level of accuracy. By decent I mean reaching close to human-level accuracy. That's why it took years to build a good computer vision system (like OCR, face verification, image classifiers, object detectors etc.) that can work with a wide variety of data encountered during practical application, using traditional computer vision. We once produced better results using ConvNets for a company (a client of my start-up) in 6 weeks, which took them close to a year to achieve using traditional computer vision.

Another problem with this method is that it is completely different from how we humans learn to recognize things. Just after birth, a child is incapable of perceiving his surroundings, but as he progresses and processes data, he learns to identify things. This is the philosophy behind deep learning, wherein no hard-coded feature extractor is built in. It combines the extraction and classification modules into one integrated system, and it learns to extract discriminating representations from the images and to classify them based on supervised data.

One such system is the multilayer perceptron, aka a neural network: multiple layers of neurons densely connected to each other. A deep vanilla neural network has such a large number of parameters that it is impossible to train such a system without overfitting the model, due to the lack of a sufficient number of training examples.
But with Convolutional Neural Networks (ConvNets), the task of training the whole network from scratch can be carried out using a large dataset like ImageNet. The reason behind this is the sharing of parameters between the neurons and the sparse connections in convolutional layers. This can be seen in Figure 2. In the convolution operation, the neurons in one layer are only locally connected to the input neurons, and the set of parameters is shared across the 2-D feature map.

In order to understand the design philosophy of ConvNets, one must ask: what is the objective here?

a. Accuracy: If you are building an intelligent machine, it is absolutely critical that it be as accurate as possible. One fair objection here is that accuracy depends not only on the network but also on the amount of data available for training. Hence, these networks are compared on a standard dataset called ImageNet. The ImageNet project is an ongoing effort and currently has 14,197,122 images from 21,841 different categories. Since 2010, ImageNet has been running an annual competition in visual recognition where participants are provided with 1.2 million images belonging to 1000 different classes from the ImageNet dataset. So, each network architecture reports accuracy using these 1.2 million images of 1000 classes.

b. Computation: Most ConvNets have huge memory and computation requirements, especially while training. Hence, this becomes an important concern. Similarly, the size of the final trained model becomes important to consider if you are looking to deploy a model to run locally on mobile. As you can guess, it takes a more computationally intensive network to produce more accuracy. So, there is always a trade-off between accuracy and computation.

Apart from these, there are many other factors like ease of training, the ability of a network to generalize well etc.
The networks described below are the most popular ones. They are presented in the order in which they were published, and each improved on the accuracy of the earlier ones.

AlexNet
This architecture was one of the first deep networks to push ImageNet classification accuracy by a significant stride in comparison to traditional methodologies. It is composed of 5 convolutional layers followed by 3 fully connected layers, as depicted in Figure 1.

AlexNet, proposed by Alex Krizhevsky, uses ReLU (Rectified Linear Unit) for the non-linear part, instead of a Tanh or Sigmoid function, which was the earlier standard for traditional neural networks. ReLU is given by

f(x) = max(0,x)

The advantage of ReLU over sigmoid is that it trains much faster, because the derivative of sigmoid becomes very small in the saturating region and therefore the updates to the weights almost vanish (Figure 4). This is called the vanishing gradient problem. In the network, a ReLU layer is put after every convolutional and fully-connected (FC) layer.

Another problem that this architecture addressed was over-fitting, which it reduced by using a Dropout layer after every FC layer. A Dropout layer has a probability, (p), associated with it and is applied at every neuron of the response map separately. It randomly switches off the activation with probability p, as can be seen in Figure 5.

Why does DropOut work?
The idea behind dropout is similar to model ensembles. Due to the dropout layer, each set of switched-off neurons represents a different architecture, and all these different architectures are trained in parallel, with a weight given to each subset and the weights summing to one. For n neurons attached to DropOut, the number of subset architectures formed is 2^n. So it amounts to the prediction being averaged over these ensembles of models. This provides a structured model regularization which helps in avoiding over-fitting.
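As a rough illustration of the two ingredients described for AlexNet, the sketch below implements ReLU, compares its derivative with the sigmoid derivative in the saturating region, and applies dropout to a list of activations. The function names are hypothetical, the code uses plain Python lists instead of a deep-learning framework, and the rescaling that practical dropout implementations usually add is left out, since the text only describes switching activations off.

```python
import math
import random


def relu(x):
    """ReLU non-linearity: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; it shrinks towards 0 when |x| is large."""
    s = sigmoid(x)
    return s * (1.0 - s)


def dropout(activations, p, rng):
    """Switch each activation off (set it to 0) independently with probability p.

    Simplification: no rescaling of the surviving activations is applied.
    """
    return [0.0 if rng.random() < p else a for a in activations]


if __name__ == "__main__":
    # In the saturating region the sigmoid gradient is tiny, the ReLU gradient is not.
    print("sigmoid'(10) =", sigmoid_grad(10.0), " relu'(10) =", relu_grad(10.0))
    print(dropout([0.3, 1.2, 0.7, 2.5], p=0.5, rng=random.Random(0)))
```

With n activations there are 2^n possible on/off patterns, which is the ensemble-of-architectures view described above.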
Another view of why DropOut helps is that, since neurons are randomly chosen, they tend to avoid developing co-adaptations among themselves, thereby enabling them to develop meaningful features independent of others.

VGG16
This architecture is from the VGG group, Oxford. It improves on AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layer, respectively) with multiple 3X3 kernel-sized filters one after another. For a given receptive field (the effective area of the input image on which an output depends), multiple stacked smaller kernels are better than one larger kernel, because multiple non-linear layers increase the depth of the network, which enables it to learn more complex features, and at a lower cost.

For example, three 3X3 filters stacked on top of each other with stride 1 have a receptive field of size 7, but the number of parameters involved is 3*(9C^2) = 27C^2, in comparison to 49C^2 parameters for a kernel of size 7. Here, it is assumed that the number of input and output channels of the layers is C. Also, 3X3 kernels help in retaining finer-level properties of the image. The network architecture is given in the table.

You can see that in VGG-D, there are blocks with the same filter size applied multiple times to extract more complex and representative features. This concept of blocks/modules became a common theme in the networks after VGG.

The VGG convolutional layers are followed by 3 fully connected layers. The width of the network starts at a small value of 64 and increases by a factor of 2 after every sub-sampling/pooling layer. It achieves a top-5 accuracy of 92.3 % on ImageNet.

GoogLeNet/Inception:
While VGG achieves a phenomenal accuracy on the ImageNet dataset, its deployment on even the most modest sized GPUs is a problem because of huge computational requirements, both in terms of memory and time. It becomes inefficient due to the large width of its convolutional layers.
For instance, for a convolutional layer with 3X3 kernel size which takes 512 channels as input and outputs 512 channels, the order of calculations is 9X512X512.

In a convolutional operation at one location, every output channel (512 in the example above) is connected to every input channel, and so we call it a dense connection architecture. GoogLeNet builds on the idea that most of the activations in a deep network are either unnecessary (value of zero) or redundant because of correlations between them. Therefore the most efficient architecture of a deep network will have sparse connections between the activations, which implies that not all 512 output channels will have a connection with all the 512 input channels. There are techniques to prune out such connections, which would result in a sparse weight/connection. But kernels for sparse matrix multiplication are not optimized in BLAS or cuBLAS (CUDA for GPU) packages, which renders them even slower than their dense counterparts.

So GoogLeNet devised a module called the inception module that approximates a sparse CNN with a normal dense construction (shown in the figure). Since only a small number of neurons are effective, as mentioned earlier, the width/number of convolutional filters of a particular kernel size is kept small. Also, it uses convolutions of different sizes to capture details at varied scales (5X5, 3X3, 1X1).

Another salient point about the module is that it has a so-called bottleneck layer (1X1 convolutions in the figure). It helps in a massive reduction of the computation requirement, as explained below.

Let us take the first inception module of GoogLeNet as an example, which has 192 channels as input. It has just 128 filters of 3X3 kernel size and 32 filters of 5X5 size. The order of computation for the 5X5 filters is 25X32X192, which can blow up as we go deeper into the network, when the width of the network and the number of 5X5 filters further increase.
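The two cost figures quoted above, 9X512X512 and 25X32X192, can be checked with a one-line formula. This is only a back-of-the-envelope sketch: the helper name `conv_mults_per_location` is made up for illustration, and it counts multiplications at a single spatial location, ignoring bias terms, stride and padding.

```python
def conv_mults_per_location(kernel_size, in_channels, out_channels):
    """Multiplications needed to compute every output channel at one spatial location
    of a dense (fully connected across channels) convolution."""
    return kernel_size * kernel_size * in_channels * out_channels


# Dense 3X3 convolution with 512 input and 512 output channels: 9 X 512 X 512
dense_3x3 = conv_mults_per_location(3, 512, 512)

# 5X5 branch of the first inception module: 32 filters on 192 input channels: 25 X 32 X 192
inception_5x5 = conv_mults_per_location(5, 192, 32)

if __name__ == "__main__":
    print("3X3, 512 -> 512:", dense_3x3)
    print("5X5, 192 -> 32 :", inception_5x5)
```

Because the count is the product of kernel area, input channels and output channels, widening either side of a layer scales the cost linearly, and widening both scales it quadratically.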
In order to avoid this, the inception module uses 1X1 convolutions to reduce the dimension of the input channels before feeding them into the larger sized kernels. So in the first inception module, the input to the module is first fed into 1X1 convolutions with just 16 filters before it is fed into 5X5 convolutions. This reduces the computations to 16X192 + 25X32X16. All these changes allow the network to have a large width and depth.

Another change that GoogLeNet made was to replace the fully-connected layers at the end with a simple global average pooling, which averages out the channel values across the 2D feature map after the last convolutional layer. This drastically reduces the total number of parameters. This can be understood from AlexNet, where FC layers contain approx. 90% of the parameters. The use of a large network width and depth allows GoogLeNet to remove the FC layers without affecting the accuracy. It achieves 93.3% top-5 accuracy on ImageNet and is much faster than VGG.

Residual Networks
As per what we have seen so far, increasing the depth should increase the accuracy of the network, as long as over-fitting is taken care of. But the problem with increased depth is that the signal required to change the weights, which arises at the end of the network by comparing ground truth and prediction, becomes very small at the earlier layers. It essentially means that the earlier layers barely learn at all. This is called the vanishing gradient. The second problem with training deeper networks is that the optimization has to be performed over a huge parameter space, so naively adding layers leads to higher training error. This is called the degradation problem. Residual networks allow training of such deep networks by constructing the network through modules called residual modules, as shown in the figure.
The intuition around why it works can be seen as follows:

Imagine a network A which produces x amount of training error. Construct a network B by adding a few layers on top of A, and put parameter values in those layers in such a way that they do nothing to the outputs from A. Let's call the additional layers C. This would mean the same x amount of training error for the new network. So while training network B, the training error should not be above the training error of A. And since it DOES happen, the only reason is that learning the identity mapping (doing nothing to inputs and just copying them as they are) with the added layers C is not a trivial problem, and the solver does not achieve it. To solve this, the module shown above creates a direct path between the input and output of the module, implying an identity mapping, and the added layers C just need to learn the features on top of the already available input. Since C is learning only the residual, the whole module is called a residual module.

Also, similar to GoogLeNet, it uses a global average pooling followed by the classification layer. Through the changes mentioned, ResNets were trained with network depths as large as 152. ResNet achieves better accuracy than VGGNet and GoogLeNet while being computationally more efficient than VGGNet. ResNet-152 achieves 95.51% top-5 accuracy.

The architecture is similar to VGGNet, consisting mostly of 3X3 filters. Starting from VGGNet, shortcut connections as described above are inserted to form a residual network. This can be seen in the figure, which shows a small snippet of the earlier layers synthesized from VGG-19.

The power of residual networks can be judged from one of the experiments in paper [4]. The plain 34 layer network had higher validation error than the 18 layer plain network. This is where we see the degradation problem. And the same 34 layer network, when converted into a residual network, has much lower training error than the 18 layer residual network.
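The identity-mapping argument above can be made concrete with a toy residual module. The sketch below is an assumption-laden simplification, not the ResNet implementation: `residual_block`, `zero_branch` and `small_branch` are hypothetical names, the "layers C" are reduced to an arbitrary function on a list of numbers, and no convolutions or non-linearities are involved.

```python
def residual_block(x, branch):
    """Toy residual module: output = branch(x) + x.

    `branch` stands in for the added layers C; the `+ x` term is the shortcut path.
    """
    fx = branch(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x):
    """Added layers whose weights are all zero, so they contribute nothing."""
    return [0.0 for _ in x]


def small_branch(x):
    """Added layers that learn a small correction on top of the input."""
    return [0.5 * xi for xi in x]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    # With a zero branch the block copies its input: the identity comes for free.
    print(residual_block(x, zero_branch))
    # Otherwise the branch only has to model the residual on top of x.
    print(residual_block(x, small_branch))
```

The design point is that the shortcut turns "do nothing" into the default behaviour: the branch reaches the identity mapping by driving its output to zero rather than by learning to copy its input.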
As we design more and more sophisticated architectures, some of the networks may not stay relevant a few years down the line, but the core principles that led to their design must be understood. Hope this article offered you a good perspective on the design of neural network architectures.
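Deep Reinforcement Learning: Imitation Learning
2017.12.10

The imitation learning covered in this post means learning from given demonstrations. The machine still interacts with the environment during this process, but it does not receive an explicit reward. For some tasks a reward is also hard to define. Take autonomous driving: what is the reward for killing a person, for hitting a car, for hitting a small animal, for hitting X, and so on? Some human-defined rewards may also lead to uncontrollable behavior. For example, if we want an agent to take an exam and set the goal of scoring 100, the agent may resort to cheating to get that 100, which is rather awkward, right? We of course want the agent to learn a skill while still obeying certain rules. Showing it how to do the task and then letting it learn by itself is a better way.

This post covers three methods: 1. behavior cloning, 2. inverse reinforcement learning, 3. GAN-based methods. We introduce them in turn.

I. Behavior Cloning:
Take autonomous driving as an example. First we collect a pile of data, i.e. demos, and then the machine does whatever the human does. This is really just supervised learning: make the action chosen by the agent match the given action.

But this method has a problem: the data you provide is finite and limited, so the agent may not perform well when tested on other data.

One remedy is to add training data, including data the agent would not normally see, i.e. dataset aggregation. By continually adding data, the agent's policy can be improved. This approach may suit some scenarios.

Moreover, the observed data depends on the policy. Supervised learning requires the training data and test data to be independent and identically distributed, but sometimes the two differ, and then things go badly.

So another class of methods appeared: Inverse Reinforcement Learning (also called Inverse Optimal Control or Inverse Optimal Planning).

II. Inverse Reinforcement Learning ("Apprenticeship learning via Inverse Reinforcement Learning", ICML 2004)
As the name suggests, IRL is RL reversed. RL adjusts parameters according to the reward and then obtains a policy. The overall flow looks roughly like this:

IRL is different: it has no explicit reward and can only estimate the reward from human behavior (infer the reward function backwards). Once the reward function has been estimated, the policy is estimated.

Ordinary RL is given a reward function R(t) (the sum of rewards, i.e. the return). Here we review the general RL procedure (using the policy gradient method as an example):

Inverse Reinforcement Learning follows this line of thought instead:

Given an expert (expert policy), inverse RL keeps searching for a reward function that satisfies the given statement (i.e. explaining expert behavior): the expert's return is the largest, hero-level, higher than what any other actor obtains.

IRL is said to be very similar to structured learning:

It does look that way. So let us review what structured learning is:

Comparing IRL and structured learning:

Since we cannot observe the rewards, we can only estimate the reward function, and we do so with parameters w:

So r can be written as the product of w and f(s, a). w is the parameter we want to optimize, and f(s, a) is the feature vector we extract.

So what does the IRL procedure actually look like?

The above is the whole procedure of IRL.

III. GAN for Imitation Learning (Generative Adversarial Imitation Learning, NIPS 2016)
How can a GAN be used for this? Mapped onto this problem, the trajectories we want live in some high-dimensional space, and we assume the trajectories given by the expert belong to a distribution. We want our model to predict a distribution as well and to make the two as close as possible, which completes the training of the actor. The schematic is shown below:

Procedure:
====>> Generator: produces a trajectory,
====>> Discriminator: judges whether a given trajectory was produced by the expert.

Recap: Sentence Generation and Chat-bot

Examples of Recent Study: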
Collaborative Deep Reinforcement Learning for Joint Object Search
CVPR 2017

Motivation:
Traditional bottom-up object region proposal methods extract a large number of proposals, so the subsequent computation depends on strong computing power such as GPUs. When computing resources are insufficient, this limits their range of application. Active search methods (i.e. RL methods) offer a good alternative and can greatly reduce the number of proposals that need to be evaluated.

The paper examines the problem of joint active search for multiple objects under interaction. On the one hand, it is interesting to consider such a collaborative detection "game" played by multiple agents under an RL setting; on the other hand, it seems especially beneficial in the context of visual object localization, where different objects often appear with certain correlation patterns, e.g. a person riding a bicycle, a cup on a table, and so on. Such interacting objects can provide more contextual cues, which have great potential to facilitate more efficient search strategies.

The paper proposes a collaborative multi-agent deep RL algorithm to learn the optimal policy for joint object localization. The proposal follows the existing RL framework but allows multiple agents to collaborate. There are two open problems in this area:
1. how to make communications effective in between different agents;
2. how to jointly learn good policies for all agents.
The paper proposes to learn inter-agent communication through gated cross connections between the Q-networks.

Claimed contributions:
1. it is the first collaborative deep RL algorithm in the object detection field;
2. it proposes a novel multi-agent Q-learning solution that facilitates learnable inter-agent communication with gated cross connections between the Q-networks;
3. the method effectively exploits useful contextual information between related objects and further improves detection performance.

3. Collaborative RL for Joint Object Search
3.1. Single Agent RL Object Localization
The authors first review the usual approach to object detection with a single agent, which is not repeated here.

3.2. Collaborative RL for Joint Object Localization
The paper generalizes the single-agent method to multiple agents. The key concepts are:
--- gated cross connections between different Q-networks;
--- joint exploitation sampling for generating corresponding training data;
--- a virtual agent implementation that facilitates easy adaptation to existing deep Q-learning algorithms.
3.2.1 Q-Networks with Gated Cross Connections
The paper extends the Q-function. The regular Q-function can be written as $Q(s, a; \theta)$, and a Deep Q-network uses a NN to estimate the Q-function. Suppose that each agent i has a Q-network $Q^{(i)}(a^{(i)}, s^{(i)}; \theta^{(i)})$. Then, under the multi-agent RL setting, it is natural to design a Q-function that facilitates inter-agent communication, such as:
where m(i) denotes the message sent out by agent i, and M(-i) denotes the messages received from the other agents. m is
3.2.2 Joint Exploitation Sampling
Heterogeneous Face Attribute Estimation: A Deep Multi-Task Learning Approach
2017.11.28

Introduction:
Face attributes provide a wide range of information in social interaction, including the person's identity, demographic (age, gender, and race), hair style, clothing, etc. More and more applications are built on face attribute recognition, such as (i) video surveillance, (ii) face retrieval, and (iii) social media. Despite the recent progress in attribute recognition, most prior works are limited to predicting a single attribute (e.g. age) or learn a separate model for each attribute. To address these limitations, a number of works have tried to predict multiple attributes jointly (see references 19-23 of the paper). But each of these methods has its shortcomings:
1. The approaches in [19], [20], [22] used the same features for estimating all the attributes without considering the attribute heterogeneity.
2. The sum-product network (SPN) adopted in [21] for modeling attribute correlations may not be feasible because of the exponentially growing number of attribute group combinations.
3. The cascade network in [23] also required learning a separate Support Vector Machine (SVM) classifier for each face attribute, and is not an end-to-end learning approach.

Figure 1 illustrates the correlation and heterogeneity of face attributes. The relation between two attributes is either positive or negative. At the same time, individual attributes can be heterogeneous (in terms of data type, scale, and semantic meaning). Such attribute correlation and heterogeneity should be considered in designing face attribute estimation models.

Proposed Algorithm:
The paper proposes a Deep Multi-Task Learning (DMTL) approach to jointly predict multiple attributes from a single image. The approach is inspired by existing methods, but considers both attribute correlation and attribute heterogeneity in a single network. The proposed DMTL consists of an early stage of shared feature extraction followed by category-specific feature learning for predicting multiple attributes. The shared feature learning naturally exploits the correlation between tasks and yields a more robust and effective feature representation.

Main Contributions:
(i) an efficient multi-task learning (MTL) method for joint estimation of a large number of face attributes;
(ii) modeling both attribute correlation and attribute heterogeneity in a single network;
(iii) studying the generalization ability of the proposed approach under cross-database testing scenarios;
(iv) compiling the LFW+ database with face images in the wild (LFW), and heterogeneous demographic attributes (age, gender, and race) via crowdsourcing.

Proposed Approach:
1. Deep Multi-task Learning:
The goal is to predict multiple face attributes simultaneously with one joint estimation model. While a large number of face attributes poses a challenge to the efficiency of feature learning, it also provides the opportunity of leveraging the attribute inter-correlations to obtain informative and robust feature representation. For example, the attributes in the CelebA dataset are strongly correlated, as shown in the figure below:

Adopting a multi-task framework is therefore very natural. However, because of appearance variations and the heterogeneity of individual attributes, the mapping from the face image space to the attribute space is usually nonlinear. So the joint attribute estimation model should be able to capture complex, compound nonlinear transformations. A CNN model is an effective way to handle both MTL and nonlinear transformation learning, so a CNN-based multi-task framework is chosen for this task:

A traditional DMTL model for joint attribute estimation can be formulated by minimizing the regularized error function:

The model above is the standard recipe of a reconstruction loss plus a regularization term. But this approach is not optimal, because the relations between attributes are not taken into account, while attribute prediction ought to share some features. This is also supported by other work [34]. However, the formulation in Equation 1 does not explicitly emphasize a large portion of feature sharing during MTL. We therefore rewrite the expression above in the following form:

where Wc controls the features shared across face attributes, and Wj controls the update of the shared features. Specifically, as shown in Fig. 2, a face image is first projected to a high-level representation through a shared deep network (Wc) consisting of a cascade of complex non-linear mappings, and then refined by shallow subnetworks ($\{W_j\}_{j=1}^{M}$) towards individual attribute estimation tasks.

Heterogeneous Face Attributes Estimation:
Although the DMTL above uses attribute correlations during feature learning, the attribute heterogeneity still needs to be considered. The heterogeneity of individual face attributes has been raised before but has not received enough attention, for two reasons:
1. many of the public-domain face databases are labeled with a single attribute, so the requirement of designing corresponding models becomes no longer urgent;
2. many of the published methods choose to learn a separate model for each face attribute; model learning for individual attributes does not face the attribute heterogeneity problem.

Each heterogeneous attribute category is treated separately, but the attributes within each category are expected to share feature learning and the classification model. To achieve this, the objective function is rewritten as:

where G is the number of heterogeneous attribute categories. Dividing a large number of attributes into a few heterogeneous categories relies on prior knowledge. Here, face attribute heterogeneities are considered in terms of data type and scale (i.e. ordinal vs. nominal) and semantic meaning (i.e. holistic vs. local), and the category-specific modeling for these heterogeneous attribute categories is then explained.

Nominal vs. ordinal attributes.
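What exactly is the ROI Pooling Layer?

I only knew that Faster R-CNN has ROI pooling, and that many other algorithms also use this layer, e.g. SINT, detection papers and so on. So what exactly is ROI pooling?

Reference: http://blog.csdn.net/lanran2/article/details/60143861

In Faster R-CNN, the RPN produces many candidate proposals. What comes out is the position of each BBox, i.e. the region we are interested in: the region of interest (ROI). ROI pooling operates on these proposals (BBoxes).

==>> The inputs of ROI Pooling are:
1. the feature map that precedes the RPN layer,
2. the BBoxes output by the RPN, with shape 1*5*1*1 (4 coordinates + index);

==>> The output of ROI Pooling is:
a mini-batch of vectors, where the batch size is the number of ROIs and the length of each vector is channel * w * h.

The whole ROI process crops these proposals out of the feature map and produces feature maps of a uniform size.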
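The idea that differently sized proposals end up as feature maps of one fixed size can be sketched as follows. This is a simplified illustration under several assumptions that are not in the note above: a single channel, one ROI at a time, ROI coordinates given as inclusive (x1, y1, x2, y2) indices on the feature map, and max pooling inside each bin. The function name `roi_max_pool` is hypothetical.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region `roi` of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (single channel).
    roi: (x1, y1, x2, y2), inclusive coordinates on the feature map.
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Bin i covers feature-map rows [r0, r1)
        r0 = y1 + int(math.floor(i * roi_h / out_h))
        r1 = y1 + int(math.ceil((i + 1) * roi_h / out_h))
        row = []
        for j in range(out_w):
            # Bin j covers feature-map columns [c0, c1)
            c0 = x1 + int(math.floor(j * roi_w / out_w))
            c1 = x1 + int(math.ceil((j + 1) * roi_w / out_w))
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    fm = [[0, 1, 2, 3],
          [4, 5, 6, 7],
          [8, 9, 10, 11],
          [12, 13, 14, 15]]
    # A 4x4 ROI and a 3x3 ROI both come out as 2x2 grids.
    print(roi_max_pool(fm, (0, 0, 3, 3), 2, 2))
    print(roi_max_pool(fm, (1, 1, 3, 3), 2, 2))
```

With several channels the same pooling would be applied per channel, and flattening the result gives a vector of length channel * w * h per ROI, as stated above.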
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
2017-10-25 16:38:23
[Project Page] https://blog.openai.com/learning-to-cooperate-compete-and-communicate/

4. Method
4.1 Multi-Agent Actor Critic
The framework makes the following assumptions:
(1) the learned policies can only use local information (i.e. their own observations) at execution time,
(2) we do not assume a differentiable model of the environment dynamics, unlike in [24],
(3) we do not assume any particular structure on the communication method between agents (that is, we don't assume a differentiable communication channel).

The model is built on the centralized training with decentralized execution framework, which means: training uses global information, while execution at test time is decentralized.

More concretely, consider a game with N agents. The gradient of the expected return for each agent i can be written as:

Here the Q function is a centralized action-value function that takes as input the actions of all agents, in addition to some state information x, and outputs the Q-value for agent i. In the simplest case, x can consist of the observations of all agents, x = (o1, ... , oN), but we can also include additional state information. Since each Q is learned separately, agents can have arbitrary reward structures, including conflicting rewards in a competitive setting.

The idea above can be extended to deterministic policies. If we consider N continuous policies, the gradient can be written as:

Here the experience replay buffer D contains the tuples (x, x', a1, ... , aN, r1, ... , rN), recording the experiences of all agents. The centralized action-value function Q is updated through the following equation:

4.2 Inferring Policies of Other Agents
To remove the assumption of knowing other agents' policies, as required in Equation (6), each agent i can estimate the true policy of agent j. This estimated policy is learned by maximizing the log probability of agent j's actions, plus an entropy regularizer:

where H is the entropy of the policy distribution. With the estimated policies, y in Equation (6) can be replaced by an estimated value $\hat{y}$:

where $\mu'$ denotes the target network for the estimated policy. Note that Equation (7) can be optimized in a completely online fashion: before updating $Q_i^{\mu}$, the centralized Q function, we take the latest samples of each agent j from the replay buffer to perform a single gradient step to update $\phi^j_i$. Also, in the equations above, the log probabilities of each agent's actions are fed directly into Q, rather than sampling.

4.3 Agents with Policy Ensembles
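Paper notes: Dual Deep Network for Visual Tracking
2017-10-17 21:57:08

Let's first look at the pipeline of the paper.

The three points the authors summarize are:
1. The paper integrates boundary and shape information into the deep network. Low-level and high-level features are combined to obtain a coarse prior map, and an ICA-R model is then used to obtain a more salient object contour, which gives a better likelihood model;
2. The dual network processes two separate network branches, making foreground and background more discriminable;
3. A random and periodic update mechanism is used to handle occlusion and model drift.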
Adam Kosiorek
About Attention in Neural Networks and How to Use It
this blog comes from: http://akosiorek.github.io/ml/2017/10/14/visual-attention.html
Oct 14, 2017

Attention mechanisms in neural networks, otherwise known as neural attention or just attention, have recently attracted a lot of attention (pun intended). In this post, I will try to find a common denominator for different mechanisms and use-cases and I will describe (and implement!) two mechanisms of soft visual attention.

What is Attention?
Informally, a neural attention mechanism equips a neural network with the ability to focus on a subset of its inputs (or features): it selects specific inputs. Let $x \in \mathbb{R}^d$ be an input, $z \in \mathbb{R}^k$ a feature vector, $a \in [0,1]^k$ an attention vector and $f_\phi(x)$ an attention network. Typically, attention is implemented as

$$a = f_\phi(x), \qquad z_a = a \odot z, \tag{1}$$

where $\odot$ is element-wise multiplication. We can talk about soft attention, which multiplies features with a (soft) mask of values between zero and one, or hard attention, when those values are constrained to be exactly zero or one, namely $a \in \{0,1\}^k$. In the latter case, we can use the hard attention mask to directly index the feature vector: $z_a = z[a]$ (in Matlab notation), which changes its dimensionality.

To understand why attention is important, we have to think about what a neural network really is: a function approximator. Its ability to approximate different classes of functions depends on its architecture. A typical neural net is implemented as a chain of matrix multiplications and element-wise non-linearities, where elements of the input or feature vectors interact with each other only by addition.

Attention mechanisms compute a mask which is used to multiply features. This seemingly innocent extension has profound implications: suddenly, the space of functions that can be well approximated by a neural net is vastly expanded, making entirely new use-cases possible. Why?
While I have no proof, the intuition is the following: the theory says that neural networks are universal function approximators and can approximate an arbitrary function to arbitrary precision, but only in the limit of an infinite number of hidden units. In any practical setting, that is not the case: we are limited by the number of hidden units we can use. Consider the following example: we would like to approximate the product of $N$ inputs. A feed-forward neural network can do it only by simulating multiplications with (many) additions (plus non-linearities), and thus it requires a lot of neural-network real estate. If we introduce multiplicative interactions, it becomes simple and compact.

The above definition of attention as multiplicative interactions allows us to consider a broader class of models if we relax the constraints on the values of the attention mask and let $a \in \mathbb{R}^k$. For example, Dynamic Filter Networks (DFN) use a filter-generating network, which computes filters (or weights of arbitrary magnitudes) based on inputs, and applies them to features, which effectively is a multiplicative interaction. The only difference from soft-attention mechanisms is that the attention weights are not constrained to lie between zero and one. Going further in that direction, it would be very interesting to learn which interactions should be additive and which multiplicative, a concept explored in A Differentiable Transition Between Additive and Multiplicative Neurons. The excellent distill blog provides a great overview of soft-attention mechanisms.

Visual Attention
Attention can be applied to any kind of inputs, regardless of their shape. In the case of matrix-valued inputs, such as images, we can talk about visual attention. Let $I \in \mathbb{R}^{H \times W}$ be an image and $g \in \mathbb{R}^{h \times w}$ an attention glimpse, i.e. the result of applying an attention mechanism to the image $I$.

Hard Attention
Hard attention for images has been known for a very long time: image cropping.
It is very easy conceptually, as it only requires indexing. Hard attention can be implemented in Python (or Tensorflow) as

g = I[y:y+h, x:x+w]

The only problem with the above is that it is non-differentiable; to learn the parameters of the model, one must resort to e.g. the score-function estimator, briefly mentioned in my previous post.

Soft Attention
Soft attention, in its simplest variant, is no different for images than for vector-valued features and is implemented exactly as in equation (1). One of the early uses of this type of attention comes from the paper called Show, Attend and Tell: the model learns to attend to specific parts of the image while generating the word describing that part.

This type of soft attention is computationally wasteful, however. The blacked-out parts of the input do not contribute to the results but still need to be processed. It is also over-parametrised: sigmoid activations that implement the attention are independent of each other. It can select multiple objects at once, but in practice we often want to be selective and focus only on a single element of the scene. The two following mechanisms, introduced by DRAW and Spatial Transformer Networks, respectively, solve this issue. They can also resize the input, leading to further potential gains in performance.

Gaussian Attention
Gaussian attention works by exploiting parametrised one-dimensional Gaussian filters to create an image-sized attention map. Let $a_y \in \mathbb{R}^h$ and $a_x \in \mathbb{R}^w$ be attention vectors, which specify which part of the image should be attended to along the $y$ and $x$ axes, respectively. The attention mask can be created as $a = a_y a_x^T$.

In the above figure, the top row shows $a_x$, the column on the right shows $a_y$ and the middle rectangle shows the resulting $a$. Here, for visualisation purposes, the vectors contain only zeros and ones. In practice, they can be implemented as vectors of one-dimensional Gaussians.
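The outer-product construction $a = a_y a_x^T$ can be tried out on plain lists. This is only an illustrative sketch, not the post's Tensorflow code: `gaussian_vector`, `outer` and `apply_mask` are hypothetical helper names, a single Gaussian bump per axis is assumed, and the vector lengths are simply chosen to match the toy image.

```python
import math


def gaussian_vector(length, mu, sigma):
    """1-D attention vector: an unnormalised Gaussian bump centred at index mu."""
    return [math.exp(-0.5 * ((i - mu) / sigma) ** 2) for i in range(length)]


def outer(a_y, a_x):
    """Attention mask a = a_y a_x^T: one row per entry of a_y, one column per entry of a_x."""
    return [[ay * ax for ax in a_x] for ay in a_y]


def apply_mask(image, mask):
    """Soft attention on an image: element-wise product of image and mask."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


if __name__ == "__main__":
    # Binary vectors, as in the visualisation: select rows 1-2 and columns 2-3.
    a_y = [0, 1, 1, 0]
    a_x = [0, 0, 1, 1, 0]
    for row in outer(a_y, a_x):
        print(row)

    # Gaussian vectors give a smooth mask that peaks at (row 1.5, column 2.5).
    soft = outer(gaussian_vector(4, 1.5, 1.0), gaussian_vector(5, 2.5, 1.0))
    image = [[1.0] * 5 for _ in range(4)]
    print(apply_mask(image, soft)[1])
```

Because the mask is a product of two smooth vectors, it is differentiable with respect to the Gaussian parameters, unlike the hard crop by indexing.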
Typically, the number of Gaussians is equal to the spatial dimension and each vector is parametrised by three parameters: centre of the first Gaussian $\mu$, distance between centres of consecutive Gaussians $d$ and the standard deviation of the Gaussians $\sigma$. With this parametrisation, both attention and the glimpse are differentiable with respect to attention parameters, and thus easily learnable.

Attention in the above form is still wasteful, as it selects only a part of the image while blacking-out all the remaining parts. Instead of using the vectors directly, we can cast them into matrices $A_y \in \mathbb{R}^{h \times H}$ and $A_x \in \mathbb{R}^{w \times W}$, respectively. Now, each matrix has one Gaussian per row and the parameter $d$ specifies distance (in column units) between centres of Gaussians in consecutive rows. Glimpse is now implemented as

$$g = A_y I A_x^T.$$

I used this mechanism in HART, my recent paper on biologically-inspired object tracking with RNNs with attention. Here is an example with the input image on the left hand side and the attention glimpse on the right hand side; the glimpse shows the box marked in the main image in green:

The code below lets you create one of the above matrix-valued masks for a mini-batch of samples in Tensorflow. If you want to create $A_y$, you would call it as Ay = gaussian_mask(u, s, d, h, H), where u, s, d are $\mu$, $\sigma$ and $d$, in that order and specified in pixels.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.to_float, keep_dims)


def gaussian_mask(u, s, d, R, C):
    """
    :param u: tf.Tensor, centre of the first Gaussian.
    :param s: tf.Tensor, standard deviation of Gaussians.
    :param d: tf.Tensor, shift between Gaussian centres.
    :param R: int, number of rows in the mask, there is one Gaussian per row.
    :param C: int, number of columns in the mask.
    """
    # indices to create centres
    R = tf.to_float(tf.reshape(tf.range(R), (1, 1, R)))
    C = tf.to_float(tf.reshape(tf.range(C), (1, C, 1)))
    centres = u[np.newaxis, :, np.newaxis] + R * d
    column_centres = C - centres
    mask = tf.exp(-.5 * tf.square(column_centres / s))
    # we add eps for numerical stability
    normalised_mask = mask / (tf.reduce_sum(mask, 1, keep_dims=True) + 1e-8)
    return normalised_mask
```

We can also write a function to directly extract a glimpse from the image:

```python
def gaussian_glimpse(img_tensor, transform_params, crop_size):
    """
    :param img_tensor: tf.Tensor of size (batch_size, Height, Width, channels)
    :param transform_params: tf.Tensor of size (batch_size, 6), where params are
        (mean_y, std_y, d_y, mean_x, std_x, d_x) specified in pixels.
    :param crop_size: tuple of 2 ints, size of the resulting crop
    """
    # parse arguments
    h, w = crop_size
    H, W = img_tensor.shape.as_list()[1:3]
    uy, sy, dy, ux, sx, dx = tf.split(transform_params, 6, -1)
    # create Gaussian masks, one for each axis
    Ay = gaussian_mask(uy, sy, dy, h, H)
    Ax = gaussian_mask(ux, sx, dx, w, W)
    # extract glimpse
    glimpse = tf.matmul(tf.matmul(Ay, img_tensor, adjoint_a=True), Ax)
    return glimpse
```

Spatial Transformer
Spatial Transformer (STN) allows for much more general transformations than just differentiable image-cropping, but image cropping is one of the possible use cases. It is made of two components: a grid generator and a sampler. The grid generator specifies a grid of points to be sampled from, while the sampler, well, samples. The Tensorflow implementation is particularly easy in Sonnet, a recent neural network library from DeepMind.
    import sonnet as snt

    def spatial_transformer(img_tensor, transform_params, crop_size):
        """
        :param img_tensor: tf.Tensor of size (batch_size, Height, Width, channels)
        :param transform_params: tf.Tensor of size (batch_size, 4), where params are
            (scale_y, shift_y, scale_x, shift_x)
        :param crop_size: tuple of 2 ints, size of the resulting crop
        """
        constraints = snt.AffineWarpConstraints.no_shear_2d()
        img_size = img_tensor.shape.as_list()[1:]
        warper = snt.AffineGridWarper(img_size, crop_size, constraints)
        grid_coords = warper(transform_params)
        glimpse = snt.resampler(img_tensor, grid_coords)
        return glimpse

Gaussian Attention vs. Spatial Transformer

Both Gaussian attention and Spatial Transformer can implement a very similar behaviour. How do we choose which to use? There are several nuances:

Gaussian attention is an over-parametrised cropping mechanism: it requires six parameters, but there are only four degrees of freedom (y, x, height, width). STN needs only four parameters.

I haven’t run any tests yet, but STN should be faster. It relies on linear interpolation at sampling points, while the Gaussian attention has to perform two huge matrix multiplications. STN could be an order of magnitude faster (in terms of pixels in the input image).

Gaussian attention should be (no tests run) easier to train. This is because every pixel in the resulting glimpse can be a convex combination of a relatively big patch of pixels of the source image, which (informally) makes it easier to find the cause of any errors. STN, on the other hand, relies on linear interpolation, which means that the gradient at every sampling point is non-zero only with respect to the two nearest pixels along each axis.

Closing Thoughts

Attention mechanisms expand capabilities of neural networks: they allow approximating more complicated functions, or in more intuitive terms, they enable focusing on specific parts of the input.
They have led to performance improvements on natural language benchmarks, as well as to entirely new capabilities such as image captioning, addressing in memory networks and neural programmers. I believe that the most important cases in which attention is useful have not yet been discovered. For example, we know that objects in videos are consistent and coherent, e.g. they do not disappear into thin air between frames. Attention mechanisms can be used to express this consistency prior. How? Stay tuned.

Adam Kosiorek (adamk@robots.ox.ac.uk)
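To make the Gaussian glimpse $g = A_y I A_x^T$ concrete, here is a minimal NumPy sketch for a single greyscale image. It is not the post's Tensorflow code: it drops the batch dimension, stores one Gaussian per row, and the function names and toy values are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_mask_np(u, s, d, R, C, eps=1e-8):
    """One Gaussian per row: row r is centred at u + r * d, with std s, over C columns."""
    rows = np.arange(R)[:, None]          # shape (R, 1)
    cols = np.arange(C)[None, :]          # shape (1, C)
    centres = u + rows * d
    mask = np.exp(-0.5 * ((cols - centres) / s) ** 2)
    # normalise each row so it sums to (almost) one; eps avoids division by zero
    return mask / (mask.sum(axis=1, keepdims=True) + eps)

def gaussian_glimpse_np(img, uy, sy, dy, ux, sx, dx, crop_size):
    """Extract an (h, w) glimpse from an (H, W) image as A_y @ img @ A_x^T."""
    h, w = crop_size
    H, W = img.shape
    Ay = gaussian_mask_np(uy, sy, dy, h, H)   # (h, H)
    Ax = gaussian_mask_np(ux, sx, dx, w, W)   # (w, W)
    return Ay @ img @ Ax.T

img = np.arange(100, dtype=float).reshape(10, 10)
# With a very small standard deviation and unit stride the glimpse approaches a plain crop.
glimpse = gaussian_glimpse_np(img, uy=2, sy=0.1, dy=1, ux=3, sx=0.1, dx=1, crop_size=(4, 5))
```

With a larger standard deviation each glimpse pixel becomes a weighted average over a patch of the source image, and a stride `d` greater than one downsamples the attended region.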
Parallel Tracking and Verifying: A Framework for Real-Time and High Accuracy Visual Tracking

This paper aims to strike a balance between tracking performance and efficiency. It decomposes tracking into two parallel but cooperating parts: one for fast tracking, and one for accurate verification.

The motivation of the paper is mainly:
1. Most of a tracking sequence is smooth and easy, but a few very challenging segments degrade the result and, if handled badly, cause the target to be lost. The paper uses a verifier to handle these critical moments.
2. Multi-threaded computation is already common in computer vision, especially in SLAM. By splitting tracking and mapping into two parallel threads, PTAM (parallel tracking and mapping) [23] provides one of the most popular SLAM frameworks with many important extensions.
3. Recent fast and accurate tracking algorithms provide effective building blocks and encourage us to look for combined solutions (hmm...).

Contributions:
1. We propose to build real-time high accuracy trackers in a novel framework named parallel tracking and verifying (PTAV).
2. The key idea is: while T needs to run on every frame, V does not. As a general framework, PTAV allows coordination between the tracker and the verifier: V checks the tracking results provided by T and sends feedback to T, and T adjusts itself according to the feedback when necessary. By running T and V in parallel, PTAV inherits both the high efficiency of T and the strong discriminative power of V.

[Figure: flowchart of the PTAV framework, showing how the tracker and the verifier cooperate.]

PTAV Implementation:
1. Tracking simply uses the fDSST tracker. The one difference is that the tracker in this paper stores all intermediate results since the last verification request was sent, to ensure fast tracing back.
2. Verifying uses a Siamese network. If the verification score of a tracking result falls below a threshold, V considers the result unreliable, i.e. tracking has failed. V then uses the Siamese network to run a detection: a single forward pass through a region pooling layer yields many candidate samples, and the best one is selected as the detection result. The detection result is then checked once more: its confidence is compared with a threshold, and if it does not meet the requirement, the search region is enlarged and the search is run again.

That completes the algorithm.

Experimental results: impressively, the authors went to the trouble of running on four datasets.
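The cooperation between T and V can be sketched as a simple loop. This is a sequential simplification written for illustration only: the real PTAV runs T and V in parallel threads and lets T trace back through its stored results, and every name below (`ptav_loop`, `track`, `verify`, `detect`, `verify_interval`, `threshold`) is hypothetical rather than taken from the paper's code.

```python
def ptav_loop(frames, track, verify, detect, init_box, verify_interval=10, threshold=0.5):
    """Sequential sketch of tracker/verifier cooperation (assumed interface).

    track(frame, box)  -> new box      (fast, runs on every frame)
    verify(frame, box) -> score        (slow, runs every verify_interval frames)
    detect(frame, box) -> corrected box (used when verification fails)
    """
    results = {}
    box = init_box
    for t, frame in enumerate(frames):
        box = track(frame, box)               # T runs on every frame
        results[t] = box                      # T stores intermediate results
        if t % verify_interval == 0:          # V runs only occasionally
            if verify(frame, box) < threshold:
                box = detect(frame, box)      # V re-detects the target
                results[t] = box              # T resumes from the corrected result
    return [results[t] for t in range(len(frames))]


# Toy 1-D example: the target moves by 2 per frame, the tracker assumes 1 and drifts.
frames = [0, 2, 4, 6, 8, 10]                  # each "frame" is the true position
tracked = ptav_loop(
    frames,
    track=lambda frame, box: box + 1,
    verify=lambda frame, box: 1.0 if abs(frame - box) <= 1 else 0.0,
    detect=lambda frame, box: frame,
    init_box=-1,
    verify_interval=2,
)
print(tracked)
```

In the toy run the tracker drifts between verifications and is pulled back to the true position whenever the verifier rejects its result.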
The care and maintenance of your adviser

Ever since the advent of graduate school, students have complained about their advisers. It is almost an article of faith. The adviser is never available or is too available; gives too much feedback or not enough; is too critical or isn't providing enough direction; and so on. Exchanging horror stories with other students is a great way to bond. But advising goes both ways, and if, after careful reflection on their own studies and progress, students determine that they are not getting the guidance they require, they must address the deficiencies.

It is not surprising that advisers figure large in graduate students' conversations. In 2009, the US Council of Graduate Schools in Washington DC reported survey results showing that 65% of the 1,856 doctoral students who responded identified mentoring or advising as a main factor in PhD completion. Our own research at Flinders University in Adelaide, Australia, and our experience at graduate-student workshops across the world suggest that the adviser–student relationship has a big impact on completion time. It certainly influences whether students are still smiling at the end of their degrees!

Students often assume that once they call someone an adviser, he or she automatically acquires all the skills of advising. After all, if your adviser is the world leader in stem-cell technology, he or she must excel at the seemingly simple task of advising, not to mention possess highly developed interpersonal skills and a keen interest in graduate-student development. Sadly, that is not the case.

Sometimes, advising is a weakness of an otherwise very accomplished scientist. This is not surprising. Mentoring tends to be a private business, and often the only model available is an adviser's own experience of having been advised. If it was good, they decide to copy that style and methodology; if it was bad, they do the opposite.
There is no guarantee that either approach will provide the student with the guidance he or she needs. A proactive approach is necessary. If your adviser isn't looking after you in the way you need, then you need to look after them. At some point in the PhD journey, most graduate students come to an important realization: “This is my thesis. My name is written on the front of it. I need to become the driver.” The sooner the candidate does this, the better. If you're not getting feedback, clear direction or the necessary resources, then you must do something about it. What does this mean in practice? Let us take some examples. Meetings A comment we often hear at our workshops is, “My adviser is lovely but he/she is just so busy that we never get to talk about my thesis”. And our response is, “Yes, your adviser is busy. All advisers are busy and will continue to be busy. Regardless, you need to organize meetings where you can get real face time and talk about your thesis.” We're not recommending a quick chat in the coffee room or a brief word in the lab. Nor do we mean a lab meeting. We mean regularly scheduled meetings focusing on your thesis. You will probably have to schedule them and follow up to make sure that they happen. And when a meeting is cancelled, you will have to reschedule it and persist until it happens. In our experience, just scheduling the meeting isn't enough. You can't assume that your adviser hosts productive meetings or can intuit what you need to know. You need a specific, uncomplicated agenda that could include such action items as what you've done in the past two weeks; feedback on written work; what you'll do in the next two weeks; the next meeting. This all sounds very straightforward. But if more students followed these steps, many adviser–student issues could be resolved. 
Feedback Again, in an ideal world, your adviser would be skilled at providing supportive comments, delicate in pointing out areas for improvement and deft at intuitively knowing the level of feedback you seek. But this is a fantasy. One student described her feedback experience as similar to being a victim in a drive-by shooting — she handed over her work, it was riddled with bullets and she was left with a bloodied mess as the shooter drove off. To be fair, e-mailing a chapter to an adviser and saying “Give me feedback” is like walking into a restaurant and saying “Give me food.” You need to be a bit more specific. When handing over your work, identify the type of feedback you are looking for. You might say, “This is an early draft, so I just want feedback on the overall direction,” or “Please focus on the discussion on page six.” If the feedback you get isn't helpful, ask for more detail. Maintaining your adviser means asking for what you need rather than hoping that he or she will know what to provide. Managing up One of the secrets of looking after your adviser is working out what they want — and what most advisers want is a student who comes to them with suggestions and solutions as well as problems, gets things done and makes the job of advising easier. In business this is called 'managing up'. When we work with graduate students we call it the 'care and maintenance' of your adviser. So although it is natural to complain about your adviser — and can even be cathartic — it is not enough. If your adviser is not giving you what you need, you need to go out and get it. Reference： https://www.nature.com/naturejobs/science/articles/10.1038/nj7331-570a
Basic Mathematics You Should Master

2017-08-17 21:22:40

1. Statistical distance: In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, or two probability distributions or samples, or the distance can be between an individual sample point and a population or a wider sample of points.

2. Pinsker's inequality: In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance (or statistical distance) in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.

3. Total variation distance of probability measures

4. σ-algebra

5. The definition of TV: for two probability measures $P$ and $Q$ on a σ-algebra $\mathcal{F}$, $\delta(P, Q) = \sup_{A \in \mathcal{F}} |P(A) - Q(A)|$.
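As a quick numerical illustration of how these quantities relate, the sketch below computes the total variation distance and the Kullback–Leibler divergence for two small discrete distributions and checks Pinsker's inequality, $\delta(P, Q) \le \sqrt{\tfrac{1}{2} D_{KL}(P \| Q)}$ (with the natural logarithm). For discrete distributions the total variation distance reduces to half the L1 distance. The two example distributions are arbitrary values picked for the demonstration.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions: 0.5 * sum |p_i - q_i|."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

tv = total_variation(p, q)
kl = kl_divergence(p, q)
pinsker_bound = np.sqrt(0.5 * kl)

print(tv, kl, pinsker_bound)
```

Here the total variation distance is 0.1, which stays below the Pinsker bound computed from the KL divergence.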
Ahmet Taspinar

Building Convolutional Neural Networks with Tensorflow

Posted on August 15, 2017 in convolutional neural networks, deep learning, tensorflow

1. Introduction

In the past I have mostly written about ‘classical’ Machine Learning, like Naive Bayes classification, Logistic Regression, and the Perceptron algorithm. In the past year I have also worked with Deep Learning techniques, and I would like to share with you how to make and train a Convolutional Neural Network from scratch, using tensorflow. Later on we can use this knowledge as a building block to make interesting Deep Learning applications.

For this you will need to have tensorflow installed (see installation instructions) and you should also have a basic understanding of Python programming and the theory behind Convolutional Neural Networks. After you have installed tensorflow, you can run the smaller Neural Networks without a GPU, but for the deeper networks you will definitely need some GPU power.

The Internet is full of awesome websites and courses which explain how a convolutional neural network works. Some of them have good visualisations which make it easy to understand. I don’t feel the need to explain the same things again, so before you continue, make sure you understand how a convolutional neural network works. For example:

What is a convolutional layer, and what is the filter of this convolutional layer?
What is an activation layer (ReLU layer (most widely used), sigmoid activation or tanh)?
What is a pooling layer (max pooling / average pooling), dropout?
How does Stochastic Gradient Descent work?
The contents of this blog-post are as follows:

1. Tensorflow basics
   1.1 Constants and Variables
   1.2 Tensorflow Graphs and Sessions
   1.3 Placeholders and feed_dicts
2. Neural Networks in Tensorflow
   2.1 Introduction
   2.2 Loading in the data
   2.3 Creating a (simple) 1-layer Neural Network
   2.4 The many faces of Tensorflow
   2.5 Creating the LeNet5 CNN
   2.6 How the parameters affect the output size of a layer
   2.7 Adjusting the LeNet5 architecture
   2.8 Impact of Learning Rate and Optimizer
3. Deep Neural Networks in Tensorflow
   3.1 AlexNet
   3.2 VGG Net-16
   3.3 AlexNet Performance
4. Final words

1. Tensorflow basics

Here I will give a short introduction to Tensorflow for people who have never worked with it before. If you want to start building Neural Networks immediately, or you are already familiar with Tensorflow, you can go ahead and skip to section 2. If you would like to know more about Tensorflow, you can also have a look at this repository, or the notes of lecture 1 and lecture 2 of Stanford’s CS20SI course.

1.1 Constants and Variables

The most basic units within tensorflow are Constants, Variables and Placeholders. The difference between a tf.constant() and a tf.Variable() should be clear; a constant has a constant value and once you set it, it cannot be changed. The value of a Variable can be changed after it has been set, but the type and shape of the Variable cannot be changed.

    import numpy as np
    import tensorflow as tf

    # We can create constants and variables of different types.
    # However, the different types do not mix well together.
    a = tf.constant(2, tf.int16)
    b = tf.constant(4, tf.float32)
    c = tf.constant(8, tf.float32)

    # Note: the second positional argument of tf.Variable is `trainable`,
    # so the dtype has to be passed by keyword.
    d = tf.Variable(2, dtype=tf.int16)
    e = tf.Variable(4, dtype=tf.float32)
    f = tf.Variable(8, dtype=tf.float32)

    # we can perform computations on variables of the same type:
    e + f
    # but the following cannot be done (int16 and float32 do not mix):
    # d + e

    # everything in tensorflow is a tensor, these can have different dimensions:
    # 0D, 1D, 2D, 3D, 4D, or nD-tensors
    g = tf.constant(np.zeros(shape=(2,2), dtype=np.float32))  # does work
    h = tf.zeros([11], tf.int16)
    i = tf.ones([2,2], tf.float32)
    j = tf.zeros([1000,4,3], tf.float64)
    k = tf.Variable(tf.zeros([2,2], tf.float32))
    l = tf.Variable(tf.zeros([5,6,5], tf.float32))

Besides tf.zeros() and tf.ones(), which create a Tensor initialized to zero or one (see here), there is also the tf.random_normal() function, which creates a tensor filled with values picked randomly from a normal distribution (the default distribution has a mean of 0.0 and stddev of 1.0). There is also the tf.truncated_normal() function, which creates a Tensor with values randomly picked from a normal distribution, where two times the standard deviation forms the lower and upper limit.

With this knowledge, we can already create weight matrices and bias vectors which can be used in a neural network.
    weights = tf.Variable(tf.truncated_normal([256 * 256, 10]))
    biases = tf.Variable(tf.zeros([10]))
    print(weights.get_shape().as_list())
    print(biases.get_shape().as_list())
    >>> [65536, 10]
    >>> [10]

1.2 Tensorflow Graphs and Sessions

In Tensorflow, all of the different Variables and the operations done on these Variables are saved in a Graph. After you have built a Graph which contains all of the computational steps necessary for your model, you can run this Graph within a Session. This Session then distributes all of the computations across the available CPU and GPU resources.

    graph = tf.Graph()
    with graph.as_default():
        a = tf.Variable(8, dtype=tf.float32)
        b = tf.Variable(tf.zeros([2,2], tf.float32))

    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        print(a)
        print(session.run(a))
        print(session.run(b))

    >>> <tf.Variable 'Variable:0' shape=() dtype=float32_ref>
    >>> 8.0
    >>> [[ 0. 0.]
    >>>  [ 0. 0.]]

1.3 Placeholders and feed_dicts

We have seen the various forms in which we can create constants and variables. Tensorflow also has placeholders; these do not require an initial value and only serve to allocate the necessary amount of memory. During a session, these placeholders can be filled in with (external) data with a feed_dict. Below is an example of the usage of a placeholder.

    list_of_points1_ = [[1,2], [3,4], [5,6], [7,8]]
    list_of_points2_ = [[15,16], [13,14], [11,12], [9,10]]
    list_of_points1 = np.array([np.array(elem).reshape(1,2) for elem in list_of_points1_])
    list_of_points2 = np.array([np.array(elem).reshape(1,2) for elem in list_of_points2_])

    graph = tf.Graph()
    with graph.as_default():
        # we should use a tf.placeholder() to create a variable whose value you will fill in later (during session.run()).
        # this can be done by 'feeding' the data into the placeholder.
        # below we see an example of a method which uses two placeholder arrays of shape (1, 2) to calculate the euclidean distance
        point1 = tf.placeholder(tf.float32, shape=(1, 2))
        point2 = tf.placeholder(tf.float32, shape=(1, 2))

        def calculate_eucledian_distance(point1, point2):
            difference = tf.subtract(point1, point2)
            power2 = tf.pow(difference, tf.constant(2.0, shape=(1,2)))
            add = tf.reduce_sum(power2)
            eucledian_distance = tf.sqrt(add)
            return eucledian_distance

        dist = calculate_eucledian_distance(point1, point2)

    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        for ii in range(len(list_of_points1)):
            point1_ = list_of_points1[ii]
            point2_ = list_of_points2[ii]
            feed_dict = {point1 : point1_, point2 : point2_}
            distance = session.run([dist], feed_dict=feed_dict)
            print("the distance between {} and {} -> {}".format(point1_, point2_, distance))

    >>> the distance between [[1 2]] and [[15 16]] -> [19.79899]
    >>> the distance between [[3 4]] and [[13 14]] -> [14.142136]
    >>> the distance between [[5 6]] and [[11 12]] -> [8.485281]
    >>> the distance between [[7 8]] and [[ 9 10]] -> [2.8284271]

2. Neural Networks in Tensorflow

2.1 Introduction

The graph containing the Neural Network (illustrated in the image above) should contain the following steps:

1. The input datasets: the training dataset and labels, the test dataset and labels (and the validation dataset and labels). The test and validation datasets can be placed inside a tf.constant(). The training dataset is placed in a tf.placeholder() so that it can be fed in batches during the training (stochastic gradient descent).
2. The Neural Network model with all of its layers. This can be a simple fully connected neural network consisting of only 1 layer, or a more complicated neural network consisting of 5, 9, 16 etc. layers.
3. The weight matrices and bias vectors, defined in the proper shape and initialized to their initial values. (One weight matrix and bias vector per layer.)
4. The loss value: the model has as output the logit vector (estimated training labels) and by comparing the logits with the actual labels, we can calculate the loss value (with the softmax with cross-entropy function). The loss value is an indication of how close the estimated training labels are to the actual training labels and will be used to update the weight values.
5. An optimizer, which will use the calculated loss value to update the weights and biases with backpropagation.

2.2 Loading in the data

Let’s load the datasets which are going to be used to train and test the Neural Networks. For this we will download the MNIST and the CIFAR-10 datasets. The MNIST dataset contains 60,000 training images and 10,000 test images of handwritten digits, where each image has size 28 x 28 x 1 (grayscale). The CIFAR-10 dataset contains 60,000 colour images (3 channels, size 32 x 32 x 3) of 10 different objects (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Since there are 10 different objects in each dataset, both datasets contain 10 labels.

First, let's define some methods which are convenient for loading and reshaping the data into the necessary format.
    import pickle
    import numpy as np
    import tensorflow as tf
    from mnist import MNIST  # from the python-mnist package

    def randomize(dataset, labels):
        permutation = np.random.permutation(labels.shape[0])
        shuffled_dataset = dataset[permutation, :, :]
        shuffled_labels = labels[permutation]
        return shuffled_dataset, shuffled_labels

    def one_hot_encode(np_array):
        return (np.arange(10) == np_array[:,None]).astype(np.float32)

    def reformat_data(dataset, labels, image_width, image_height, image_depth):
        np_dataset_ = np.array([np.array(image_data).reshape(image_width, image_height, image_depth) for image_data in dataset])
        np_labels_ = one_hot_encode(np.array(labels, dtype=np.float32))
        np_dataset, np_labels = randomize(np_dataset_, np_labels_)
        return np_dataset, np_labels

    def flatten_tf_array(array):
        shape = array.get_shape().as_list()
        return tf.reshape(array, [shape[0], shape[1] * shape[2] * shape[3]])

    def accuracy(predictions, labels):
        return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0])

These are methods for one-hot encoding the labels, loading the data in a randomized array and a method for flattening an array (since a fully connected network needs a flat array as its input).

After we have defined these necessary functions, we can load the MNIST and CIFAR-10 datasets with:

    mnist_folder = './data/mnist/'
    mnist_image_width = 28
    mnist_image_height = 28
    mnist_image_depth = 1
    mnist_num_labels = 10

    mndata = MNIST(mnist_folder)
    mnist_train_dataset_, mnist_train_labels_ = mndata.load_training()
    mnist_test_dataset_, mnist_test_labels_ = mndata.load_testing()

    mnist_train_dataset, mnist_train_labels = reformat_data(mnist_train_dataset_, mnist_train_labels_, mnist_image_width, mnist_image_height, mnist_image_depth)
    mnist_test_dataset, mnist_test_labels = reformat_data(mnist_test_dataset_, mnist_test_labels_, mnist_image_width, mnist_image_height, mnist_image_depth)

    print("There are {} images, each of size {}".format(len(mnist_train_dataset), len(mnist_train_dataset[0])))
    print("Meaning each image has the size of 28*28*1 = {}".format(mnist_image_width*mnist_image_height*mnist_image_depth))
    print("The training set contains the following {} labels: {}".format(len(np.unique(mnist_train_labels_)), np.unique(mnist_train_labels_)))

    print('Training set shape', mnist_train_dataset.shape, mnist_train_labels.shape)
    print('Test set shape', mnist_test_dataset.shape, mnist_test_labels.shape)

    train_dataset_mnist, train_labels_mnist = mnist_train_dataset, mnist_train_labels
    test_dataset_mnist, test_labels_mnist = mnist_test_dataset, mnist_test_labels

    ######################################################################################

    cifar10_folder = './data/cifar10/'
    train_datasets = ['data_batch_1', 'data_batch_2', 'data_batch_3', 'data_batch_4', 'data_batch_5', ]
    test_dataset = ['test_batch']
    c10_image_height = 32
    c10_image_width = 32
    c10_image_depth = 3
    c10_num_labels = 10

    with open(cifar10_folder + test_dataset[0], 'rb') as f0:
        c10_test_dict = pickle.load(f0, encoding='bytes')

    c10_test_dataset, c10_test_labels = c10_test_dict[b'data'], c10_test_dict[b'labels']
    test_dataset_cifar10, test_labels_cifar10 = reformat_data(c10_test_dataset, c10_test_labels, c10_image_width, c10_image_height, c10_image_depth)

    c10_train_dataset, c10_train_labels = [], []
    for train_dataset in train_datasets:
        with open(cifar10_folder + train_dataset, 'rb') as f0:
            c10_train_dict = pickle.load(f0, encoding='bytes')
            c10_train_dataset_, c10_train_labels_ = c10_train_dict[b'data'], c10_train_dict[b'labels']

            c10_train_dataset.append(c10_train_dataset_)
            c10_train_labels += c10_train_labels_

    c10_train_dataset = np.concatenate(c10_train_dataset, axis=0)
    train_dataset_cifar10, train_labels_cifar10 = reformat_data(c10_train_dataset, c10_train_labels, c10_image_width, c10_image_height, c10_image_depth)
    del c10_train_dataset
    del c10_train_labels

    print("The training set contains the following labels: {}".format(np.unique(c10_train_dict[b'labels'])))
    print('Training set shape', train_dataset_cifar10.shape, train_labels_cifar10.shape)
    print('Test set shape', test_dataset_cifar10.shape, test_labels_cifar10.shape)

You can download the MNIST dataset from Yann LeCun’s website. After you have downloaded and unzipped the files, you can load the data with the python-mnist tool. CIFAR-10 can be downloaded from here.

2.3 Creating a (simple) 1-layer Neural Network

The simplest form of a Neural Network is a 1-layer linear Fully Connected Neural Network (FCNN). Mathematically it consists of a matrix multiplication. It is best to start with such a simple NN in tensorflow, and later on look at the more complicated Neural Networks. When we start looking at these more complicated Neural Networks, only the model (step 2) and weights (step 3) parts of the Graph will change and the other steps will remain the same.

We can make such a 1-layer FCNN as follows:

    image_width = mnist_image_width
    image_height = mnist_image_height
    image_depth = mnist_image_depth
    num_labels = mnist_num_labels

    # the dataset
    train_dataset = mnist_train_dataset
    train_labels = mnist_train_labels
    test_dataset = mnist_test_dataset
    test_labels = mnist_test_labels

    # number of iterations and learning rate
    num_steps = 10001
    display_step = 1000
    learning_rate = 0.5

    # batch_size is not defined in the original listing. Assumption: the whole
    # training set is fed at every step, which matches the accuracy being
    # computed against all of train_labels below.
    batch_size = train_dataset.shape[0]

    graph = tf.Graph()
    with graph.as_default():
        # 1) First we put the input data in a tensorflow friendly form.
        tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
        tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
        tf_test_dataset = tf.constant(test_dataset, tf.float32)

        # 2) Then, the weight matrices and bias vectors are initialized.
        # As a default, tf.truncated_normal() is used for the weight matrix and tf.zeros() is used for the bias vector.
        weights = tf.Variable(tf.truncated_normal([image_width * image_height * image_depth, num_labels]))
        bias = tf.Variable(tf.zeros([num_labels]))

        # 3) Define the model:
        # a one-layer fully connected network simply consists of a matrix multiplication
        def model(data, weights, bias):
            return tf.matmul(flatten_tf_array(data), weights) + bias

        logits = model(tf_train_dataset, weights, bias)

        # 4) Calculate the loss, which will be used in the optimization of the weights.
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

        # 5) Choose an optimizer. Many are available.
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

        # 6) The predicted values for the images in the train dataset and test dataset are assigned to the variables train_prediction and test_prediction.
        # This is only necessary if you want to know the accuracy by comparing it with the actual values.
        train_prediction = tf.nn.softmax(logits)
        test_prediction = tf.nn.softmax(model(tf_test_dataset, weights, bias))

    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        print('Initialized')
        for step in range(num_steps):
            # the placeholders have to be fed with the training data
            feed_dict = {tf_train_dataset: train_dataset, tf_train_labels: train_labels}
            _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
            if (step % display_step == 0):
                train_accuracy = accuracy(predictions, train_labels)
                test_accuracy = accuracy(test_prediction.eval(), test_labels)
                message = "step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, accuracy on test set {:02.2f} %".format(step, l, train_accuracy, test_accuracy)
                print(message)

A run of this network (here logged every 100 steps) prints:

    >>> Initialized
    >>> step 0000 : loss is 2349.55, accuracy on training set 10.43 %, accuracy on test set 34.12 %
    >>> step 0100 : loss is 3612.48, accuracy on training set 89.26 %, accuracy on test set 90.15 %
    >>> step 0200 : loss is 2634.40, accuracy on training set 91.10 %, accuracy on test set 91.26 %
    >>> step 0300 : loss is 2109.42, accuracy on training set 91.62 %, accuracy on test set 91.56 %
    >>> step 0400 : loss is 2093.56, accuracy on training set 91.85 %, accuracy on test set 91.67 %
    >>> step 0500 : loss is 2325.58, accuracy on training set 91.83 %, accuracy on test set 91.67 %
    >>> step 0600 : loss is 22140.44, accuracy on training set 68.39 %, accuracy on test set 75.06 %
    >>> step 0700 : loss is 5920.29, accuracy on training set 83.73 %, accuracy on test set 87.76 %
    >>> step 0800 : loss is 9137.66, accuracy on training set 79.72 %, accuracy on test set 83.33 %
    >>> step 0900 : loss is 15949.15, accuracy on training set 69.33 %, accuracy on test set 77.05 %
    >>> step 1000 : loss is 1758.80, accuracy on training set 92.45 %, accuracy on test set 91.79 %

This is all there is to it! Inside the Graph, we load the data, define the weight matrices and the model, calculate the loss value from the logit vector and pass this to the optimizer, which will update the weights for ‘num_steps’ number of iterations.
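To see what the loss and the optimizer do under the hood, here is a minimal NumPy sketch of the same one-layer model trained with softmax cross-entropy and plain gradient descent. It is an illustration under assumptions, not the Tensorflow implementation: the synthetic data, the zero initialisation and all names are chosen for the example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, one_hot_labels):
    """Mean cross-entropy between predicted probabilities and one-hot labels."""
    return float(-np.mean(np.sum(one_hot_labels * np.log(probs + 1e-12), axis=1)))

# Synthetic, linearly separable data: 64 samples, 20 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 20))
true_W = rng.normal(size=(20, 3))
y = np.eye(3)[np.argmax(X @ true_W, axis=1)]   # one-hot labels

W = np.zeros((20, 3))
b = np.zeros(3)
learning_rate = 0.5
losses = []

for step in range(200):
    probs = softmax(X @ W + b)                 # forward pass: logits -> probabilities
    losses.append(cross_entropy(probs, y))
    grad_logits = (probs - y) / len(X)         # gradient of the mean loss w.r.t. the logits
    W -= learning_rate * X.T @ grad_logits     # gradient descent update of the weights
    b -= learning_rate * grad_logits.sum(axis=0)

print("first loss: {:.4f}, last loss: {:.4f}".format(losses[0], losses[-1]))
```

With zero initial weights every class gets probability 1/3, so the first loss equals ln 3, and it decreases from there as the weights are updated.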
In the above fully connected NN, we have used the Gradient Descent Optimizer for optimizing the weights. However, there are many different optimizers available in tensorflow. The most commonly used optimizers are the GradientDescentOptimizer, AdamOptimizer and AdagradOptimizer, so I would suggest starting with these if you're building a CNN. Sebastian Ruder has a nice blog post explaining the differences between the different optimizers, which you can read if you want to know more about them.

2.4 The many faces of Tensorflow

Tensorflow contains many layers, meaning the same operations can be done with different levels of abstraction. To give a simple example, the operation logits = tf.matmul(tf_train_dataset, weights) + biases can also be achieved with logits = tf.nn.xw_plus_b(tf_train_dataset, weights, biases).

This is most visible in the layers API, which is a layer with a high level of abstraction and makes it very easy to create Neural Networks consisting of many different layers. For example, the conv_2d() or the fully_connected() functions create convolutional and fully connected layers. With these functions, the number of layers, filter sizes / depths, type of activation function, etc. can be specified as parameters. The weight and bias matrices are then automatically created, as well as the additional activation functions and dropout regularization layers.
For example, with the layers API, the following lines:

    import tensorflow as tf

    w1 = tf.Variable(tf.truncated_normal([filter_size, filter_size, image_depth, filter_depth], stddev=0.1))
    b1 = tf.Variable(tf.zeros([filter_depth]))

    layer1_conv = tf.nn.conv2d(data, w1, [1, 1, 1, 1], padding='SAME')
    layer1_relu = tf.nn.relu(layer1_conv + b1)
    # pool the activated feature map (the original listing passed layer1_pool to itself)
    layer1_pool = tf.nn.max_pool(layer1_relu, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

can be replaced with

    from tflearn.layers.conv import conv_2d, max_pool_2d

    layer1_conv = conv_2d(data, filter_depth, filter_size, activation='relu')
    layer1_pool = max_pool_2d(layer1_conv, 2, strides=2)

As you can see, we don't need to define the weights, biases or activation functions. Especially when you're building a neural network with many layers, this keeps the code succinct and clean.

However, if you're just starting out with tensorflow and want to learn how to build different kinds of Neural Networks, it is not ideal, since we're letting tflearn do all the work. Therefore we will not use the layers API in this blog-post, but I do recommend using it once you have a full understanding of how a neural network should be built in tensorflow.

2.5 Creating the LeNet5 CNN

Let's start with building a Neural Network with more layers, for example the LeNet5 Convolutional Neural Network.

The LeNet5 CNN architecture was thought of by Yann LeCun as early as 1998 (see paper). It is one of the earliest CNNs (maybe even the first?) and was specifically designed to classify handwritten digits. Although it performs well on the MNIST dataset, which consists of grayscale images of size 28 x 28, the performance drops on other datasets with more images, a larger resolution (larger image size) and more classes. For these larger datasets, deeper ConvNets (like AlexNet, VGGNet or ResNet) will perform better.

But since the LeNet5 architecture only consists of 5 layers, it is a good starting point for learning how to build CNNs.
The LeNet5 architecture looks as follows:

As we can see, it consists of 5 layers:

layer 1: a convolutional layer, with a sigmoid activation function, followed by an average pooling layer.
layer 2: a convolutional layer, with a sigmoid activation function, followed by an average pooling layer.
layer 3: a fully connected network (sigmoid activation)
layer 4: a fully connected network (sigmoid activation)
layer 5: the output layer

This means that we need to create 5 weight and bias matrices, and our model will consist of 12 lines of code (5 layers + 2 pooling + 4 activation functions + 1 flatten layer). Since this is quite a bit of code, it is best to define these in a separate function outside of the graph.

    LENET5_BATCH_SIZE = 32
    LENET5_PATCH_SIZE = 5
    LENET5_PATCH_DEPTH_1 = 6
    LENET5_PATCH_DEPTH_2 = 16
    LENET5_NUM_HIDDEN_1 = 120
    LENET5_NUM_HIDDEN_2 = 84

    def variables_lenet5(patch_size = LENET5_PATCH_SIZE, patch_depth1 = LENET5_PATCH_DEPTH_1,
                         patch_depth2 = LENET5_PATCH_DEPTH_2,
                         num_hidden1 = LENET5_NUM_HIDDEN_1, num_hidden2 = LENET5_NUM_HIDDEN_2,
                         image_depth = 1, num_labels = 10):
        w1 = tf.Variable(tf.truncated_normal([patch_size, patch_size, image_depth, patch_depth1], stddev=0.1))
        b1 = tf.Variable(tf.zeros([patch_depth1]))

        w2 = tf.Variable(tf.truncated_normal([patch_size, patch_size, patch_depth1, patch_depth2], stddev=0.1))
        b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))

        # after two pooling steps and one 'VALID' convolution the 28 x 28 image is 5 x 5
        w3 = tf.Variable(tf.truncated_normal([5*5*patch_depth2, num_hidden1], stddev=0.1))
        b3 = tf.Variable(tf.constant(1.0, shape=[num_hidden1]))

        w4 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
        b4 = tf.Variable(tf.constant(1.0, shape=[num_hidden2]))

        w5 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
        b5 = tf.Variable(tf.constant(1.0, shape=[num_labels]))
        variables = {
            'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5,
            'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5
        }
        return variables

    def model_lenet5(data, variables):
        layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
        layer1_actv = tf.sigmoid(layer1_conv + variables['b1'])
        layer1_pool = tf.nn.avg_pool(layer1_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

        layer2_conv = tf.nn.conv2d(layer1_pool, variables['w2'], [1, 1, 1, 1], padding='VALID')
        layer2_actv = tf.sigmoid(layer2_conv + variables['b2'])
        layer2_pool = tf.nn.avg_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

        flat_layer = flatten_tf_array(layer2_pool)
        layer3_fccd = tf.matmul(flat_layer, variables['w3']) + variables['b3']
        layer3_actv = tf.nn.sigmoid(layer3_fccd)

        layer4_fccd = tf.matmul(layer3_actv, variables['w4']) + variables['b4']
        layer4_actv = tf.nn.sigmoid(layer4_fccd)
        logits = tf.matmul(layer4_actv, variables['w5']) + variables['b5']
        return logits

With the variables and the model defined separately, we can adjust the graph a little bit so that it uses these weights and this model instead of the previous Fully Connected NN:

    #parameters determining the model size
    image_size = mnist_image_size
    num_labels = mnist_num_labels

    #the datasets
    train_dataset = mnist_train_dataset
    train_labels = mnist_train_labels
    test_dataset = mnist_test_dataset
    test_labels = mnist_test_labels

    #number of iterations and learning rate
    num_steps = 10001
    display_step = 1000
    learning_rate = 0.001

    graph = tf.Graph()
    with graph.as_default():
        #1) First we put the input data in a tensorflow friendly form.
        tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
        tf_train_labels = tf.placeholder(tf.float32, shape = (batch_size, num_labels))
        tf_test_dataset = tf.constant(test_dataset, tf.float32)

        #2) Then, the weight matrices and bias vectors are initialized
        variables = variables_lenet5(image_depth = image_depth, num_labels = num_labels)

        #3. The model used to calculate the logits (predicted labels)
        model = model_lenet5
        logits = model(tf_train_dataset, variables)

        #4. then we compute the softmax cross entropy between the logits and the (actual) labels
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

        #5. The optimizer is used to calculate the gradients of the loss function
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

        # Predictions for the training, validation, and test data.
        train_prediction = tf.nn.softmax(logits)
        test_prediction = tf.nn.softmax(model(tf_test_dataset, variables))

    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        print('Initialized with learning_rate', learning_rate)
        for step in range(num_steps):
            #Since we are using stochastic gradient descent, we are selecting small batches from the training dataset,
            #and training the convolutional neural network each time with a batch.
            offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
            batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
            batch_labels = train_labels[offset:(offset + batch_size), :]
            feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
            _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)

            if step % display_step == 0:
                train_accuracy = accuracy(predictions, batch_labels)
                test_accuracy = accuracy(test_prediction.eval(), test_labels)
                message = "step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, accuracy on test set {:02.2f} %".format(step, l, train_accuracy, test_accuracy)
                print(message)

    >>> Initialized with learning_rate 0.1
    >>> step 0000 : loss is 002.49, accuracy on training set 3.12 %, accuracy on test set 10.09 %
    >>> step 1000 : loss is 002.29, accuracy on training set 21.88 %, accuracy on test set 9.58 %
    >>> step 2000 : loss is 000.73, accuracy on training set 75.00 %, accuracy on test set 78.20 %
    >>> step 3000 : loss is 000.41, accuracy on training set 81.25 %, accuracy on test set 86.87 %
    >>> step 4000 : loss is 000.26, accuracy on training set 93.75 %, accuracy on test set 90.49 %
    >>> step 5000 : loss is 000.28, accuracy on training set 87.50 %, accuracy on test set 92.79 %
    >>> step 6000 : loss is 000.23, accuracy on training set 96.88 %, accuracy on test set 93.64 %
    >>> step 7000 : loss is 000.18, accuracy on training set 90.62 %, accuracy on test set 95.14 %
    >>> step 8000 : loss is 000.14, accuracy on training set 96.88 %, accuracy on test set 95.80 %
    >>> step 9000 : loss is 000.35, accuracy on training set 90.62 %, accuracy on test set 96.33 %
    >>> step 10000 : loss is 000.12, accuracy on training set 93.75 %, accuracy on test set 96.76 %

As we can see, the LeNet5 architecture performs better on the MNIST dataset than a simple fully connected NN.
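The offset expression in the training loop above decides which slice of the training set each step sees. As a rough illustration in plain Python (no tensorflow needed), the sketch below reproduces only that arithmetic. The helper name batch_offsets and the toy numbers are assumptions made up for this example and are not part of the original code.

```python
def batch_offsets(num_examples, batch_size, num_steps):
    """Yield the start index of the mini-batch used at each training step.

    Mirrors: offset = (step * batch_size) % (num_examples - batch_size)
    """
    span = num_examples - batch_size  # offsets always stay below this value
    for step in range(num_steps):
        yield (step * batch_size) % span


# Toy example: 100 training examples, batches of 32, first 5 steps.
offsets = list(batch_offsets(100, 32, 5))
print(offsets)

# Every batch [offset, offset + batch_size) fits inside the dataset.
print(all(offset + 32 <= 100 for offset in batch_offsets(100, 32, 50)))
```

Because of the modulo, the loop wraps around to the start of the dataset once it reaches the end, so every step gets a full batch of batch_size examples.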
2.6 How the parameters affect the output size of a layer

In general, deeper Neural Networks tend to perform better. We can add more layers, change activation functions and pooling layers, change the learning rate and see how each step affects the performance. Since the input of layer $L$ is the output of layer $L-1$, we need to know how the output size of layer $L-1$ is affected by its different parameters.

To understand this, let's have a look at the conv2d() function. It has four parameters:

The input image, a 4-D Tensor with dimensions [batch size, image_width, image_height, image_depth]
A weight matrix, a 4-D Tensor with dimensions [filter_size, filter_size, image_depth, filter_depth]
The stride in each dimension.
Padding (= 'SAME' / 'VALID')

These four parameters determine the size of the output image.

The first two parameters are the 4-D Tensor containing the batch of input images and the 4-D Tensor containing the weights of the convolutional filter.

The third parameter is the stride of the convolution, i.e. how many positions the convolutional filter should skip in each of the four dimensions. The first of these 4 dimensions indicates the image number in the batch of images, and since we don't want to skip over any image, this will always be 1. The last dimension indicates the image depth (number of color channels; 1 for grayscale and 3 for RGB), and since we don't want to skip over any color channels, this is also always 1. The second and third dimensions indicate the stride in the X and Y direction (image width and height). If we want to apply a stride, these are the dimensions in which the filter should skip positions. So for a stride of 1 we have to set the stride parameter to [1, 1, 1, 1], and for a stride of 2 we set it to [1, 2, 2, 1], etc.

The last parameter indicates whether or not tensorflow should zero-pad the image, so that the output size does not change for a stride of 1.
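To make the effect of stride and padding concrete, here is a minimal plain-Python sketch of the size rule that tensorflow applies for 'SAME' and 'VALID' padding (output = ceil(W / S) and ceil((W - K + 1) / S) respectively). The function name conv_output_size is an assumption chosen for this example; it is not a tensorflow function. The second half traces the LeNet5 model from section 2.5 layer by layer, which shows where the 5*5*patch_depth2 in its w3 comes from.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Output width (or height) of a conv / pooling layer, per tensorflow's padding rules."""
    if padding == 'SAME':
        return math.ceil(size / stride)
    if padding == 'VALID':
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


# 28 x 28 image, 5 x 5 filter, strides 1 to 4: (SAME size, VALID size) per stride
table = {
    stride: (conv_output_size(28, 5, stride, 'SAME'),
             conv_output_size(28, 5, stride, 'VALID'))
    for stride in range(1, 5)
}
for stride, (same, valid) in table.items():
    print("stride {}: SAME -> {} x {}, VALID -> {} x {}".format(stride, same, same, valid, valid))

# Trace of the LeNet5 model from section 2.5 on a 28 x 28 MNIST image
size = 28
size = conv_output_size(size, 5, 1, 'SAME')    # layer1_conv
size = conv_output_size(size, 2, 2, 'SAME')    # layer1_pool
size = conv_output_size(size, 5, 1, 'VALID')   # layer2_conv
size = conv_output_size(size, 2, 2, 'SAME')    # layer2_pool
lenet5_flat = size * size * 16                 # 16 = LENET5_PATCH_DEPTH_2
print("flattened LeNet5 feature vector:", lenet5_flat)
```

The pooling layers are treated as windows of size 2 with stride 2, which is how they are configured in the models of this post.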
With padding = 'SAME' the image does get zero-padded (and the output size does not change), with padding = 'VALID' it does not.

Below we can see two examples of a convolutional filter (with filter size 5 x 5) scanning through an image (of size 28 x 28). On the left the padding parameter is set to 'SAME', the image is zero-padded and the last 4 rows / columns are included in the output image. On the right padding is set to 'VALID', the image does not get zero-padded and the last 4 rows / columns are not included.

[GIF: a 5 x 5 filter scanning a 28 x 28 image, with 'SAME' padding on the left and 'VALID' padding on the right]

As we can see, without zero-padding the last four cells are not included, because the convolutional filter has reached the end of the (non-zero-padded) image. This means that, for an input size of 28 x 28, the output size becomes 24 x 24. If padding = 'SAME', the output size is 28 x 28.

This becomes clearer if we write down the positions of the filter on the image while it is scanning through the image (for simplicity, only in the X-direction). With a stride of 1, the X-positions covered by the 5-wide filter are 0-4, 1-5, 2-6, etc. If the stride is 2, the X-positions are 0-4, 2-6, 4-8, etc.

If we do this for an image size of 28 x 28, a filter size of 5 x 5 and strides 1 to 4, we will get the following table:

[Table: filter positions and output sizes for strides 1 to 4, with and without zero-padding]

As you can see, for a stride of 1 and zero-padding the output image size is 28 x 28. Without zero-padding the output image size becomes 24 x 24. For a filter with a stride of 2, these numbers are 14 x 14 and 12 x 12, and for a filter with a stride of 3 they are 10 x 10 and 8 x 8, etc.

For any arbitrarily chosen stride S, filter size K, image size W and padding size P, the output size will be

$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$

If padding = 'SAME' in tensorflow, the padding P is chosen such that the output size equals $\lceil W / S \rceil$, so it is determined only by the input size W and the stride S.

2.7 Adjusting the LeNet5 architecture

In the original paper, a sigmoid activation function and average pooling were used in the LeNet5 architecture. However, nowadays it is much more common to use a relu activation function.
So let's change the LeNet5 CNN a little bit to see if we can improve its accuracy. We will call this the LeNet5-like Architecture:

    LENET5_LIKE_BATCH_SIZE = 32
    LENET5_LIKE_FILTER_SIZE = 5
    LENET5_LIKE_FILTER_DEPTH = 16
    LENET5_LIKE_NUM_HIDDEN = 120

    def variables_lenet5_like(filter_size = LENET5_LIKE_FILTER_SIZE,
                              filter_depth = LENET5_LIKE_FILTER_DEPTH,
                              num_hidden = LENET5_LIKE_NUM_HIDDEN,
                              image_width = 28, image_depth = 1, num_labels = 10):
        w1 = tf.Variable(tf.truncated_normal([filter_size, filter_size, image_depth, filter_depth], stddev=0.1))
        b1 = tf.Variable(tf.zeros([filter_depth]))

        w2 = tf.Variable(tf.truncated_normal([filter_size, filter_size, filter_depth, filter_depth], stddev=0.1))
        b2 = tf.Variable(tf.constant(1.0, shape=[filter_depth]))

        # two pooling layers with stride 2 reduce the image width by a factor of 4
        w3 = tf.Variable(tf.truncated_normal([(image_width // 4)*(image_width // 4)*filter_depth , num_hidden], stddev=0.1))
        b3 = tf.Variable(tf.constant(1.0, shape=[num_hidden]))

        w4 = tf.Variable(tf.truncated_normal([num_hidden, num_hidden], stddev=0.1))
        b4 = tf.Variable(tf.constant(1.0, shape=[num_hidden]))

        w5 = tf.Variable(tf.truncated_normal([num_hidden, num_labels], stddev=0.1))
        b5 = tf.Variable(tf.constant(1.0, shape=[num_labels]))
        variables = {
            'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5,
            'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5
        }
        return variables

    def model_lenet5_like(data, variables):
        layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
        layer1_actv = tf.nn.relu(layer1_conv + variables['b1'])
        layer1_pool = tf.nn.avg_pool(layer1_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

        layer2_conv = tf.nn.conv2d(layer1_pool, variables['w2'], [1, 1, 1, 1], padding='SAME')
        layer2_actv = tf.nn.relu(layer2_conv + variables['b2'])
        layer2_pool = tf.nn.avg_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

        flat_layer = flatten_tf_array(layer2_pool)
        layer3_fccd = tf.matmul(flat_layer, variables['w3']) + variables['b3']
        layer3_actv = tf.nn.relu(layer3_fccd)
        #layer3_drop = tf.nn.dropout(layer3_actv, 0.5)

        layer4_fccd = tf.matmul(layer3_actv, variables['w4']) + variables['b4']
        layer4_actv = tf.nn.relu(layer4_fccd)
        #layer4_drop = tf.nn.dropout(layer4_actv, 0.5)

        logits = tf.matmul(layer4_actv, variables['w5']) + variables['b5']
        return logits

The main difference is that we are using a relu activation function instead of a sigmoid activation. In addition, the second convolutional layer now uses 'SAME' instead of 'VALID' padding, and both convolutional layers have the same filter depth.

Besides the activation function, we can also change the optimizer to see what effect the different optimizers have on the accuracy.

2.8 Impact of Learning Rate and Optimizer

Let's see how these CNNs perform on the MNIST and CIFAR-10 datasets.

In the figures above, the accuracy on the test set is given as a function of the number of iterations: on the left for the one-layer fully connected NN, in the middle for the LeNet5 NN and on the right for the LeNet5-like NN.

As we can see, the LeNet5 CNN works pretty well for the MNIST dataset. This should not be a big surprise, since it was specifically designed to classify handwritten digits. The MNIST dataset is quite small and does not provide a big challenge, so even a one-layer fully connected network performs quite well.

On the CIFAR-10 dataset, however, the performance of the LeNet5 NN drops significantly, to accuracy values around 40%.

To increase the accuracy, we can change the optimizer, or fine-tune the Neural Network by applying regularization or learning rate decay.

As we can see, the AdagradOptimizer, AdamOptimizer and the RMSPropOptimizer perform better than the GradientDescentOptimizer. These are adaptive optimizers, which in general perform better than the (simple) GradientDescentOptimizer but need more computational power.

With L2-regularization or exponential learning rate decay we can probably gain a bit more accuracy, but for much better results we need to go deeper.

3. Deep Neural Networks in Tensorflow

So far we have seen the LeNet5 CNN architecture. LeNet5 contains two convolutional layers followed by fully connected layers and could therefore be called a shallow Neural Network. At that time (in 1998) GPUs were not used for these kinds of computations, and CPUs were not that powerful either, so for that time the two convolutional layers were already quite innovative.

Later on, many other types of Convolutional Neural Networks have been designed, most of them much deeper [2]. There is the famous AlexNet architecture (2012) by Alex Krizhevsky et al., the 7-layered ZF Net (2013), and the 16-layered VGGNet (2014). In 2015 Google came with a 22-layered CNN with an inception module (GoogLeNet), and Microsoft Research Asia created the 152-layered CNN called ResNet.

Now, with the things we have learned so far, let's see how we can create the AlexNet and VGGNet16 architectures in Tensorflow.

3.1 AlexNet

Although LeNet5 was one of the first ConvNets, it is considered to be a shallow neural network. It performs well on the MNIST dataset, which consists of grayscale images of size 28 x 28, but the performance drops when we're trying to classify larger images, with a higher resolution and more classes.

The first Deep CNN came out in 2012 and is called AlexNet after its creators Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Compared to the most recent architectures AlexNet can be considered simple, but at that time it was really successful. It won the ImageNet competition with an incredible test error rate of 15.4% (while the runner-up had an error of 26.2%) and started a revolution (also see this video) in the world of Deep Learning and AI.

It consists of 5 convolutional layers (with relu activation), 3 max pooling layers, 3 fully connected layers and 2 dropout layers.
The overall architecture looks as follows:

layer 0: input image of size 224 x 224 x 3
layer 1: A convolutional layer with 96 filters (filter_depth_1 = 96) of size 11 x 11 (filter_size_1 = 11) and a stride of 4. It has a relu activation function. This is followed by max pooling and local response normalization layers.
layer 2: A convolutional layer with 256 filters (filter_depth_2 = 256) of size 5 x 5 (filter_size_2 = 5) and a stride of 1. It has a relu activation function. This layer is also followed by max pooling and local response normalization layers.
layer 3: A convolutional layer with 384 filters (filter_depth_3 = 384) of size 3 x 3 (filter_size_3 = 3) and a stride of 1. It has a relu activation function.
layer 4: Same as layer 3.
layer 5: A convolutional layer with 256 filters (filter_depth_4 = 256) of size 3 x 3 (filter_size_4 = 3) and a stride of 1. It has a relu activation function.
layer 6-8: These convolutional layers are followed by fully connected layers with 4096 neurons each. In the original paper they are classifying a dataset with 1000 classes, but we will use the oxflower17 dataset, which has 17 different classes (of flowers).

Note that this CNN (or other deep CNNs) cannot be used on the MNIST or the CIFAR-10 dataset, because the images in these datasets are too small. As we have seen before, a pooling layer (or a convolutional layer with a stride of 2) reduces the image size by a factor of 2. AlexNet has 3 max pooling layers and one convolutional layer with a stride of 4. This means that the original image size gets reduced by a factor of $2^5 = 32$. The 28 x 28 images in the MNIST dataset would simply get reduced to a size smaller than 1 pixel.

Therefore we need to load a dataset with larger images, preferably 224 x 224 x 3 (as the original paper indicates).
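The reduction factor can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, mirroring the integer division that the AlexNet weight shapes in this post rely on (image_width // 2**no_reductions); the names reduced_size and ALEXNET_STRIDES are assumptions introduced for this illustration.

```python
def reduced_size(image_size, strides):
    """Divide the image size by every stride in turn, using integer division."""
    for stride in strides:
        image_size //= stride
    return image_size


# one convolution with a stride of 4, followed by three max pooling layers with a stride of 2
ALEXNET_STRIDES = [4, 2, 2, 2]

total_factor = 1
for stride in ALEXNET_STRIDES:
    total_factor *= stride
print("total reduction factor:", total_factor)

for name, size in [("oxflower17", 224), ("CIFAR-10", 32), ("MNIST", 28)]:
    print("{}: {} -> {}".format(name, size, reduced_size(size, ALEXNET_STRIDES)))
```

For MNIST the result is 0, so the first fully connected weight matrix would end up with zero rows, while a 224 x 224 image still leaves a 7 x 7 feature map.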
The 17 category flower dataset, aka the oxflower17 dataset, is ideal since it contains images of exactly this size:

    ox17_image_width = 224
    ox17_image_height = 224
    ox17_image_depth = 3
    ox17_num_labels = 17

    import tflearn.datasets.oxflower17 as oxflower17
    train_dataset_, train_labels_ = oxflower17.load_data(one_hot=True)
    # the first 1000 images are used for training, the remainder for testing
    train_dataset_ox17, train_labels_ox17 = train_dataset_[:1000,:,:,:], train_labels_[:1000,:]
    test_dataset_ox17, test_labels_ox17 = train_dataset_[1000:,:,:,:], train_labels_[1000:,:]

    print('Training set', train_dataset_ox17.shape, train_labels_ox17.shape)
    print('Test set', test_dataset_ox17.shape, test_labels_ox17.shape)

Let's try to create the weight matrices and the different layers present in AlexNet. As we have seen before, we need as many weight matrices and bias vectors as there are layers, and each weight matrix should have a size corresponding to the filter size of the layer it belongs to.

    ALEX_PATCH_DEPTH_1, ALEX_PATCH_DEPTH_2, ALEX_PATCH_DEPTH_3, ALEX_PATCH_DEPTH_4 = 96, 256, 384, 256
    ALEX_PATCH_SIZE_1, ALEX_PATCH_SIZE_2, ALEX_PATCH_SIZE_3, ALEX_PATCH_SIZE_4 = 11, 5, 3, 3
    ALEX_NUM_HIDDEN_1, ALEX_NUM_HIDDEN_2 = 4096, 4096

    def variables_alexnet(patch_size1 = ALEX_PATCH_SIZE_1, patch_size2 = ALEX_PATCH_SIZE_2,
                          patch_size3 = ALEX_PATCH_SIZE_3, patch_size4 = ALEX_PATCH_SIZE_4,
                          patch_depth1 = ALEX_PATCH_DEPTH_1, patch_depth2 = ALEX_PATCH_DEPTH_2,
                          patch_depth3 = ALEX_PATCH_DEPTH_3, patch_depth4 = ALEX_PATCH_DEPTH_4,
                          num_hidden1 = ALEX_NUM_HIDDEN_1, num_hidden2 = ALEX_NUM_HIDDEN_2,
                          image_width = 224, image_height = 224, image_depth = 3, num_labels = 17):
        w1 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, image_depth, patch_depth1], stddev=0.1))
        b1 = tf.Variable(tf.zeros([patch_depth1]))

        w2 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth1, patch_depth2], stddev=0.1))
        b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))

        w3 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth2, patch_depth3], stddev=0.1))
        b3 = tf.Variable(tf.zeros([patch_depth3]))

        w4 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth3], stddev=0.1))
        b4 = tf.Variable(tf.constant(1.0, shape=[patch_depth3]))

        # layer 5 has patch_depth4 (256) filters, as in the architecture description
        # (the original listing reused patch_depth3 here and left patch_depth4 unused)
        w5 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth4], stddev=0.1))
        b5 = tf.Variable(tf.zeros([patch_depth4]))

        pool_reductions = 3
        conv_reductions = 2  # the stride of 4 in the first convolution counts as two halvings
        no_reductions = pool_reductions + conv_reductions
        w6 = tf.Variable(tf.truncated_normal([(image_width // 2**no_reductions)*(image_height // 2**no_reductions)*patch_depth4, num_hidden1], stddev=0.1))
        b6 = tf.Variable(tf.constant(1.0, shape=[num_hidden1]))

        w7 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
        b7 = tf.Variable(tf.constant(1.0, shape=[num_hidden2]))

        w8 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
        b8 = tf.Variable(tf.constant(1.0, shape=[num_labels]))

        variables = {
            'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5, 'w6': w6, 'w7': w7, 'w8': w8,
            'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5, 'b6': b6, 'b7': b7, 'b8': b8
        }
        return variables

    def model_alexnet(data, variables):
        layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 4, 4, 1], padding='SAME')
        layer1_relu = tf.nn.relu(layer1_conv + variables['b1'])
        layer1_pool = tf.nn.max_pool(layer1_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
        layer1_norm = tf.nn.local_response_normalization(layer1_pool)

        layer2_conv = tf.nn.conv2d(layer1_norm, variables['w2'], [1, 1, 1, 1], padding='SAME')
        layer2_relu = tf.nn.relu(layer2_conv + variables['b2'])
        layer2_pool = tf.nn.max_pool(layer2_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
        layer2_norm = tf.nn.local_response_normalization(layer2_pool)

        layer3_conv = tf.nn.conv2d(layer2_norm, variables['w3'], [1, 1, 1, 1], padding='SAME')
        layer3_relu = tf.nn.relu(layer3_conv + variables['b3'])

        layer4_conv = tf.nn.conv2d(layer3_relu, variables['w4'], [1, 1, 1, 1], padding='SAME')
        layer4_relu = tf.nn.relu(layer4_conv + variables['b4'])

        layer5_conv = tf.nn.conv2d(layer4_relu, variables['w5'], [1, 1, 1, 1], padding='SAME')
        layer5_relu = tf.nn.relu(layer5_conv + variables['b5'])
        # pool the output of layer 5 (the original listing pooled layer4_relu by mistake)
        layer5_pool = tf.nn.max_pool(layer5_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
        layer5_norm = tf.nn.local_response_normalization(layer5_pool)

        flat_layer = flatten_tf_array(layer5_norm)
        layer6_fccd = tf.matmul(flat_layer, variables['w6']) + variables['b6']
        layer6_tanh = tf.tanh(layer6_fccd)
        layer6_drop = tf.nn.dropout(layer6_tanh, 0.5)

        layer7_fccd = tf.matmul(layer6_drop, variables['w7']) + variables['b7']
        layer7_tanh = tf.tanh(layer7_fccd)
        layer7_drop = tf.nn.dropout(layer7_tanh, 0.5)

        logits = tf.matmul(layer7_drop, variables['w8']) + variables['b8']
        return logits

Now we can modify the CNN model to use the weights and layers of the AlexNet model in order to classify images.

3.2 VGG Net-16

VGG Net was created in 2014 by Karen Simonyan and Andrew Zisserman of the University of Oxford. It contains many more layers (16-19 layers), but each layer is simpler in its design; all of the convolutional layers have filters of size 3 x 3 and a stride of 1, and all max pooling layers have a stride of 2. So it is a deeper CNN, but a simpler one.

It comes in different configurations, with either 16 or 19 layers. The difference between these two configurations is the usage of either 3 or 4 convolutional layers after the second, third and fourth max pooling layer (see below).

The configuration with 16 layers (configuration D) seems to produce the best results, so let's try to create that in tensorflow.
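Before writing out every weight matrix, it can help to see the layout of configuration D at a glance. The plain-Python sketch below lists the filter depth of each convolutional layer, with 'M' marking a max pooling layer. The list notation and the names VGG16_CONFIG and summarize are assumptions made for this illustration; they are not part of tensorflow or of the code in this post.

```python
# Filter depth of every 3 x 3 convolutional layer; 'M' stands for a 2 x 2 max pooling layer with stride 2.
VGG16_CONFIG = [64, 64, 'M',
                128, 128, 'M',
                256, 256, 256, 'M',
                512, 512, 512, 'M',
                512, 512, 512, 'M']


def summarize(config, image_size=224, image_depth=3):
    """Return (number of conv layers, final feature-map width, final depth)."""
    size, depth, conv_layers = image_size, image_depth, 0
    for item in config:
        if item == 'M':
            size //= 2          # each pooling layer halves the width and height
        else:
            depth = item        # 'SAME' convolutions with stride 1 keep the size
            conv_layers += 1
    return conv_layers, size, depth


conv_layers, size, depth = summarize(VGG16_CONFIG)
flat_size = size * size * depth
print("convolutional layers:", conv_layers)
print("weight layers including 3 fully connected layers:", conv_layers + 3)
print("final feature map: {} x {} x {} = {} values".format(size, size, depth, flat_size))
```

The 13 convolutional layers plus 3 fully connected layers give the 16 weight layers in the name, and the flattened size is the number of rows the first fully connected weight matrix needs for a 224 x 224 input.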
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 #The VGGNET Neural Network VGG16_PATCH_SIZE_1, VGG16_PATCH_SIZE_2, VGG16_PATCH_SIZE_3, VGG16_PATCH_SIZE_4 = 3, 3, 3, 3 VGG16_PATCH_DEPTH_1, VGG16_PATCH_DEPTH_2, VGG16_PATCH_DEPTH_3, VGG16_PATCH_DEPTH_4 = 64, 128, 256, 512 VGG16_NUM_HIDDEN_1, VGG16_NUM_HIDDEN_2 = 4096, 1000 def variables_vggnet16(patch_size1 = VGG16_PATCH_SIZE_1, patch_size2 = VGG16_PATCH_SIZE_2, patch_size3 = VGG16_PATCH_SIZE_3, patch_size4 = VGG16_PATCH_SIZE_4, patch_depth1 = VGG16_PATCH_DEPTH_1, patch_depth2 = VGG16_PATCH_DEPTH_2, patch_depth3 = VGG16_PATCH_DEPTH_3, patch_depth4 = VGG16_PATCH_DEPTH_4, num_hidden1 = VGG16_NUM_HIDDEN_1, num_hidden2 = VGG16_NUM_HIDDEN_2, image_width = 224, image_height = 224, image_depth = 3, num_labels = 17): w1 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, image_depth, patch_depth1], stddev=0.1)) b1 = tf.Variable(tf.zeros([patch_depth1])) w2 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, patch_depth1, patch_depth1], stddev=0.1)) b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth1])) w3 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth1, patch_depth2], stddev=0.1)) b3 = tf.Variable(tf.constant(1.0, shape = [patch_depth2])) w4 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth2, patch_depth2], stddev=0.1)) b4 = tf.Variable(tf.constant(1.0, shape = [patch_depth2])) w5 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth2, patch_depth3], stddev=0.1)) b5 = tf.Variable(tf.constant(1.0, shape = [patch_depth3])) w6 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth3, patch_depth3], stddev=0.1)) b6 = tf.Variable(tf.constant(1.0, shape 
= [patch_depth3])) w7 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth3, patch_depth3], stddev=0.1)) b7 = tf.Variable(tf.constant(1.0, shape=[patch_depth3])) w8 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth4], stddev=0.1)) b8 = tf.Variable(tf.constant(1.0, shape = [patch_depth4])) w9 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1)) b9 = tf.Variable(tf.constant(1.0, shape = [patch_depth4])) w10 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1)) b10 = tf.Variable(tf.constant(1.0, shape = [patch_depth4])) w11 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1)) b11 = tf.Variable(tf.constant(1.0, shape = [patch_depth4])) w12 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1)) b12 = tf.Variable(tf.constant(1.0, shape=[patch_depth4])) w13 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1)) b13 = tf.Variable(tf.constant(1.0, shape = [patch_depth4])) no_pooling_layers = 5 w14 = tf.Variable(tf.truncated_normal([(image_width // (2**no_pooling_layers))*(image_height // (2**no_pooling_layers))*patch_depth4 , num_hidden1], stddev=0.1)) b14 = tf.Variable(tf.constant(1.0, shape = [num_hidden1])) w15 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1)) b15 = tf.Variable(tf.constant(1.0, shape = [num_hidden2])) w16 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1)) b16 = tf.Variable(tf.constant(1.0, shape = [num_labels])) variables = { 'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5, 'w6': w6, 'w7': w7, 'w8': w8, 'w9': w9, 'w10': w10, 'w11': w11, 'w12': w12, 'w13': w13, 'w14': w14, 'w15': w15, 'w16': w16, 'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5, 'b6': b6, 'b7': b7, 'b8': b8, 'b9': b9, 'b10': b10, 'b11': b11, 
'b12': b12, 'b13': b13, 'b14': b14, 'b15': b15, 'b16': b16 } return variables def model_vggnet16(data, variables): layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME') layer1_actv = tf.nn.relu(layer1_conv + variables['b1']) layer2_conv = tf.nn.conv2d(layer1_actv, variables['w2'], [1, 1, 1, 1], padding='SAME') layer2_actv = tf.nn.relu(layer2_conv + variables['b2']) layer2_pool = tf.nn.max_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME') layer3_conv = tf.nn.conv2d(layer2_pool, variables['w3'], [1, 1, 1, 1], padding='SAME') layer3_actv = tf.nn.relu(layer3_conv + variables['b3']) layer4_conv = tf.nn.conv2d(layer3_actv, variables['w4'], [1, 1, 1, 1], padding='SAME') layer4_actv = tf.nn.relu(layer4_conv + variables['b4']) layer4_pool = tf.nn.max_pool(layer4_pool, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME') layer5_conv = tf.nn.conv2d(layer4_pool, variables['w5'], [1, 1, 1, 1], padding='SAME') layer5_actv = tf.nn.relu(layer5_conv + variables['b5']) layer6_conv = tf.nn.conv2d(layer5_actv, variables['w6'], [1, 1, 1, 1], padding='SAME') layer6_actv = tf.nn.relu(layer6_conv + variables['b6']) layer7_conv = tf.nn.conv2d(layer6_actv, variables['w7'], [1, 1, 1, 1], padding='SAME') layer7_actv = tf.nn.relu(layer7_conv + variables['b7']) layer7_pool = tf.nn.max_pool(layer7_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME') layer8_conv = tf.nn.conv2d(layer7_pool, variables['w8'], [1, 1, 1, 1], padding='SAME') layer8_actv = tf.nn.relu(layer8_conv + variables['b8']) layer9_conv = tf.nn.conv2d(layer8_actv, variables['w9'], [1, 1, 1, 1], padding='SAME') layer9_actv = tf.nn.relu(layer9_conv + variables['b9']) layer10_conv = tf.nn.conv2d(layer9_actv, variables['w10'], [1, 1, 1, 1], padding='SAME') layer10_actv = tf.nn.relu(layer10_conv + variables['b10']) layer10_pool = tf.nn.max_pool(layer10_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME') layer11_conv = tf.nn.conv2d(layer10_pool, variables['w11'], [1, 1, 1, 1], padding='SAME') layer11_actv = 
tf.nn.relu(layer11_conv + variables['b11']) layer12_conv = tf.nn.conv2d(layer11_actv, variables['w12'], [1, 1, 1, 1], padding='SAME') layer12_actv = tf.nn.relu(layer12_conv + variables['b12']) layer13_conv = tf.nn.conv2d(layer12_actv, variables['w13'], [1, 1, 1, 1], padding='SAME') layer13_actv = tf.nn.relu(layer13_conv + variables['b13']) layer13_pool = tf.nn.max_pool(layer13_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME') flat_layer = flatten_tf_array(layer13_pool) layer14_fccd = tf.matmul(flat_layer, variables['w14']) + variables['b14'] layer14_actv = tf.nn.relu(layer14_fccd) layer14_drop = tf.nn.dropout(layer14_actv, 0.5) layer15_fccd = tf.matmul(layer14_drop, variables['w15']) + variables['b15'] layer15_actv = tf.nn.relu(layer15_fccd) layer15_drop = tf.nn.dropout(layer15_actv, 0.5) logits = tf.matmul(layer15_drop, variables['w16']) + variables['b16'] return logits
3.3 AlexNet Performance
As a comparison, have a look at the LeNet5 CNN performance on the larger oxflower17 dataset:
4. Final Words
The code is also available in my GitHub repository, so feel free to use it on your own dataset(s). There is much more to explore in the world of Deep Learning: Recurrent Neural Networks, Region-Based CNNs, GANs, Reinforcement Learning, etc. In future blog posts I'll build these types of Neural Networks, and also build awesome applications with what we have already learned. So subscribe and stay tuned!
[1] If you feel like you need to refresh your understanding of CNNs, here are some good starting points to get you up to speed: Machine Learning is Fun!; An Intuitive Explanation of Convolutional Neural Networks; CS231n Convolutional Neural Networks for Visual Recognition; Udacity's Deep Learning course; Neural Networks and Deep Learning, Ch. 6.
[2] If you want more information about the theory behind these different Neural Networks, Adit Deshpande's blog post provides a good comparison of them with links to the original papers. 
Eugenio Culurciello has a nice blog and article worth a read. In addition to that, also have a look at this GitHub repository containing awesome deep learning papers, and this GitHub repository where deep learning papers are ordered by task and date.
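In the VGGNet-16 variables above, the first fully connected weight `w14` is sized with `image_width // (2**no_pooling_layers)`, while the model pools with `padding='SAME'` and stride 2, which rounds each halving up rather than down. The sketch below, in plain Python, compares the two; the helper names `same_pool_size` and `floor_div_size` are made up for illustration and are not part of the original code.

```python
import math


def same_pool_size(size, num_pools, stride=2):
    """Spatial size after `num_pools` poolings with 'SAME' padding.

    With 'SAME' padding each pooling maps n -> ceil(n / stride).
    """
    for _ in range(num_pools):
        size = math.ceil(size / stride)
    return size


def floor_div_size(size, num_pools, stride=2):
    """Size implied by the `size // (stride ** num_pools)` shortcut."""
    return size // (stride ** num_pools)


for width in (224, 32, 28):
    # Columns: input width, size after SAME pooling, size from the shortcut.
    print(width, same_pool_size(width, 5), floor_div_size(width, 5))
# 224 7 7
# 32 1 1
# 28 1 0
```

For input sizes divisible by 32 the two agree. For other sizes the shortcut can underestimate, in which case the shape of `w14` would likely not match the flattened tensor.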
1. Install PyTorch version 0.1.11
## Version 0.1.11
## python2.7 and cuda 8.0
pip install http://download.pytorch.org/whl/cu80/torch-0.1.11.post5-cp27-none-linux_x86_64.whl
pip install torchvision
2. What is going on when the following errors occur?
Traceback (most recent call last):
  File "examples/triplet_loss.py", line 221, in <module>
  File "examples/triplet_loss.py", line 150, in main
  File "build/bdist.linux-x86_64/egg/reid/evaluators.py", line 118, in evaluate
  File "build/bdist.linux-x86_64/egg/reid/evaluators.py", line 21, in extract_features
  File "/usr/local/lib/python2.7/dist-packages/torch/utils_v2/data/dataloader.py", line 301, in __iter__
  File "/usr/local/lib/python2.7/dist-packages/torch/utils_v2/data/dataloader.py", line 163, in __init__
  File "/usr/local/lib/python2.7/dist-packages/torch/utils_v2/data/dataloader.py", line 226, in _put_indices
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 390, in put
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/queue.py", line 17, in send
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
  File "/usr/lib/python2.7/pickle.py", line 286, in save
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
  File "/usr/lib/python2.7/pickle.py", line 286, in save
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
  File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
  File "/usr/lib/python2.7/pickle.py", line 286, in save
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
  File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
  File "/usr/lib/python2.7/pickle.py", line 286, in save
  File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
  File "/usr/lib/python2.7/pickle.py", line 286, in save
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 67, in dispatcher
  File "/usr/lib/python2.7/pickle.py", line 401, in save_reduce
  File "/usr/lib/python2.7/pickle.py", line 286, in save
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
  File "/usr/lib/python2.7/pickle.py", line 286, in save
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 66, in dispatcher
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/reductions.py", line 113, in reduce_storage
RuntimeError: unable to open shared memory object </torch_29419_2971992535> in read-write mode at /b/wheel/pytorch-src/torch/lib/TH/THAllocator.c:226
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/util.py", line 274, in _run_finalizers
  File "/usr/lib/python2.7/multiprocessing/util.py", line 207, in __call__
  File "/usr/lib/python2.7/shutil.py", line 239, in rmtree
  File "/usr/lib/python2.7/shutil.py", line 237, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/pymp-QoKm2p'
3. Converting data between GPU and CPU:
(1) CPU ---> GPU: a.cuda()
(2) GPU ---> CPU: a.cpu()
(3) torch.tensor ---> numpy array: a_numpy_style = a.numpy()
(4) numpy array ---> torch.tensor (b shares memory with a, so the in-place add on a is also visible through b):
>>> import numpy as np
>>> import torch
>>> a = np.ones(5)
>>> b = torch.from_numpy(a)
>>> np.add(a, 1, out=a)
array([ 2., 2., 2., 2., 2.])
>>> print(a)
[ 2. 2. 2. 2. 2.]
>>> print(b)

 2
 2
 2
 2
 2
[torch.DoubleTensor of size 5]

>>> c=b.numpy()
>>> c
array([ 2., 2., 2., 2., 2.])
4. Variable and Tensor:
==>> The program raised an error: expected a Variable, but got a FloatTensor.
==>> This can be solved by wrapping the tensor:
from torch.autograd import Variable
hard_neg_differ_ = Variable(hard_neg_differ_)
==>> This turns hard_neg_differ_ into a Variable, so it is no longer a plain FloatTensor.
See this reference: http://blog.csdn.net/shudaqi2010/article/details/54880748. It shows:
>>> import torch
>>> x = torch.Tensor(2,3,4)
>>> x
(0 ,.,.) =
1.00000e-37 *
 2.4168 0.0000 0.0000 0.0000
 0.0000 0.0000 0.0000 0.0000
 0.0000 0.0000 0.0000 0.0000
(1 ,.,.) 
= 1.00000e-37 * 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 [torch.FloatTensor of size 2x3x4] >>> from torch.autograd import Variable >>> x = Variable(x) >>> x Variable containing: (0 ,.,.) = 1.00000e-37 * 2.4168 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 (1 ,.,.) = 1.00000e-37 * 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 [torch.FloatTensor of size 2x3x4] View Code But, you can not directly convert the Variable to numpy() or something else. You can load the values in the Variable and convert to numpy() through: value = varable.data.numpy(). 5. Some Operations about tensor. obtained from blog: http://www.cnblogs.com/huangshiyu13/p/6672828.html ============改变数组的维度================== 已知reshape函数可以有一维数组形成多维数组 ravel函数可以展平数组 b.ravel() flatten()函数也可以实现同样的功能 区别：ravel只提供视图view，而flatten分配内存存储 重塑： 用元祖设置维度 >>> b.shape=(4,2,3) >>> b array(［[ 0, 1, 2], [ 3, 4, 5］, ［ 6, 7, 8], [ 9, 10, 11］, ［12, 13, 14], [15, 16, 17］, ［18, 19, 20], [21, 22, 23］]) 转置： >>> b array(［0, 1], [2, 3］) >>> b.transpose() array(［0, 2], [1, 3］) =============数组的组合============== >>> a array(［0, 1, 2], [3, 4, 5], [6, 7, 8］) >>> b = a*2 >>> b array(［ 0, 2, 4], [ 6, 8, 10], [12, 14, 16］) 1.水平组合 >>> np.hstack((a,b)) array(［ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16］) >>> np.concatenate((a,b),axis=1) array(［ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16］) 2.垂直组合 >>> np.vstack((a,b)) array(［ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16］) >>> np.concatenate((a,b),axis=0) array(［ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16］) 3.深度组合：沿着纵轴方向组合 >>> np.dstack((a,b)) array(［[ 0, 0], [ 1, 2], [ 2, 4］, ［ 3, 6], [ 4, 8], [ 5, 10］, ［ 6, 12], [ 7, 14], [ 8, 16］]) 4.列组合column_stack() 一维数组：按列方向组合 二维数组：同hstack一样 5.行组合row_stack() 以为数组：按行方向组合 二维数组：和vstack一样 6.==用来比较两个数组 >>> a==b array(［ True, False, False], [False, False, False], 
[False, False, False］, dtype=bool) #True那个因为都是0啊 ==================数组的分割=============== >>> a array(［0, 1, 2], [3, 4, 5], [6, 7, 8］) >>> b = a*2 >>> b array(［ 0, 2, 4], [ 6, 8, 10], [12, 14, 16］) 1.水平分割（难道不是垂直分割？？？） >>> np.hsplit(a,3) [array(［0], [3], [6］), array(［1], [4], [7］), array(［2], [5], [8］)] split(a,3,axis=1)同理达到目的 2.垂直分割 >>> np.vsplit(a,3) [array(［0, 1, 2］), array(［3, 4, 5］), array(［6, 7, 8］)] split(a,3,axis=0)同理达到目的 3.深度分割 某三维数组：：： >>> d = np.arange(27).reshape(3,3,3) >>> d array(［[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8］, ［ 9, 10, 11], [12, 13, 14], [15, 16, 17］, ［18, 19, 20], [21, 22, 23], [24, 25, 26］]) 深度分割后（即按照深度的方向分割） 注意：dsplite只对3维以上数组起作用 raise ValueError('dsplit only works on arrays of 3 or more dimensions') ValueError: dsplit only works on arrays of 3 or more dimensions >>> np.dsplit(d,3) [array(［[ 0], [ 3], [ 6］, ［ 9], [12], [15］, ［18], [21], [24］]), array(［[ 1], [ 4], [ 7］, ［10], [13], [16］, ［19], [22], [25］]), array(［[ 2], [ 5], [ 8］, ［11], [14], [17］, ［20], [23], [26］])] ===================数组的属性================= >>> a.shape #数组维度 (3, 3) >>> a.dtype #元素类型 dtype('int32') >>> a.size #数组元素个数 9 >>> a.itemsize #元素占用字节数 4 >>> a.nbytes #整个数组占用存储空间=itemsize*size 36 >>> a.T #转置=transpose array(［0, 3, 6], [1, 4, 7], [2, 5, 8］) 6. image paste using python: im = Image.open('/home/wangxiao/Pictures/9c1147d3gy1fjuyywz23sj20dl09u3yw.jpg') box = (100,100,500,500) region = im.crop(box) im.paste(region,(100,70)) im.show() 7. pytorch save checkpoints torch.save(model.state_dict(), filename) 8. install python3.5 on ubuntu system: sudo add-apt-repository ppa:fkrull/deadsnakes sudo apt-get update sudo apt-get install python3.5 when testing, just type: python3.5 9. load imge to tensor & save tensor data to image files. 
def tensor_load_rgbimage(filename, size=None, scale=None): img = Image.open(filename) if size is not None: img = img.resize((size, size), Image.ANTIALIAS) elif scale is not None: img = img.resize((int(img.size[0] / scale), int(img.size[1] / scale)), Image.ANTIALIAS) img = np.array(img).transpose(2, 0, 1) img = torch.from_numpy(img).float() return img def tensor_save_rgbimage(tensor, filename, cuda=False): if cuda: img = tensor.clone().cpu().clamp(0, 255).numpy() else: img = tensor.clone().clamp(0, 255).numpy() img = img.transpose(1, 2, 0).astype('uint8') img = Image.fromarray(img) img.save(filename) 10. the often used opeartions in pytorch: ########################## save log files ############################################# logfile_path = './log_files_AAE_2017.10.08.16:20.txt' fobj=open(logfile_path,'a') fobj.writelines(['Epoch: %d Niter:%d Loss_VAE: %.4f Loss_D: %.4f Loss_D_noise: %.4f Loss_G: %.4f D(x): %.4f D(G(z)): %.4f / %.4f \n' % (EEEPoch, total_epoch, VAEerr.data[0], errD_noise.data[0], errD.data[0], total_errG.data[0], D_x, D_G_z1, D_G_z2)]) fobj.close() # print('==>> saving txt files ... Done!') ########################### save checkpoints ########################### if epoch%opt.saveInt == 0 and epoch!=0: torch.save(netG.state_dict(), '%s/netG_epoch_%d.pth' % (opt.outf, epoch)) # torch.save(netD.state_dict(), '%s/netD_epoch_%d.pth' % (opt.outf, epoch)) # torch.save(netD_gaussian.state_dict(), '%s/netD_Z_epoch_%d.pth' % (opt.outf, epoch)) # ########################### save middle images into folders ########################### # img_index = EEEPoch + index_batch + epoch # if epoch % 10 == 0: # vutils.save_image(real_cpu, '%s/real_samples.png' % img_index, # normalize=True) # fake = netG.decoder(fixed_noise) # vutils.save_image(fake.data, # '%s/fake_samples_epoch_%03d.png' % (img_index, img_index), # normalize=True) 11. error: RuntimeError: tensors are on different GPUs ==>> this is caused you set data into GPU mode, but not pre-defined model. 12.
Ubuntu boot repair (Boot Repair), from: http://blog.csdn.net/piaocoder/article/details/50589667
1. sudo add-apt-repository ppa:yannubuntu/boot-repair && sudo apt-get update
2. sudo apt-get install -y boot-repair && boot-repair
3. The Boot Repair window then opens: ==>> choose "Recommended repair" and do as the window asks.
4. Sometimes you may also encounter errors at this point. If so, just reboot and try the repair from a USB drive instead.
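Item 2 above ends with `OSError: [Errno 24] Too many open files` and does not record a fix. One thing that is sometimes worth checking for this error is the per-process limit on open files. The sketch below uses only Python's standard `resource` module (Unix only). The helper name `raise_open_file_limit` and the target of 4096 are assumptions for illustration, and whether this removes the dataloader error in the setup above is not something these notes confirm.

```python
import resource


def raise_open_file_limit(target=4096):
    """Try to raise the soft limit on open files to `target`.

    Returns the (soft, hard) limits in effect afterwards. The soft limit
    is never lowered and never pushed above the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= target:
        return soft, hard  # already at least as high as requested
    new_soft = target if hard == unlimited else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        pass  # the OS refused the change; keep the current limits
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_open_file_limit()
print("open-file limits (soft, hard):", soft, hard)
```

The change only applies to the current process and its children, so it would have to run before the data loader starts its workers. The shell equivalent for inspecting the limit is `ulimit -n`.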
Tutorial on using skip-thought vectors to extract sentence features.
1. Request and download the training dataset. The dataset used for skip-thought vectors is BookCorpus: http://yknzhu.wixsite.com/mbweb. First, send an email to the authors of the paper and ask for the download link. Then download the files and unzip them in the current folder.
2. Download the TensorFlow version of the code. Follow the instructions at: https://github.com/tensorflow/models/tree/master/skip_thoughts
[Attention] Bazel has to be installed, but do not update it afterwards; otherwise the following operations may fail with errors.
3. Encoding sentences:
(1) First, open a terminal and start "ipython".
(2) Enter the following code in the terminal:
ipython # Launch iPython.
In [0]:
# Imports.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import os.path
import scipy.spatial.distance as sd
from skip_thoughts import configuration
from skip_thoughts import encoder_manager
In [1]:
# Set paths to the model.
VOCAB_FILE = "/path/to/vocab.txt"
EMBEDDING_MATRIX_FILE = "/path/to/embeddings.npy"
CHECKPOINT_PATH = "/path/to/model.ckpt-9999"
# The following directory should contain files rt-polarity.neg and
# rt-polarity.pos.
At this point the environment is defined. Next, set up the encoder:
In [2]:
# Set up the encoder. Here we are using a single unidirectional model.
# To use a bidirectional model as well, call load_model() again with
# configuration.model_config(bidirectional_encoder=True) and paths to the
# bidirectional model's files. The encoder will use the concatenation of
# all loaded models.
encoder = encoder_manager.EncoderManager()
encoder.load_model(configuration.model_config(), vocabulary_file=VOCAB_FILE, embedding_matrix_file=EMBEDDING_MATRIX_FILE, checkpoint_path=CHECKPOINT_PATH)
In [3]:
# Define the input sentences (a single example sentence here).
data = [' This is my first attempt to the tensorflow version skip_thought_vectors ... ']
Then it is time to get the 2400-dimensional features.
In [4]:
# Generate Skip-Thought Vectors for each sentence in the dataset.
encodings = encoder.encode(data)
print(encodings)
print(encodings[0])
The printed output is the encoding of each input sentence. You have now obtained the features of the input sentence, and you can load your own texts in the same way to obtain their features.
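The tutorial imports `scipy.spatial.distance` but stops once the encodings are printed. One common way to use such sentence vectors is to compare them by cosine similarity. The sketch below is plain Python and uses short made-up vectors in place of real 2400-dimensional encodings; `cosine_similarity` and `nearest` are illustrative helper names, not part of the skip_thoughts package.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query_index, vectors):
    """Index of the vector most similar to vectors[query_index]."""
    best_index, best_score = None, -2.0  # cosine similarity is >= -1
    for i, vec in enumerate(vectors):
        if i == query_index:
            continue  # skip the query itself
        score = cosine_similarity(vectors[query_index], vec)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Toy stand-ins for encoder.encode(data); real encodings are much longer.
toy_encodings = [
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.4],
    [-1.0, 0.2, 0.0],
]
print(nearest(0, toy_encodings))  # 1
```

Because the helpers only iterate over their arguments, they should also accept the rows of the array returned by `encoder.encode`, although that has not been run against the real model here.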
Agustinus Kristiadi's Blog
Variational Autoencoder: Intuition and Implementation
There are two generative models going head to head in the data generation business right now: Generative Adversarial Nets (GAN) and Variational Autoencoder (VAE). These two models take different approaches to how they are trained. GAN is rooted in game theory: its objective is to find the Nash equilibrium between the discriminator net and the generator net. VAE, on the other hand, is rooted in Bayesian inference, i.e. it wants to model the underlying probability distribution of the data so that it can sample new data from that distribution. In this post, we will look at the intuition behind the VAE model and its implementation in Keras.
VAE: Formulation and Intuition
Suppose we want to generate some data. A good way to do it is first to decide what kind of data we want to generate, then actually generate the data. For example, say we want to generate an animal. First, we imagine the animal: it must have four legs, and it must be able to swim. Having those criteria, we could then actually generate the animal by sampling from the animal kingdom. Lo and behold, we get a platypus!
From the story above, our imagination is analogous to a latent variable. It is often useful to decide the latent variable first in generative models, as the latent variable can describe our data. Without a latent variable, it is as if we just generate data blindly. And this is one difference between GAN and VAE: VAE explicitly models a latent variable and infers it from the data, hence it's an expressive model.
Alright, that fable is great and all, but how do we model that? Well, let's talk about probability distributions. Let's define some notation:
$X$: data that we want to model, a.k.a. the animal
$z$: latent variable, a.k.a. our imagination
$P(X)$: probability distribution of the data, i.e. the animal kingdom
$P(z)$: probability distribution of the latent variable, i.e. 
our brain, the source of our imagination
$P(X|z)$: distribution of generating data given the latent variable, e.g. turning imagination into a real animal
Our objective here is to model the data, hence we want to find $P(X)$. Using the law of total probability, we can express it in terms of $z$ as follows:
$$P(X) = \int P(X|z) P(z) \, dz$$
that is, we marginalize out $z$ from the joint probability distribution $P(X, z)$. Now if only we knew $P(X, z)$, or equivalently, $P(X|z)$ and $P(z)$…
The idea of VAE is to infer $P(z)$ using $P(z|X)$. This makes a lot of sense if we think about it: we want to make our latent variable likely under our data. In terms of our fable, we want to limit our imagination to the animal kingdom domain, so we shouldn't imagine things like root, leaf, tyre, glass, GPU, refrigerator, doormat, … as it's unlikely that those things have anything to do with things that come from the animal kingdom. Right?
But the problem is, we have to infer that distribution $P(z|X)$, as we don't know it yet. In VAE, as its name suggests, we infer $P(z|X)$ using a method called Variational Inference (VI). VI is one of the popular methods in Bayesian inference, the other one being MCMC. The main idea of VI is to pose inference as an optimization problem. How? By modeling the true distribution $P(z|X)$ with a simpler distribution that is easy to evaluate, e.g. a Gaussian, and minimizing the difference between those two distributions using the KL divergence, which tells us how different $P$ and $Q$ are.
Alright, now let's say we want to infer $P(z|X)$ using $Q(z|X)$. 
The KL divergence is then formulated as follows:
$$\begin{aligned} D_{KL}[Q(z|X) \| P(z|X)] &= \sum_z Q(z|X) \log \frac{Q(z|X)}{P(z|X)} \\ &= E\left[\log \frac{Q(z|X)}{P(z|X)}\right] \\ &= E[\log Q(z|X) - \log P(z|X)] \end{aligned}$$
Recall the notation above: there are three things that we haven't used yet, namely $P(X)$, $P(X|z)$, and $P(z)$. But, with Bayes' rule, we can make them appear in the equation:
$$\begin{aligned} D_{KL}[Q(z|X) \| P(z|X)] &= E\left[\log Q(z|X) - \log \frac{P(X|z) P(z)}{P(X)}\right] \\ &= E[\log Q(z|X) - (\log P(X|z) + \log P(z) - \log P(X))] \\ &= E[\log Q(z|X) - \log P(X|z) - \log P(z) + \log P(X)] \end{aligned}$$
Notice that the expectation is over $z$ and $P(X)$ doesn't depend on $z$, so we can move it outside of the expectation.
$$\begin{aligned} D_{KL}[Q(z|X) \| P(z|X)] &= E[\log Q(z|X) - \log P(X|z) - \log P(z)] + \log P(X) \\ D_{KL}[Q(z|X) \| P(z|X)] - \log P(X) &= E[\log Q(z|X) - \log P(X|z) - \log P(z)] \end{aligned}$$
If we look carefully at the right hand side of the equation, we notice that it can be rewritten as another KL divergence. So let's do that by first rearranging the sign.
$$\begin{aligned} D_{KL}[Q(z|X) \| P(z|X)] - \log P(X) &= E[\log Q(z|X) - \log P(X|z) - \log P(z)] \\ \log P(X) - D_{KL}[Q(z|X) \| P(z|X)] &= E[\log P(X|z) - (\log Q(z|X) - \log P(z))] \\ &= E[\log P(X|z)] - E[\log Q(z|X) - \log P(z)] \\ &= E[\log P(X|z)] - D_{KL}[Q(z|X) \| P(z)] \end{aligned}$$
And this is it, the VAE objective function:
$$\log P(X) - D_{KL}[Q(z|X) \| P(z|X)] = E[\log P(X|z)] - D_{KL}[Q(z|X) \| P(z)]$$
At this point, what do we have? Let's enumerate:
$Q(z|X)$, which projects our data $X$ into latent variable space
$z$, the latent variable
$P(X|z)$, which generates data given the latent variable
We might feel familiar with this kind of structure. And guess what, it's the same structure as seen in an Autoencoder! That is, $Q(z|X)$ is the encoder net, $z$ is the encoded representation, and $P(X|z)$ is the decoder net! Well, well, no wonder the name of this model is Variational Autoencoder!
VAE: Dissecting the Objective
It turns out the VAE objective function has a very nice interpretation. That is, we want to model our data, which is described by $\log P(X)$, under some error $D_{KL}[Q(z|X) \| P(z|X)]$. In other words, VAE tries to find a lower bound of $\log P(X)$, which in practice is good enough, as trying to find the exact distribution is often intractable.
That model can then be found by maximizing over some mapping from latent variable to data, $\log P(X|z)$, and minimizing the difference between our simple distribution $Q(z|X)$ and the true latent distribution $P(z)$.
As we might already know, maximizing $E[\log P(X|z)]$ is a maximum likelihood estimation. We see it all the time in discriminative supervised models, for example Logistic Regression, SVM, or Linear Regression. In other words, given an input $z$ and an output $X$, we want to maximize the conditional distribution $P(X|z)$ under some model parameters. So we could implement it by using any classifier with input $z$ and output $X$, then optimize the objective function by using for example log loss or regression loss.
What about $D_{KL}[Q(z|X) \| P(z)]$? Here, $P(z)$ is the latent variable distribution. We might want to sample $P(z)$ later, so the easiest choice is $N(0, 1)$. Hence, we want to make $Q(z|X)$ as close as possible to $N(0, 1)$ so that we can sample it easily.
Having $P(z) = N(0, 1)$ also adds another benefit. Let's say we also want $Q(z|X)$ to be Gaussian with parameters $\mu(X)$ and $\Sigma(X)$, i.e. the mean and variance given $X$. 
Then, the KL divergence between those two distributions can be computed in closed form!
$$D_{KL}[N(\mu(X), \Sigma(X)) \| N(0, 1)] = \frac{1}{2} \left( \mathrm{tr}(\Sigma(X)) + \mu(X)^T \mu(X) - k - \log \det(\Sigma(X)) \right)$$
Above, $k$ is the dimension of our Gaussian. $\mathrm{tr}(X)$ is the trace function, i.e. the sum of the diagonal of matrix $X$. The determinant of a diagonal matrix can be computed as the product of its diagonal. So really, we can implement $\Sigma(X)$ as just a vector, as it's a diagonal matrix:
$$\begin{aligned} D_{KL}[N(\mu(X), \Sigma(X)) \| N(0, 1)] &= \frac{1}{2} \left( \sum_k \Sigma(X) + \sum_k \mu^2(X) - \sum_k 1 - \log \prod_k \Sigma(X) \right) \\ &= \frac{1}{2} \left( \sum_k \Sigma(X) + \sum_k \mu^2(X) - \sum_k 1 - \sum_k \log \Sigma(X) \right) \\ &= \frac{1}{2} \sum_k \left( \Sigma(X) + \mu^2(X) - 1 - \log \Sigma(X) \right) \end{aligned}$$
In practice, however, it's better to model $\Sigma(X)$ as $\log \Sigma(X)$, as it is more numerically stable to take the exponent than to compute the log. Hence, our final KL divergence term, where $\Sigma(X)$ now denotes the log-variance output by the network, is:
$$D_{KL}[N(\mu(X), \Sigma(X)) \| N(0, 1)] = \frac{1}{2} \sum_k \left( \exp(\Sigma(X)) + \mu^2(X) - 1 - \Sigma(X) \right)$$
Implementation in Keras
First, let's implement the encoder net $Q(z|X)$, which takes input $X$ and outputs two things: $\mu(X)$ and $\Sigma(X)$, the parameters of the Gaussian.
from tensorflow.examples.tutorials.mnist import input_data
from keras.layers import Input, Dense, Lambda
from keras.models import Model
from keras.objectives import binary_crossentropy
from keras.callbacks import LearningRateScheduler
import numpy as np
import matplotlib.pyplot as plt
import keras.backend as K
import tensorflow as tf

m = 50
n_z = 2
n_epoch = 10

# Q(z|X) -- encoder
inputs = Input(shape=(784,))
h_q = Dense(512, activation='relu')(inputs)
mu = Dense(n_z, activation='linear')(h_q)
log_sigma = Dense(n_z, activation='linear')(h_q)
That is, our $Q(z|X)$ is a neural net with one hidden layer. In this implementation, our latent variable is two dimensional, so that we could easily visualize it. 
In practice though, more dimensions in the latent variable should be better.
However, we are now facing a problem. How do we get $z$ from the encoder outputs? Obviously we could sample $z$ from a Gaussian whose parameters are the outputs of the encoder. Alas, sampling directly won't do if we want to train VAE with gradient descent, as the sampling operation doesn't have a gradient!
There is, however, a trick called the reparameterization trick, which makes the network differentiable. The reparameterization trick basically diverts the non-differentiable operation out of the network, so that, even though we still involve a thing that is non-differentiable, at least it is out of the network, hence the network can still be trained.
The reparameterization trick is as follows. Recall, if we have $x \sim N(\mu, \Sigma)$ and then standardize it so that $\mu = 0, \Sigma = 1$, we could revert it back to the original distribution by reverting the standardization process. Hence, we have this equation:
$$x = \mu + \Sigma^{\frac{1}{2}} x_{std}$$
With that in mind, we could extend it. If we sample from a standard normal distribution, we could convert it to any Gaussian we want if we know the mean and the variance. Hence we could implement our sampling operation of $z$ by:
$$z = \mu(X) + \Sigma^{\frac{1}{2}}(X) \, \epsilon$$
where $\epsilon \sim N(0, 1)$.
Now, during backpropagation, we don't care anymore about the sampling process, as it is now outside of the network, i.e. it doesn't depend on anything in the net, hence the gradient won't flow through it.
def sample_z(args):
    mu, log_sigma = args
    eps = K.random_normal(shape=(m, n_z), mean=0., std=1.)
    return mu + K.exp(log_sigma / 2) * eps

# Sample z ~ Q(z|X)
z = Lambda(sample_z)([mu, log_sigma])
Now we create the decoder net $P(X|z)$:
# P(X|z) -- decoder
decoder_hidden = Dense(512, activation='relu')
decoder_out = Dense(784, activation='sigmoid')
h_p = decoder_hidden(z)
outputs = decoder_out(h_p)
Lastly, from this model, we can do three things: reconstruct inputs, encode inputs into latent variables, and generate data from latent variables. So, we have three Keras models:
# Overall VAE model, for reconstruction and training
vae = Model(inputs, outputs)

# Encoder model, to encode input into latent variable
# We use the mean as the output as it is the center point, the representative of the gaussian
encoder = Model(inputs, mu)

# Generator model, generate new data given latent variable z
d_in = Input(shape=(n_z,))
d_h = decoder_hidden(d_in)
d_out = decoder_out(d_h)
decoder = Model(d_in, d_out)
Then, we need to translate our loss into Keras code:
def vae_loss(y_true, y_pred):
    """ Calculate loss = reconstruction loss + KL loss for each data in minibatch """
    # Reconstruction loss, i.e. -E[log P(X|z)]
    recon = K.sum(K.binary_crossentropy(y_pred, y_true), axis=1)
    # D_KL(Q(z|X) || P(z)); calculate in closed form as both dist. are Gaussian
    kl = 0.5 * K.sum(K.exp(log_sigma) + K.square(mu) - 1. - log_sigma, axis=1)
    return recon + kl
and then train it:
vae.compile(optimizer='adam', loss=vae_loss)
vae.fit(X_train, X_train, batch_size=m, nb_epoch=n_epoch)
And that's it, the implementation of VAE in Keras!
Implementation on MNIST Data
We could use any dataset really, but like always, we will use MNIST as an example. After we have trained our VAE model, we can visualize the latent variable space $Q(z|X)$:
As we can see, in the latent space, the representations of data that share the same characteristic, e.g. the same label, are close to each other. Notice that in the training phase, we never provide any label information regarding the data. 
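The closed-form KL term in `vae_loss` can be sanity-checked without Keras. The sketch below, in plain Python, compares the closed form for a one-dimensional Gaussian against a Monte Carlo estimate built from reparameterized samples $z = \mu + \sigma \epsilon$. The function names and the example values of `mu` and `log_var` are assumptions chosen for illustration; `log_var` plays the role of `log_sigma` in the Keras code.

```python
import math
import random


def kl_closed_form(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ) for a 1-D Gaussian."""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)


def log_normal_pdf(z, mean, var):
    """Log density of N(mean, var) evaluated at z."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)


def kl_monte_carlo(mu, log_var, num_samples=100000, seed=0):
    """Estimate the same KL as the mean of log q(z) - log p(z), z ~ q."""
    rng = random.Random(seed)
    var = math.exp(log_var)
    sigma = math.sqrt(var)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise drawn independently of mu, sigma
        z = mu + sigma * eps        # reparameterized sample from q
        total += log_normal_pdf(z, mu, var) - log_normal_pdf(z, 0.0, 1.0)
    return total / num_samples


mu, log_var = 1.0, math.log(0.25)
print("closed form:", kl_closed_form(mu, log_var))
print("monte carlo:", kl_monte_carlo(mu, log_var))
```

The two printed values should agree to roughly two decimal places. The `z = mu + sigma * eps` line is the same construction as `sample_z`, with the randomness isolated in `eps`.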
We could also look at the data reconstruction by running the data through the overall VAE net:
Lastly, we could generate new samples by first sampling $z \sim N(0, 1)$ and feeding it into our decoder net:
If we look closely at the reconstructed and generated data, we notice that some of the data are ambiguous. For example, the digit 5 looks like 3 or 8. That's because our latent variable space is a continuous distribution (i.e. $N(0, 1)$), hence there are bound to be some smooth transitions at the edges of the clusters. Also, the clusters of digits are close to each other if they are somewhat similar. That's why in the latent space, 5 is close to 3.
Conclusion
In this post we looked at the intuition behind Variational Autoencoder (VAE), its formulation, and its implementation in Keras. We also saw the difference between VAE and GAN, the two most popular generative models nowadays. For more math on VAE, be sure to hit the original paper by Kingma et al., 2014. There is also an excellent tutorial on VAE by Carl Doersch. Check out the references section below. The full code is available in my repo: https://github.com/wiseodd/generative-models
References
Doersch, Carl. "Tutorial on variational autoencoders." arXiv preprint arXiv:1606.05908 (2016).
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
https://blog.keras.io/building-autoencoders-in-keras.html
Resources | 55 Classic Examples Every TensorFlow Beginner Should Know 2017-05-27 全球人工智能 Source: GitHub. Editor: lily
This article is a collection of tutorials that implement popular machine learning algorithms in TensorFlow. The goal is to let readers get to know TensorFlow in depth through clear and concise examples. The examples suit beginners who want to implement some TensorFlow examples themselves. The tutorials also include notebooks and annotated code.
Step 1: A tutorial guide for TF newcomers
1: Prerequisites TF beginners need to understand
Introduction to machine learning (notebook): https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/0_Prerequisite/ml_introduction.ipynb
Introduction to the MNIST dataset (notebook): https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/0_Prerequisite/mnist_dataset_intro.ipynb
2: Introductory basics TF beginners need to know
Hello World: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/1_Introduction/helloworld.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/1_Introduction/helloworld.py
Basic operations: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/1_Introduction/basic_operations.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/1_Introduction/basic_operations.py
3: Basic models TF beginners need to master
Nearest neighbor: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/2_BasicModels/nearest_neighbor.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/2_BasicModels/nearest_neighbor.py
Linear regression: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/2_BasicModels/linear_regression.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/2_BasicModels/linear_regression.py
Logistic regression: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/2_BasicModels/logistic_regression.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/2_BasicModels/logistic_regression.py
4: Neural networks TF beginners should try
Multilayer perceptron: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/multilayer_perceptron.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/multilayer_perceptron.py
Convolutional neural network: 
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/convolutional_network.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py
Recurrent neural network (LSTM): https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/recurrent_network.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/recurrent_network.py
Bidirectional recurrent neural network (LSTM): https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/bidirectional_rnn.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/bidirectional_rnn.py
Dynamic recurrent neural network (LSTM): https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/dynamic_rnn.py
Autoencoder: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/autoencoder.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/autoencoder.py
5: Practical techniques TF beginners need to be proficient in
Saving and restoring a model: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/4_Utils/save_restore_model.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/4_Utils/save_restore_model.py
Graph and loss visualization: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/4_Utils/tensorboard_basic.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/4_Utils/tensorboard_basic.py
Tensorboard, advanced visualization: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/4_Utils/tensorboard_advanced.py
6: Multi-GPU basics TF beginners need to know
Basic operations on multiple GPUs: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/5_MultiGPU/multigpu_basics.ipynb https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/5_MultiGPU/multigpu_basics.py
7: Datasets required by the examples
Some examples need the MNIST dataset for training and testing. When these examples are run, the dataset is downloaded automatically (using input_data.py).
MNIST数据集笔记：https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/0_Prerequisite/mnist_dataset_intro.ipynb 官方网站：http://yann.lecun.com/exdb/mnist/ 第二步：为TF新手准备的各个类型的案例、模型和数据集 初步了解：TFLearn TensorFlow 接下来的示例来自TFLearn，这是一个为 TensorFlow 提供了简化的接口的库。里面有很多示例和预构建的运算和层。 使用教程：TFLearn 快速入门。通过一个具体的机器学习任务学习 TFLearn 基础。开发和训练一个深度神经网络分类器。 TFLearn地址：https://github.com/tflearn/tflearn 示例：https://github.com/tflearn/tflearn/tree/master/examples 预构建的运算和层：http://tflearn.org/doc_index/#api 笔记：https://github.com/tflearn/tflearn/blob/master/tutorials/intro/quickstart.md 基础模型以及数据集 线性回归，使用 TFLearn 实现线性回归 https://github.com/tflearn/tflearn/blob/master/examples/basics/linear_regression.py 逻辑运算符。使用 TFLearn 实现逻辑运算符 https://github.com/tflearn/tflearn/blob/master/examples/basics/logical.py 权重保持。保存和还原一个模型 https://github.com/tflearn/tflearn/blob/master/examples/basics/weights_persistence.py 微调。在一个新任务上微调一个预训练的模型 https://github.com/tflearn/tflearn/blob/master/examples/basics/finetuning.py 使用 HDF5。使用 HDF5 处理大型数据集 https://github.com/tflearn/tflearn/blob/master/examples/basics/use_hdf5.py 使用 DASK。使用 DASK 处理大型数据集 https://github.com/tflearn/tflearn/blob/master/examples/basics/use_dask.py 计算机视觉模型及数据集 多层感知器。一种用于 MNIST 分类任务的多层感知实现 https://github.com/tflearn/tflearn/blob/master/examples/images/dnn.py 卷积网络（MNIST）。用于分类 MNIST 数据集的一种卷积神经网络实现 https://github.com/tflearn/tflearn/blob/master/examples/images/convnet_mnist.py 卷积网络（CIFAR-10）。用于分类 CIFAR-10 数据集的一种卷积神经网络实现 https://github.com/tflearn/tflearn/blob/master/examples/images/convnet_cifar10.py 网络中的网络。用于分类 CIFAR-10 数据集的 Network in Network 实现 https://github.com/tflearn/tflearn/blob/master/examples/images/network_in_network.py Alexnet。将 Alexnet 应用于 Oxford Flowers 17 分类任务 https://github.com/tflearn/tflearn/blob/master/examples/images/alexnet.py VGGNet。将 VGGNet 应用于 Oxford Flowers 17 分类任务 https://github.com/tflearn/tflearn/blob/master/examples/images/vgg_network.py VGGNet Finetuning (Fast Training)。使用一个预训练的 VGG 网络并将其约束到你自己的数据上，以便实现快速训练 
https://github.com/tflearn/tflearn/blob/master/examples/images/vgg_network_finetuning.py RNN Pixels。使用 RNN（在像素的序列上）分类图像 https://github.com/tflearn/tflearn/blob/master/examples/images/rnn_pixels.py Highway Network。用于分类 MNIST 数据集的 Highway Network 实现 https://github.com/tflearn/tflearn/blob/master/examples/images/highway_dnn.py Highway Convolutional Network。用于分类 MNIST 数据集的 Highway Convolutional Network 实现 https://github.com/tflearn/tflearn/blob/master/examples/images/convnet_highway_mnist.py Residual Network (MNIST) 。应用于 MNIST 分类任务的一种瓶颈残差网络（bottleneck residual network） https://github.com/tflearn/tflearn/blob/master/examples/images/residual_network_mnist.py Residual Network (CIFAR-10)。应用于 CIFAR-10 分类任务的一种残差网络 https://github.com/tflearn/tflearn/blob/master/examples/images/residual_network_cifar10.py Google Inception（v3）。应用于 Oxford Flowers 17 分类任务的谷歌 Inception v3 网络 https://github.com/tflearn/tflearn/blob/master/examples/images/googlenet.py 自编码器。用于 MNIST 手写数字的自编码器 https://github.com/tflearn/tflearn/blob/master/examples/images/autoencoder.py 自然语言处理模型及数据集 循环神经网络（LSTM），应用 LSTM 到 IMDB 情感数据集分类任 https://github.com/tflearn/tflearn/blob/master/examples/nlp/lstm.py 双向 RNN（LSTM），将一个双向 LSTM 应用到 IMDB 情感数据集分类任务： https://github.com/tflearn/tflearn/blob/master/examples/nlp/bidirectional_lstm.py 动态 RNN（LSTM），利用动态 LSTM 从 IMDB 数据集分类可变长度文本： https://github.com/tflearn/tflearn/blob/master/examples/nlp/dynamic_lstm.py 城市名称生成，使用 LSTM 网络生成新的美国城市名： https://github.com/tflearn/tflearn/blob/master/examples/nlp/lstm_generator_cityname.py 莎士比亚手稿生成，使用 LSTM 网络生成新的莎士比亚手稿： https://github.com/tflearn/tflearn/blob/master/examples/nlp/lstm_generator_shakespeare.py Seq2seq，seq2seq 循环网络的教学示例： https://github.com/tflearn/tflearn/blob/master/examples/nlp/seq2seq_example.py CNN Seq，应用一个 1-D 卷积网络从 IMDB 情感数据集中分类词序列 https://github.com/tflearn/tflearn/blob/master/examples/nlp/cnn_sentence_classification.py 强化学习案例 Atari Pacman 1-step Q-Learning，使用 1-step Q-learning 教一台机器玩 Atari 游戏： 
https://github.com/tflearn/tflearn/blob/master/examples/reinforcement_learning/atari_1step_qlearning.py 第三步：为TF新手准备的其他方面内容 Recommender-Wide&Deep Network，推荐系统中 wide & deep 网络的教学示例： https://github.com/tflearn/tflearn/blob/master/examples/others/recommender_wide_and_deep.py Spiral Classification Problem，对斯坦福 CS231n spiral 分类难题的 TFLearn 实现： https://github.com/tflearn/tflearn/blob/master/examples/notebooks/spiral.ipynb 层，与 TensorFlow 一起使用 TFLearn 层： https://github.com/tflearn/tflearn/blob/master/examples/extending_tensorflow/layers.py 训练器，使用 TFLearn 训练器类训练任何 TensorFlow 图： https://github.com/tflearn/tflearn/blob/master/examples/extending_tensorflow/layers.py Bulit-in Ops，连同 TensorFlow 使用 TFLearn built-in 操作： https://github.com/tflearn/tflearn/blob/master/examples/extending_tensorflow/builtin_ops.py Summaries，连同 TensorFlow 使用 TFLearn summarizers： https://github.com/tflearn/tflearn/blob/master/examples/extending_tensorflow/summaries.py Variables，连同 TensorFlow 使用 TFLearn Variables： https://github.com/tflearn/tflearn/blob/master/examples/extending_tensorflow/variables.py
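Attention Models in Natural Language Processing: What They Are and Why
2017-07-13, 张俊林 (Zhang Junlin), 待字闺中

If you follow deep learning research in natural language processing, you have surely heard of the Attention Model (sometimes abbreviated AM below). AM is one of the important advances in NLP over the past year and has proved effective in many settings. It sounds fancy, but its basic idea is quite intuitive and simple. I swear by the lamp: after reading this rambling article and its sequel, you will understand thoroughly what AM is and will easily read the seemingly complicated math in any related paper. Not a bad sales pitch, right, especially for readers who get the shakes at the sight of a formula.

Before the main show starts, a digression.

|Introduction and digression

You have probably heard men caught cheating sigh that a woman's sixth sense is usually accurate (the woman here generally being the man's wife or girlfriend, or possibly a boyfriend of feminine disposition). I would say a man's sixth sense is not bad either (the "man" here being specifically the author, not the men just mentioned; stated to avoid confusion). The first time I saw the name Attention Model in machine learning, my first instinct was that the concept was borrowed from the model of human attention in cognitive psychology. Years ago, in my young and foolish days, I was obsessed for a while with how the brain works and read a lot of books and papers on cognitive psychology, where the attention model usually gets a chapter of its own. Allow me to show off my erudition.

Attention is an interesting thing that is easy to overlook. Let us get an intuitive feel for the attention model in the human brain. First, open your eyes and confirm that you are awake. Second, find the most recent occurrence of the words "Attention Model" in this article and stare at it for three seconds. Now suppose time stops. What do you see in your eyes and mind during those three seconds? Right, the two words "Attention Model". Yet you should realize that your eyes actually take in a whole picture beyond those two words; while you stare, time stands still, everything falls silent, as if there were only you and me in the world... sorry, wrong scene, as if there were only the two words "Attention Model". That is the brain's attention model: you see the whole picture, but at a particular moment t the focus of your awareness and attention is on one part of it. The other parts are still in view, but you allocate very few attention resources to them. As long as your eyes are open, the attention model is at work all the time. When you cross the street, more of your attention goes to the traffic lights and passing cars, even though you see the whole world. When you "accidentally" run into the person you fancy, more of your attention goes to that radiant person, even though you see the whole world, which might as well not exist.

This is the brain's attention model. At bottom it is a resource allocation model: at any given moment your attention concentrates on some focal part of the picture and ignores the rest.

The attention model in deep learning works exactly like the hormone-driven attention allocation you experience when you see someone who makes your heart race.

Enough warm-up. On with the show.

|Encoder-Decoder framework

This article only covers AM in text processing. AM is also used in many image processing and (image, caption) generation tasks, but we stick to text here; the mechanism in the image domain is the same.

To discuss AM in text processing we must first discuss the Encoder-Decoder framework, because most AM in the literature is attached to it. That said, AM is a general idea and does not itself depend on the Encoder-Decoder model; keep this in mind.

The Encoder-Decoder framework can be seen as a research paradigm in text processing with an extremely wide range of applications. It deserves a detailed treatment of its own, but since the focus here is AM, we only cover what we must and leave the details for a separate article. The figure below is the most abstract representation of the Encoder-Decoder framework commonly used in text processing:

Figure 1. The abstract Encoder-Decoder framework

Intuitively, the framework is a general model for generating one sentence (or document) from another. For a sentence pair $\langle X, Y \rangle$, the goal is to generate the target sentence Y from the input sentence X. X and Y can be in the same language or in two different languages. Each is a sequence of words:

$X = \langle x_1, x_2, \dots, x_m \rangle$, $Y = \langle y_1, y_2, \dots, y_n \rangle$

The Encoder, as the name says, encodes the input sentence X, turning it into an intermediate semantic representation C through a nonlinear transformation:

$C = F(x_1, x_2, \dots, x_m)$

The Decoder's task is to generate the word $y_i$ for time step i from the intermediate semantic representation C of sentence X and the history $y_1, y_2, \dots, y_{i-1}$ generated so far:

$y_i = G(C, y_1, y_2, \dots, y_{i-1})$

Each $y_i$ is produced in turn, so the system as a whole appears to generate the target sentence Y from the input sentence X.

Encoder-Decoder is a very general computational framework. Which models serve as Encoder and Decoder is up to the researcher; common choices are CNN, RNN, BiRNN, GRU, LSTM, Deep LSTM and so on. The combinations are many, and a new combination may well be enough for a paper, so innovation in research is sometimes this simple. If I use a CNN as Encoder and an RNN as Decoder, and you use a BiRNN as Encoder and a deep LSTM as Decoder, that counts as something new. So students standing on the roof, desperate for a paper to graduate, may come down now (take the stairs, please, do not jump). Study this model, try the permutations, and once you propose a new combination that is shown to work, congratulations: you may graduate.

I digress. Back to the topic.

Encoder-Decoder is a powerful engine for this innovation game. On one hand, as just said, you can mix and match models. On the other, its applications are countless. In machine translation, $\langle X, Y \rangle$ are sentences in different languages, for example X an English sentence and Y its Chinese translation. In text summarization, X is an article and Y its summary. In dialogue systems, X is what someone says and Y is the bot's reply. And so on; there are too many to list. You on the roof, listen to this old monk and come down; countless innovations are waiting for you.

|Attention Model

The Encoder-Decoder model in Figure 1 shows no "attention model", so think of it as a distracted model whose attention is unfocused. Why unfocused? Look at how each word of the target sentence Y is generated:

$y_1 = f(C)$, $y_2 = f(C, y_1)$, $y_3 = f(C, y_1, y_2)$

Here f is the nonlinear transformation of the decoder. Whichever target word is being generated, $y_1$, $y_2$ or $y_3$, the semantic encoding C of sentence X is the same. C is produced by the Encoder from every word of X, so every word of X has the same influence on any target word $y_i$. (Strictly speaking, if the Encoder is an RNN, later input words have more influence, so the weights are not equal. This is probably why Google, when proposing the Sequence to Sequence model, found the small trick that feeding the input sentence in reverse order gives better translations.) That is why we say this model shows no attention. It is like seeing the picture in front of you without focusing on anything.

Machine translation makes the distracted Encoder-Decoder easier to understand. Suppose the input is the English sentence "Tom chase Jerry" and the framework generates the Chinese words “汤姆” (Tom), “追逐” (chase), “杰瑞” (Jerry) one by one. When translating “杰瑞”, every English word contributes equally in the distracted model. That is clearly unreasonable: "Jerry" obviously matters more for producing “杰瑞”, but the distracted model cannot express this, which is why we say it has no attention. A model without attention is probably fine for short input sentences. For long inputs, however, all the semantics are squeezed into one intermediate vector, the information of individual words vanishes, and many details are lost. This is an important reason for introducing the attention model.

With AM in the example above, translating “杰瑞” should reflect how much each English word influences the current Chinese word, for instance through a probability distribution like:

(Tom, 0.3) (Chase, 0.2) (Jerry, 0.5)

Each probability is the amount of attention the model allocates to that English word while translating “杰瑞”. This certainly helps translate the target word correctly, because it brings in new information. Likewise, every word of the target sentence should learn its own attention distribution over the words of the source sentence. So when generating each word $y_i$, the intermediate representation C that used to be the same for all words is replaced by a $C_i$ that changes with the word being generated. This is the key to understanding AM: the fixed intermediate semantic representation C becomes a $C_i$ adjusted by the attention model for the current output word. The Encoder-Decoder framework with AM is shown in Figure 2.

Figure 2. The Encoder-Decoder framework with AM

Generating the target words now takes the form:

$y_1 = f_1(C_1)$, $y_2 = f_1(C_2, y_1)$, $y_3 = f_1(C_3, y_1, y_2)$

Each $C_i$ may correspond to a different attention distribution over the source words. For the English-Chinese translation above, using the two distributions given in this article:

$C_{汤姆} = g(0.6 \cdot f_2(\text{Tom}),\ 0.2 \cdot f_2(\text{Chase}),\ 0.2 \cdot f_2(\text{Jerry}))$
$C_{杰瑞} = g(0.3 \cdot f_2(\text{Tom}),\ 0.2 \cdot f_2(\text{Chase}),\ 0.5 \cdot f_2(\text{Jerry}))$

Here $f_2$ is the transformation the Encoder applies to an input English word; if the Encoder is an RNN, its result is usually the hidden state after reading input $x_i$ at that time step. g is the function by which the Encoder combines the per-word representations into the intermediate semantic representation of the whole sentence. In common practice g is a weighted sum of its elements, the formula you often see in papers:

$C_i = \sum_{j=1}^{T_x} a_{ij} h_j$

Suppose the i in $C_i$ is “汤姆” above. Then $T_x$ is 3, the length of the input sentence; $h_1 = f_2(\text{Tom})$, $h_2 = f_2(\text{Chase})$, $h_3 = f_2(\text{Jerry})$; and the attention weights are 0.6, 0.2 and 0.2. So g is just a weighted sum. Pictorially, the formation of $C_i$ when translating “汤姆” looks like the figure below:

Figure 3. How $C_i$ is formed

One question remains. When generating a target word such as “汤姆”, how do we know the attention distribution over the input words that AM needs? That is, how is the distribution for “汤姆”,

(Tom, 0.6) (Chase, 0.2) (Jerry, 0.2)

obtained?

To explain, let us refine the non-AM Encoder-Decoder framework of Figure 1 by using an RNN for the Encoder and an RNN for the Decoder, a fairly common configuration. Figure 1 then becomes:

Figure 4. The Encoder-Decoder framework with RNNs as the concrete models

The following figure then illustrates the general procedure for computing the attention distribution:

Figure 5. Computing the AM attention probabilities

With an RNN Decoder, when we are about to generate $y_i$ at time i, we know the decoder's hidden state $H_i$ before $y_i$ is produced. Our goal is the attention distribution of the input words "Tom", "Chase", "Jerry" with respect to $y_i$. So we compare $H_i$ with the RNN hidden state $h_j$ of each input word in turn, that is, we use a function $F(h_j, H_i)$ to obtain the alignment likelihood between the target word $y_i$ and each input word. Different papers may define F differently. The outputs of F are then normalized by Softmax, which yields attention values that form a valid probability distribution. Figure 5 shows the alignment probabilities of the input words when the output word is “汤姆”. Most AM follow this framework to compute the attention distribution and differ only in the definition of F.

The above is the basic idea of the Soft Attention Model so often mentioned in papers. Most AM you see in the literature are essentially this model; the difference is usually just the application it is used for. How should we understand what AM means physically? The literature generally treats AM as a word alignment model, which makes a lot of sense. The probability distribution over input words for each generated target word can be read as the alignment probability between the input words and that target word. This is very intuitive in machine translation: traditional statistical machine translation usually has a dedicated phrase alignment step, and the attention model plays the same role. In other applications, too, it is natural to read AM as alignment probabilities between the words of the input and target sentences.

Conceptually, I think it is equally reasonable to read AM as an influence model: when generating a target word, how much does each input word influence it? This is another easy way to think about the physical meaning of AM.

Figure 6 is a very intuitive AM example from the paper "A Neural Attention Model for Sentence Summarization", in which Rush uses AM for abstractive summarization.

Figure 6. Abstractive sentence summarization example

In this example the input sentence of the Encoder-Decoder framework is "russian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism", shown on the vertical axis of the figure. The summary generated by the system is "russia calls for joint front against terrorism", shown on the horizontal axis. The model has correctly extracted the core of the sentence. Each column of the matrix gives the AM probabilities that the generated target word assigns to each input word; the darker the color, the larger the probability. This example helps a great deal in understanding AM intuitively.

Finally, an advertisement: next week there will be a sequel to this article, discussing two modes of research innovation from the viewpoint of AM. Stay tuned, and thank you.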
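The soft-attention computation described above (score each encoder state against the decoder state with F, normalize with Softmax, then take the weighted sum $C_i = \sum_j a_{ij} h_j$) can be sketched in a few lines of plain Python. This is only an illustration: the dot product used for F and the toy vectors are assumptions, since the article leaves F open and gives no numeric states.

```python
import math


def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product, used here as a stand-in for F(h_j, H_i) (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(decoder_state, encoder_states, score=dot):
    """Return (weights, context) for one decoding step.

    weights[j] is the attention a_ij given to encoder state h_j;
    context is the weighted sum C_i = sum_j a_ij * h_j.
    """
    weights = softmax([score(h, decoder_state) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Toy encoder states for "Tom", "Chase", "Jerry" (made-up numbers)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]  # hypothetical H_i
weights, context = attention_context(decoder_state, encoder_states)
print(weights, context)
```

Swapping `score` for another function (for example a small feed-forward network) changes F without touching the rest, which matches the remark that most attention models differ only in how F is defined.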
NLP: How to install NLTK on Ubuntu
1. Open the NLTK page on PyPI and download the source package: https://pypi.python.org/pypi/nltk
2. Unpack the package, change into its directory and install it from the shell:
$ cd /home/wangxiao/nltk-3.2.4
$ python setup.py install   ## NLTK is installed at this point
3. Start Python and check the installation:
$ python
>>> import nltk   ## if no error is shown, NLTK has been installed successfully
>>> nltk.download()   ## install the language packages
In the downloader, select "all"; you will then see the output below ...
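As a small complement to the manual `import nltk` check, the sketch below tests whether a module can be found without actually importing it. The helper name `is_installed` is made up for this note; `importlib.util.find_spec` is part of the Python 3 standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("nltk"):
    print("nltk is importable")
else:
    print("nltk not found; re-run the install step")
```

This only confirms that the package is visible to the interpreter that runs the script; the language data still has to be fetched with `nltk.download()`.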
Perceptual Generative Adversarial Networks for Small Object Detection
2017-07-11 19:47:46
CVPR 2017

This paper uses a GAN to tackle small object detection, which is a very hard problem in general object detection. As the figures in the paper show, small objects and large objects usually have different representations at the feature level. It is therefore possible to use a Perceptual GAN to super-resolve the feature maps of small objects and obtain better detection performance.

The model consists of two subnetworks, i.e., a generator network and a perceptual discriminator network. Specifically, the generator is a deep residual based feature generative model which transforms the original poor features of small objects to highly discriminative ones by introducing fine-grained details from lower-level layers, achieving “super-resolution” on the intermediate representations. Different from a normal GAN, this network also introduces a new perceptual loss tailored to the detection purpose. That is to say, the discriminator not only needs to deal with the adversarial loss, but also needs to justify the detection accuracy benefiting from the generated super-resolved features with a perceptual loss.

The contributions claimed by the authors:
(1) We are the first to successfully apply GAN-alike models to solve the challenging small-scale object detection problems.
(2) We introduce a new conditional generator model that learns the additive residual representation between large and small objects, instead of generating the complete representations as before.
(3) We introduce a new perceptual discriminator that provides more comprehensive supervision beneficial for detections, instead of barely differentiating fake and real.
(4) Successful applications on traffic sign detection and pedestrian detection have been achieved with the state-of-the-art performance.

Figure 2. Training procedure of object detection network based on the Perceptual GAN.

As shown in Figure 2, the generator network aims to generate a super-resolved representation for the small object. The discriminator includes two branches:
1. the adversarial branch, for differentiating between the generated super-resolved representation and the original one for the large object;
2. the perception branch, for justifying the detection accuracy benefiting from the generated representation.

==>> Discriminative Network Architecture:
The D network needs to justify the detection accuracy benefiting from the generated super-resolved feature. Given the adversarial loss $L_{dis_a}$ and the perceptual loss $L_{dis_p}$, the final loss $L_{dis}$ is produced as a weighted sum of the two individual components. Given weighting parameters $w_1$ and $w_2$, we define $L_{dis} = w_1 \times L_{dis_a} + w_2 \times L_{dis_p}$ to encourage the generator network to generate super-resolved representations with high detection accuracy. Here both $w_1$ and $w_2$ are set to one.
Paper notes: Natural Language Object Retrieval
2017-07-10 16:50:43

The paper aims to localize and recognize an object in an image from a given text description.

The authors stress one difference: Natural language object retrieval differs from text-based image retrieval task as it involves spatial information about objects within the scene and global scene context. The paper uses a recurrent network that takes the query text, local image descriptors, spatial configurations and global context features as input and outputs a score for how well the text matches each proposal. At the same time, visual-linguistic knowledge can be carried over from the image captioning domain to this task.

The authors find that directly applying a text-based image retrieval system to this task does not work well, because natural language object retrieval involves the spatial information of objects and the global information of the scene. Using an RNN as the scoring function has the following advantages:
1. The whole model can be trained end to end by backpropagation, so that visual feature extraction and text sequence embedding can influence each other. Experiments show that this approach works much better than bag of words.
2. Large image-text datasets can easily be used to learn a vision-language model that helps with the task.

The task has one major challenge, however: the lack of large scale datasets with annotated object bounding box and description pairs. To address this issue, we show that it allows us to transfer visual-linguistic knowledge learned from the former task to the latter one by first pretraining on the image caption domain and then adapting it to the natural language object retrieval domain. This pre-training and adaptation procedure not only improves performance but also avoids overfitting, especially when the object retrieval training dataset is small.

In short, training works as follows: given the image, the location of the bounding box and the query text, the network makes a prediction, and the loss compares this prediction with the given language description. This trains the whole network.

At test time, proposals take the place of the ground-truth image patch, and the language model scores each proposal. The best-scoring proposal is chosen as the detection result.
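The test-time step (score every proposal against the query, keep the best) can be sketched as below. It assumes the score of a proposal is the log-likelihood of the query words under the model for that proposal; the per-word probabilities are invented numbers standing in for the recurrent network's outputs, and the function names are hypothetical.

```python
import math


def sentence_log_prob(word_probs):
    """Score of a query: the sum of the log-probabilities of its words."""
    return sum(math.log(p) for p in word_probs)


def best_proposal(proposal_word_probs):
    """Return (proposal_id, score) for the highest-scoring proposal."""
    scores = {pid: sentence_log_prob(ps) for pid, ps in proposal_word_probs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical probabilities of the three query words under each proposal
proposals = {
    "box_0": [0.10, 0.20, 0.30],
    "box_1": [0.60, 0.70, 0.50],
    "box_2": [0.40, 0.10, 0.90],
}
best_id, best_score = best_proposal(proposals)
print(best_id, best_score)
```

Summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow on long queries.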
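In Defense of the Triplet Loss for Person Re-Identification
2017-07-02 14:04:20
This blog comes from: http://blog.csdn.net/shuzfan/article/details/70069822
Paper: https://arxiv.org/abs/1703.07737
Github: https://github.com/VisualComputingInstitute/triplet-reid

Introduction
Re-ID is somewhat similar to image retrieval. Seen this way, the embedding features that Google's FaceNet learns with a Triplet Loss look well suited to this kind of fast, large-scale matching.
Yet much of the literature reports that the usual classification loss, possibly combined with a verification loss, works better for this task than the Triplet Loss. These methods typically use the CNN as a feature extractor and attach a dedicated metric model afterwards. Both losses have clear drawbacks:
Classification Loss: when the number of identities is large, the number of network parameters grows sharply, and many of those parameters are discarded once training is over.
Verification Loss: it can only judge the similarity of two images as a pair, so it is hard to apply to clustering and retrieval, because pairwise comparison is too slow.
The Triplet Loss remains attractive: it is end to end, simple and direct; it has a built-in clustering property; and it yields compact embedding features.
So why does training with triplets go badly, or why is it hard?
First, hard mining is an important step in triplet training. Without hard mining, training stalls and converges to poor results; mining samples that are too hard makes training unstable and convergence difficult. Hard mining is also time-consuming, and there is no clear definition of what a "Good Hard" sample is.
The paper has two main contributions:
(1) It designs a new Triplet Loss and compares it with other variants.
(2) It analyzes experimentally whether a pre-trained model is needed.

Triplet Loss
This section introduces several Triplet variants.
Large Margin Nearest Neighbor loss
An early form of the triplet loss (reference [1]). \(L_{pull}\) pulls samples of the same identity together; \(L_{push}\) pushes samples of different identities apart. Because it targets nearest-neighbor classification, one class may contain several clusters, and fixed cluster centers are hard to determine.
FaceNet Triplet Loss
Google's face verification model FaceNet (reference [2]) does not require all samples of a class to gather around one point. It only requires the distance within a class to be smaller than the distance between classes, which fits face verification perfectly.
Batch All Triplet Loss
When training with the FaceNet Triplet Loss, the data is arranged in groups of three. With batch_size = 3B there are actually up to \(6B^2-4B\) valid triplet combinations, so using only B of them is wasteful.
We can therefore change how the data is organized: \(batch\ size = P\times K\), that is, randomly pick P people and K images of each. This gives \(PK(PK-K)(K-1)\) triplets in total, and the loss sums over all of them.
Batch Hard Triplet Loss
Batch All looks as if it processes a great many triplets at once, but it has an awkward side: on very large datasets training becomes very slow, and as training proceeds many triplets become "useless" because they are already easy to separate.
What to do? Hard mining. But triplets that are too hard make training unstable. What then? Mine moderately hard ones.
The authors define a "moderately hard" Triplet Loss: for each anchor, take the hardest positive and the hardest negative. It is only "moderately" hard because the hardest samples are chosen within a small batch. Here \(x_j^i\) denotes the \(j\)-th image of the \(i\)-th person.
Lifted Embedding Loss
For batches arranged in groups of three, reference [3] proposes a new loss that treats every sample other than the anchor-positive pair as a negative and optimizes a smooth bound of the loss. The paper adapts it slightly to the \(batch\ size = P\times K\) layout.

Distance Measure
Many related works use the squared Euclidean distance \(D(a,b) = \|a-b\|_2^2\) as the metric. The authors did not compare metrics systematically, but found in their experiments that the non-squared Euclidean distance \(D(a,b) = \|a-b\|_2\) behaves more stably. The non-squared distance also makes the margin parameter easier to interpret.

Soft-margin
Many earlier Triplet Losses use a hard cut-off: if the triplet relation is already satisfied, the loss is exactly 0. The authors find that for Re-ID it pays to keep pulling samples of the same identity together. They therefore use the soft-margin function \(s(x) = \ln(1+e^x)\).

Experiments
Comparing the Triplet Loss variants
(1) \(L_{tri}\) without hard mining usually gives poor models. Adding simple offline hard-mining (OHM) makes results very unstable: sometimes very good, sometimes a complete collapse.
(2) The Batch Hard form \(L_{BH}\) performs better overall than the Batch All form \(L_{BA}\). The authors conjecture that late in training many triplet losses are 0, so averaging dilutes the little useful information that remains. To test this, they average only over the non-zero terms, denoted \(L_{BA\neq 0}\), and find that results do improve.
(3) In the authors' Re-ID experiments, Batch Hard + soft-margin works best, but there is no guarantee that this combination is also best for other tasks; that needs more experiments.

To Pretrain or not to Pretrain?
TriNet is built from a pre-trained model; LuNet is a plain network designed by the authors.
According to the paper's results table, a pre-trained model does give somewhat better results, but a network trained from scratch is not much worse.
In particular, pre-trained models tend to be large and rigid, less flexible than a network you design yourself. A pre-trained model also usually has a fixed input format, and changing the input may well make results worse.

Trick
(1) There is no need to normalize the output features.
(2) With hard mining, the loss curve alone often gives a misleading picture of training progress. The authors recommend watching the number of active triplets in a batch, or the distances between all pairs.
(3) The initial margin should not be too large.

References
[1] K. Q. Weinberger and L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR, 10:207–244, 2009.
[2] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In CVPR, 2015.
[3] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep Metric Learning via Lifted Structured Feature Embedding. In CVPR, 2016.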
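To make the Batch Hard idea concrete, here is a minimal plain-Python sketch, not the authors' code. For every anchor it takes the farthest sample with the same label and the closest sample with a different label, then applies either the hinge with a margin or the soft-margin \(s(x) = \ln(1+e^x)\). Summing over anchors, the toy embeddings and the function names are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Non-squared Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def soft_margin(x):
    """s(x) = ln(1 + e^x), written in a numerically safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def batch_hard_loss(embeddings, labels, margin=None):
    """Batch-hard triplet loss summed over all anchors in the batch.

    margin=None uses the soft-margin; a float uses the hinge max(0, margin + x).
    """
    total = 0.0
    for a, (fa, ya) in enumerate(zip(embeddings, labels)):
        pos = [euclidean(fa, fp)
               for p, (fp, yp) in enumerate(zip(embeddings, labels))
               if yp == ya and p != a]
        neg = [euclidean(fa, fn)
               for fn, yn in zip(embeddings, labels) if yn != ya]
        if not pos or not neg:
            continue  # this anchor cannot form a triplet
        x = max(pos) - min(neg)  # hardest positive minus hardest negative
        total += soft_margin(x) if margin is None else max(0.0, margin + x)
    return total


def count_batch_all_triplets(P, K):
    """Number of triplets in a batch of P identities with K images each."""
    return P * K * (P * K - K) * (K - 1)


# Two identities, two images each, already well separated (toy values)
embeddings = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]]
labels = [0, 0, 1, 1]
hinge = batch_hard_loss(embeddings, labels, margin=0.5)
soft = batch_hard_loss(embeddings, labels)
print(hinge, soft, count_batch_all_triplets(2, 2))
```

On this well-separated batch the hinge loss is exactly zero while the soft-margin loss is still positive, which is the behavior the Soft-margin section describes: the soft version keeps pulling same-identity samples together after the margin is met.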
Attention in Long Short-Term Memory Recurrent Neural Networks by Jason Brownlee on June 30, 2017 in Deep Learning The Encoder-Decoder architecture is popular because it has demonstrated state-of-the-art results across a range of domains. A limitation of the architecture is that it encodes the input sequence to a fixed-length internal representation. This imposes limits on the length of input sequences that can be reasonably learned and results in worse performance for very long input sequences. In this post, you will discover the attention mechanism for recurrent neural networks that seeks to overcome this limitation. After reading this post, you will know: The limitation of the encoder-decoder architecture and the fixed-length internal representation. The attention mechanism to overcome the limitation that allows the network to learn where to pay attention in the input sequence for each item in the output sequence. 5 applications of the attention mechanism with recurrent neural networks in domains such as text translation, speech recognition, and more. Let’s get started. Attention in Long Short-Term Memory Recurrent Neural Networks. Photo by Jonas Schleske, some rights reserved. Problem With Long Sequences The encoder-decoder recurrent neural network is an architecture where one set of LSTMs learns to encode input sequences into a fixed-length internal representation, and a second set of LSTMs reads the internal representation and decodes it into an output sequence. This architecture has shown state-of-the-art results on difficult sequence prediction problems like text translation and quickly became the dominant approach. For example, see: Sequence to Sequence Learning with Neural Networks, 2014 Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014 The encoder-decoder architecture still achieves excellent results on a wide range of problems. 
Nevertheless, it suffers from the constraint that all input sequences are forced to be encoded to a fixed length internal vector. This is believed to limit the performance of these networks, especially when considering long input sequences, such as very long sentences in text translation problems. A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus. — Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015 Attention within Sequences Attention is the idea of freeing the encoder-decoder architecture from the fixed-length internal representation. This is achieved by keeping the intermediate outputs from the encoder LSTM from each step of the input sequence and training the model to learn to pay selective attention to these inputs and relate them to items in the output sequence. Put another way, each item in the output sequence is conditional on selective items in the input sequence. Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words. … it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. 
— Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015 This increases the computational burden of the model, but results in a more targeted and better-performing model. In addition, the model is also able to show how attention is paid to the input sequence when predicting the output sequence. This can help in understanding and diagnosing exactly what the model is considering and to what degree for specific input-output pairs. The proposed approach provides an intuitive way to inspect the (soft-)alignment between the words in a generated translation and those in a source sentence. This is done by visualizing the annotation weights… Each row of a matrix in each plot indicates the weights associated with the annotations. From this we see which positions in the source sentence were considered more important when generating the target word. — Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015 Problem with Large Images Convolutional neural networks applied to computer vision problems also suffer from similar limitations, where it can be difficult to learn models on very large images. As a result, a series of glimpses can be taken of a large image to formulate an approximate impression of the image before making a prediction. One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene, guiding future eye movements and decision making. — Recurrent Models of Visual Attention, 2014 These glimpse-based modifications may also be considered attention, but are not considered in this post. See the papers. 
Recurrent Models of Visual Attention, 2014 DRAW: A Recurrent Neural Network For Image Generation, 2014 Multiple Object Recognition with Visual Attention, 2014 5 Examples of Attention in Sequence Prediction This section provides some specific examples of how attention is used for sequence prediction with recurrent neural networks. 1. Attention in Text Translation The motivating example mentioned above is text translation. Given an input sequence of a sentence in French, translate and output a sentence in English. Attention is used to pay attention to specific words in the input sequence for each word in the output sequence. We extended the basic encoder–decoder by letting a model (soft-)search for a set of input words, or their annotations computed by an encoder, when generating each target word. This frees the model from having to encode a whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word. — Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015 Attentional Interpretation of French to English Translation. Taken from Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015 2. Attention in Image Descriptions Different from the glimpse approach, the sequence-based attentional mechanism can be applied to computer vision problems to help get an idea of how to best use the convolutional neural network to pay attention to images when outputting a sequence, such as a caption. Given an input of an image, output an English description of the image. Attention is used to focus on different parts of the image for each word in the output sequence. 
We propose an attention based approach that gives state of the art performance on three benchmark datasets … We also show how the learned attention can be exploited to give more interpretability into the models generation process, and demonstrate that the learned alignments correspond very well to human intuition. — Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2016 Attentional Interpretation of Output Words to Specific Regions on the Input Images. Taken from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2016 3. Attention in Entailment Given a premise scenario and a hypothesis about the scenario in English, output whether the premise contradicts, is not related, or entails the hypothesis. For example: premise: “A wedding party taking pictures” hypothesis: “Someone got married” Attention is used to relate each word in the hypothesis to words in the premise, and vice versa. We present a neural model based on LSTMs that reads two sentences in one go to determine entailment, as opposed to mapping each sentence independently into a semantic space. We extend this model with a neural word-by-word attention mechanism to encourage reasoning over entailments of pairs of words and phrases. … An extension with word-by-word neural attention surpasses this strong benchmark LSTM result by 2.6 percentage points, setting a new state-of-the-art accuracy… — Reasoning about Entailment with Neural Attention, 2016 Attentional Interpretation of Premise Words to Hypothesis Words. Taken from Reasoning about Entailment with Neural Attention, 2016 4. Attention in Speech Recognition Given an input sequence of English speech snippets, output a sequence of phonemes. Attention is used to relate each phoneme in the output sequence to specific frames of audio in the input sequence. 
… a novel end-to-end trainable speech recognition architecture based on a hybrid attention mechanism which combines both content and location information in order to select the next position in the input sequence for decoding. One desirable property of the proposed model is that it can recognize utterances much longer than the ones it was trained on. — Attention-Based Models for Speech Recognition, 2015. Attentional Interpretation of Output Phoneme Location to Input Frames of Audio. Taken from Attention-Based Models for Speech Recognition, 2015 5. Attention in Text Summarization Given an input sequence of an English article, output a sequence of English words that summarize the input. Attention is used to relate each word in the output summary to specific words in the input document. … a neural attention-based model for abstractive summarization, based on recent developments in neural machine translation. We combine this probabilistic model with a generation algorithm which produces accurate abstractive summaries. — A Neural Attention Model for Abstractive Sentence Summarization, 2015 Attentional Interpretation of Words in the Input Document to the Output Summary. Taken from A Neural Attention Model for Abstractive Sentence Summarization, 2015. Further Reading This section provides additional resources if you would like to learn more about adding attention to LSTMs. Attention and memory in deep learning and NLP Attention Mechanism Survey on Attention-based Models Applied in NLP What is exactly the attention mechanism introduced to RNN? on Quora. What is Attention Mechanism in Neural Networks? Keras does not offer attention out of the box at the time of writing, but there are a few third-party implementations. See: Deep Language Modeling for Question Answering using Keras Attention Model Available! 
Keras Attention Mechanism Attention and Augmented Recurrent Neural Networks How to add Attention on top of a Recurrent Layer (Text Classification) Attention Mechanism Implementation Issue Implementing simple neural attention model (for padded inputs) Attention layer requires another PR seq2seq library Do you know of some good resources on attention in recurrent neural networks? Let me know in the comments. Summary In this post, you discovered the attention mechanism for sequence prediction problems with LSTM recurrent neural networks. Specifically, you learned: That the encoder-decoder architecture for recurrent neural networks uses a fixed-length internal representation that imposes a constraint that limits learning very long sequences. That attention overcomes the limitation in the encoder-decoder architecture by allowing the network to learn where to pay attention to the input for each item in the output sequence. That the approach has been used across different types of sequence prediction problems, including text translation, speech recognition, and more. Do you have any questions about attention in recurrent neural networks? Ask your questions in the comments below and I will do my best to answer.
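Deep Attributes Driven Multi-Camera Person Re-identification
2017-06-28 21:38:55

【Motivation】
The network design in this paper has three stages:
Stage 1: Fully-supervised dCNN training
Stage 2: Fine-tuning using attributes triplet loss
Stage 3: Final fine-tuning on the combined dataset

The attribute recognition method is semi-supervised learning, used to strengthen the network's recognition ability. That ability is weak to begin with because existing attribute datasets are very small, so a deep neural network cannot be trained sufficiently, and annotating such data by hand is difficult and time-consuming.

The paper first trains with supervision on the fully labeled pedestrian attribute data to obtain an initial attribute recognition network. At this point the network is still weak. How can its attribute recognition be improved further? The paper relies on one observation: images of the same person should yield similar attribute predictions. Based on this, the authors use a triplet loss function to improve attribute recognition at the instance level.

【Building the triplets】
1. select an anchor sample;
2. select another positive sample with the same person ID;
3. select a negative sample with a different person ID.

This stage trains the network so that the attribute outputs for the same person are as consistent as possible, while the attribute outputs of different persons are as far apart as possible. The authors call this the attribute triplet loss. In the objective, D(.) denotes a distance function between two binary attribute vectors, and in the corresponding loss function E is the number of triplets.

The authors note, however, that this loss may have problems: the person ID label is not strong enough to train the dCNN with accurate attributes. Without proper constraints, the above loss function may generate meaningless attribute labels and easily overfit the training dataset U.

They therefore add regularization terms to the loss. The resulting Eq. (4) both ensures that the same person has similar attributes and avoids meaningless attributes.

【Fine-tuning on the combined dataset】
The network fine-tuned in stage 2 predicts attributes for part of the unlabeled data. This newly labeled data and the original labeled data are then used together to fine-tune the attribute recognition network.

Finally, how are these attributes used for re-identification? It comes down to the distance between attribute vectors. In the words of the abstract: By directly using the deep attributes with simple Cosine distance, we have obtained surprisingly good accuracy on four person ReID datasets. Experiments also show that a simple distance metric learning modular further boosts our method, making it significantly outperform many recent works.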
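The two computations mentioned in these notes can be sketched as follows: a hinge-style triplet term on attribute vectors, and ranking a gallery by cosine distance at test time. The exact form of the paper's loss and its regularizer is not reproduced in the notes, so `attribute_triplet_loss` below is a generic hinge triplet with a hypothetical margin `theta`, and the attribute vectors are invented.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def attribute_triplet_loss(anchor, positive, negative, dist, theta=0.5):
    """Generic hinge triplet: zero once the positive is closer by at least theta."""
    return max(0.0, dist(anchor, positive) + theta - dist(anchor, negative))


def rank_gallery(query, gallery):
    """Return gallery IDs sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda pid: cosine_distance(query, gallery[pid]))


# Made-up binary attribute vectors (e.g. backpack, hat, long hair, jeans)
query = [1, 0, 1, 1]
gallery = {"A": [1, 0, 1, 0], "B": [0, 1, 0, 0], "C": [1, 0, 1, 1]}
ranking = rank_gallery(query, gallery)
print(ranking)
```

Because the distance function is passed in as `dist`, the same triplet helper works with a Hamming-style distance on binary attributes if that is preferred over cosine distance.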
NLP related basic knowledge with deep learning methods 2017-06-22 First things first >>>>>>>>>>>>>>>>>>>>>>>> Some great blogs: 1. https://github.com/udacity/deep-learning/blob/master/embeddings/Skip-Gram_word2vec.ipynb 2. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/ 3. http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/ 4. https://github.com/udacity/deep-learning/blob/master/sentiment-rnn/Sentiment_RNN.ipynb 5. https://github.com/mchablani/deep-learning Second >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Skip-Thought Vectors: 1. An unsupervised representation model that works at the sentence level with a seq2seq model ... The reason the method works is illustrated by the following figure: The method has two main parts, an encoder and a decoder. The difference from a standard seq2seq model is that there are two decoders here: one decodes the sentence before the current sentence and the other decodes the sentence after it. The training loss of the network is the sum of the losses of the two decoders: Another question for this method is how to handle words the network has never seen. The encoder turns text into features, but some words may never have appeared during training, so how should they be encoded? The paper uses word2vec: words are carried over from the word2vec embedding space through a mapping W, and the matrix W is learned with an L2 linear regression loss. reference paper: (1). http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf (2). blog: http://chuansong.me/n/478040352820 2.
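The vocabulary-expansion step for Skip-Thought Vectors can be sketched as an ordinary least-squares fit. This is a minimal illustration under stated assumptions: the data is synthetic, the dimensions and variable names (`X_w2v`, `X_rnn`, `W_true`) are hypothetical, and `np.linalg.lstsq` stands in for whatever solver the authors used to minimize the L2 regression loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 200 words shared by both vocabularies,
# 8-dimensional word2vec vectors, 5-dimensional encoder word embeddings.
n_shared, d_w2v, d_rnn = 200, 8, 5

X_w2v = rng.normal(size=(n_shared, d_w2v))   # word2vec vectors of shared words
W_true = rng.normal(size=(d_w2v, d_rnn))     # synthetic ground-truth mapping
X_rnn = X_w2v @ W_true                       # encoder embeddings of the same words

# Learn W by minimizing ||X_w2v @ W - X_rnn||^2 (un-regularized L2 regression).
W, *_ = np.linalg.lstsq(X_w2v, X_rnn, rcond=None)

# A word the encoder never saw, but which has a word2vec vector,
# can now be projected into the encoder's embedding space.
unseen_w2v = rng.normal(size=(d_w2v,))
mapped = unseen_w2v @ W
print(mapped.shape)
```

Because the mapping is linear, any word covered by the much larger word2vec vocabulary gets an encoder-compatible embedding without retraining the encoder.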
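Optical Flow Estimation using a Spatial Pyramid Network spynet This paper combines the classical spatial-pyramid formulation with deep learning to compute optical flow in a coarse-to-fine manner. This estimates large motions in a coarse to fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. A CNN is used to update the flow at each level, instead of minimizing an objective function as in traditional methods. Compared with FlowNet, the network does not need to handle large motions, because these are already dealt with by the pyramid. The main advantages of the method are: 1. our Spatial Pyramid Network is much simpler and 96% smaller than FlowNet in terms of model parameters. 2. since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped images is appropriate. 3. unlike FlowNet, the learned convolution filters appear similar to classical spatio-temporal filters, giving insight into the method and how to improve it. The main problem with existing methods: the two images are stacked together directly and fed into a CNN. When the motion between the two frames is larger than one or a few pixels, the spatio-temporal convolutional filters will not receive meaningful responses. In other words, if a convolutional window in one image does not overlap with related image pixels at the next time instant, no meaningful temporal filter can be learned. Two key problems need to be solved here: 1. long-range dependencies (large motions); 2. detailed, sub-pixel optical flow and precise motion boundaries. FlowNet tries to solve both problems in a single network, whereas this method uses a CNN for the second problem and an existing technique for the first. Approach: The paper handles large motions with a spatial pyramid, working from coarse to fine. The pipeline is as follows: Training the network G at a given level requires the initial flow produced by the levels below it. The ground truth used for training is the difference between the ground-truth flow and the upsampled flow from the next coarser level; the parameters of the current network are trained against this residual.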
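The coarse-to-fine update described above can be sketched in a few lines of NumPy. This is only a structural illustration under several assumptions: `residual_net` is a hypothetical placeholder for the trained CNN at each level (it returns zero residuals here), downsampling is plain 2x subsampling, warping is nearest-neighbor, and the image sides must be divisible by 2^(levels-1).

```python
import numpy as np

def downsample(img):
    """Halve the resolution by keeping every second pixel."""
    return img[::2, ::2]

def upsample_flow(flow):
    """Double the resolution of an (H, W, 2) flow field and rescale its vectors."""
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def warp(img, flow):
    """Nearest-neighbor backward warp of img by flow (x in channel 0, y in channel 1)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[src_y, src_x]

def residual_net(img1, warped_img2, flow_init):
    """Hypothetical stand-in for the per-level CNN; predicts a zero residual."""
    return np.zeros_like(flow_init)

def coarse_to_fine_flow(img1, img2, levels=3, net=residual_net):
    # Build image pyramids; index 0 is the finest level.
    pyr1, pyr2 = [img1], [img2]
    for _ in range(levels - 1):
        pyr1.append(downsample(pyr1[-1]))
        pyr2.append(downsample(pyr2[-1]))

    h, w = pyr1[-1].shape
    flow = np.zeros((h, w, 2))                # start with zero flow at the coarsest level
    for k in range(levels - 1, -1, -1):
        if k < levels - 1:
            flow = upsample_flow(flow)        # initial flow from the coarser level
        warped = warp(pyr2[k], flow)          # warp the second image by the current flow
        flow = flow + net(pyr1[k], warped, flow)  # add the predicted residual
    return flow

rng = np.random.default_rng(0)
frame1 = rng.random((16, 16))
frame2 = rng.random((16, 16))
flow = coarse_to_fine_flow(frame1, frame2)
print(flow.shape)
```

During training, the target for the network at a level would be the ground-truth flow at that resolution minus the upsampled flow from the coarser level, which matches the residual that `net` adds in the loop.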
Introductory guide to Generative Adversarial Networks (GANs) and their promise! Introduction Neural Networks have made great progress. They now recognize images and voice at levels comparable to humans. They are also able to understand natural language with good accuracy. But, even then, the talk of automating human tasks with machines looks a bit far-fetched. After all, we do much more than just recognizing images and voices or understanding what people around us are saying, don't we? Let us see a few examples where we need human creativity (at least as of now): Train an artificial author which can write an article and explain data science concepts to a community in a very simple manner by learning from past articles on Analytics Vidhya. You are not able to buy a painting from a famous painter because it might be too expensive. Can you create an artificial painter which can paint like any famous artist by learning from his / her past collections? Do you think these tasks can be accomplished by machines? Well, the answer might surprise you. These are definitely difficult tasks to automate, but Generative Adversarial Networks (GANs) have started making some of them possible. If you feel intimidated by the name GAN, don't worry! You will feel comfortable with them by the end of this article. In this article, I will introduce you to the concept of GANs and explain how they work, along with the challenges. I will also let you know of some cool things people have done using GANs and give you links to some of the important resources for getting deeper into these techniques. Excuse me, but what is a GAN? Yann LeCun, a prominent figure in the deep learning domain, said in his Quora session that “(GANs), and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion.” Surely he has a point. 
When I saw the implications Generative Adversarial Networks (GANs) could have if they were executed to their fullest extent, I was impressed too. But what is a GAN? Let us take an analogy to explain the concept: If you want to get better at something, say chess, what would you do? You would compete with an opponent better than you. Then you would analyze what you did wrong, what he / she did right, and think about what you could do to beat him / her in the next game. You would repeat this step until you defeat the opponent. This concept can be incorporated to build better models. So simply, to get a powerful hero (viz. generator), we need a more powerful opponent (viz. discriminator)! Another analogy from real life A slightly more realistic analogy is the relation between a forger and an investigator. The task of a forger is to create fraudulent imitations of original paintings by famous artists. If the created piece can pass as the original one, the forger gets a lot of money in exchange for the piece. On the other hand, an art investigator's task is to catch the forgers who create the fraudulent pieces. How does he do it? He knows which properties set the original artist apart and what kind of painting he would have created. He checks the piece in hand against this knowledge to decide whether it is real or not. This contest of forger vs investigator goes on, which ultimately produces world-class investigators (and unfortunately world-class forgers); a battle between good and evil. How do GANs work? We got a high-level overview of GANs. Now, we will go on to understand their nitty-gritty. As we saw, there are two main components of a GAN: the Generator Neural Network and the Discriminator Neural Network. The Generator Network takes a random input and tries to generate a sample of data. In the above image, we can see that generator G(z) takes an input z from p(z), where z is a sample from probability distribution p(z). 
It then generates data which is fed into a discriminator network D(x). The task of the Discriminator Network is to take input either from the real data or from the generator and try to predict whether the input is real or generated. It takes an input x from pdata(x), where pdata(x) is our real data distribution. D(x) then solves a binary classification problem using a sigmoid function, giving output in the range 0 to 1. Let us define the notations we will be using to formalize our GAN: pdata(x) -> the distribution of real data; x -> sample from pdata(x); p(z) -> the distribution of the generator's input noise; z -> sample from p(z); G(z) -> Generator Network; D(x) -> Discriminator Network. Now the training of a GAN is done (as we saw above) as a fight between generator and discriminator. This can be represented mathematically as $\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$. In our function V(D, G), the first term is the expected log-probability that the discriminator assigns to data from the real distribution pdata(x) (aka best case scenario). The discriminator tries to push D(x) towards 1 here. The second term covers data from the random input p(z), which passes through the generator to produce a fake sample, which is then passed through the discriminator to identify its fakeness (aka worst case scenario). In this term, the discriminator tries to push D(G(z)) towards 0, so that log(1 - D(G(z))) approaches its maximum of 0. So overall, the discriminator is trying to maximize our function V. On the other hand, the task of the generator is exactly the opposite, i.e. it tries to minimize the function V so that the difference between real and fake data is as small as possible. This, in other words, is a cat and mouse game between generator and discriminator! Note: This method of training a GAN is taken from game theory and is called the minimax game. Parts of training GAN So broadly a training phase has two main subparts and they are done sequentially Pass 1: Train discriminator and freeze generator (freezing means setting training as false. 
The network does only a forward pass and no backpropagation is applied) Pass 2: Train generator and freeze discriminator Steps to train a GAN Step 1: Define the problem. Do you want to generate fake images or fake text? Here you should completely define the problem and collect data for it. Step 2: Define the architecture of the GAN. Define what your GAN should look like. Should both your generator and discriminator be multi-layer perceptrons, or convolutional neural networks? This step will depend on what problem you are trying to solve. Step 3: Train the discriminator on real data for n epochs. Get the data you want to generate fakes of and train the discriminator to correctly predict it as real. Here n can be any positive integer. Step 4: Generate fake inputs with the generator and train the discriminator on fake data. Get generated data and let the discriminator correctly predict it as fake. Step 5: Train the generator with the output of the discriminator. Now that the discriminator is trained, you can get its predictions and use them as an objective for training the generator. Train the generator to fool the discriminator. Step 6: Repeat step 3 to step 5 for a few epochs. Step 7: Check the fake data manually to see whether it looks legitimate. If it seems appropriate, stop training, else go to step 3. This is a bit of a manual task, as hand-evaluating the data is the best way to check the fakeness. When this step is over, you can evaluate whether the GAN is performing well enough. Now just take a breath and look at what kind of implications this technique could have. If hypothetically you had a fully functional generator, you could duplicate almost anything. To give you examples, you could generate fake news, create books and novels with unimaginable stories, provide on-call support, and much more. You could have artificial intelligence that is close to reality; a true artificial intelligence! That's the dream!! 
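To make the minimax game more concrete, here is a minimal sketch that estimates the standard GAN value function V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] from a batch of discriminator outputs. The function name `value_function`, the small `eps` added for numerical safety, and the toy probabilities are assumptions for illustration; no networks are trained here.

```python
import numpy as np

def value_function(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples, values in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, values in (0, 1)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# A discriminator that separates real from fake well: V is close to its maximum of 0.
confident = value_function(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])

# A discriminator that has been fooled and outputs 0.5 everywhere: V = 2 * log(0.5).
fooled = value_function(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident, fooled)
```

The discriminator's updates push V up towards the first situation, while the generator's updates push it down towards the second, which is the alternation described in Pass 1 and Pass 2.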
Challenges with GANs You may ask, if we know what these beautiful creatures (monsters?) could do, why hasn't something happened? This is because we have barely scratched the surface. There are so many roadblocks to building a “good enough” GAN and we haven't cleared many of them yet. There's a whole area of research out there just to find “how to train a GAN”. The most important roadblock while training a GAN is stability. If you start to train a GAN and the discriminator is much more powerful than its generator counterpart, the generator will fail to train effectively. This will in turn affect the training of your GAN. On the other hand, if the discriminator is too lenient, it will let literally any image be generated, and this will mean that your GAN is useless. Another way to look at the stability of a GAN is as a holistic convergence problem. Both generator and discriminator are fighting against each other to get one step ahead of the other. Also, they depend on each other for efficient training. If one of them fails, the whole system fails. So you have to make sure they don't explode. This is kind of like the shadow in the Prince of Persia game. You have to defend yourself from the shadow, which tries to kill you. If you kill the shadow you die, but if you don't do anything, you will definitely die! There are other problems too, which I will list here. (Reference: http://www.iangoodfellow.com/slides/2016-12-04-NIPS.pdf) Note: The images mentioned below are generated by a GAN trained on the ImageNet dataset. Problem with Counting: GANs fail to differentiate how many of a particular object should occur at a location. As we can see below, it gives more eyes in the head than are naturally present. Problems with Perspective: GANs fail to adapt to 3D objects. They don't understand perspective, i.e. the difference between front view and back view. As we can see below, it gives a flat (2D) representation of 3D objects. 
Problems with Global Structures: As with the problem with perspective, GANs do not understand holistic structure. For example, in the bottom left image, it gives a generated image of a quadruped cow, i.e. a cow standing on its hind legs and simultaneously on all four legs. That is definitely not possible in real life! Substantial research is being done to take care of these problems. Newer types of models have been proposed which give more accurate results than previous techniques, such as DCGAN, WassersteinGAN etc. Implementing a Toy GAN Let's see a toy implementation of a GAN to strengthen our theory. We will try to generate digits by training a GAN on the Identify the Digits dataset. A bit about the dataset: it contains 28×28 images which are black and white. All the images are in “.png” format. For our task, we will only work on the training set. You can download the dataset from here. You also need to set up the libraries, namely numpy, pandas, tensorflow, keras and keras_adversarial. Before starting with the code, let us understand the internal working through pseudocode. Pseudocode for GAN training can be thought of as follows Source: http://papers.nips.cc/paper/5423-generative-adversarial Note: This is the first implementation of GAN that was published in the paper. Numerous improvements/updates to the pseudocode can be seen in recent papers, such as adding batch normalization in the generator and discriminator networks, training the generator k times etc. Now let's start with the code! 
Let us first import all the modules

    # import modules
    %pylab inline

    import os
    import numpy as np
    import pandas as pd
    # note: scipy.misc.imread was removed in SciPy 1.2, so this import needs an older SciPy
    from scipy.misc import imread

    import keras
    from keras.models import Sequential
    from keras.layers import Dense, Flatten, Reshape, InputLayer
    from keras.regularizers import L1L2

To make the randomness reproducible, we set a seed value

    # fix the seed of the random number generator
    seed = 128
    rng = np.random.RandomState(seed)

We set the path of our data and working directory

    # set path
    root_dir = os.path.abspath('.')
    data_dir = os.path.join(root_dir, 'Data')

Let us load our data

    # load data
    train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
    test = pd.read_csv(os.path.join(data_dir, 'test.csv'))

    temp = []
    for img_name in train.filename:
        image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
        img = imread(image_path, flatten=True)   # read as a grayscale image
        img = img.astype('float32')
        temp.append(img)

    train_x = np.stack(temp)
    train_x = train_x / 255.   # scale pixel values to [0, 1]

To visualize what our data looks like, let us plot one of the images

    # print image
    img_name = rng.choice(train.filename)
    filepath = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = imread(filepath, flatten=True)
    pylab.imshow(img, cmap='gray')
    pylab.axis('off')
    pylab.show()

Define variables which we will be using later

    # define variables
    g_input_shape = 100
    d_input_shape = (28, 28)
    hidden_1_num_units = 500
    hidden_2_num_units = 500
    g_output_num_units = 784
    d_output_num_units = 1
    epochs = 25
    batch_size = 128

Now define our generator and discriminator networks

    # generator: maps a 100-dimensional noise vector to a 28x28 image
    model_1 = Sequential([
        Dense(units=hidden_1_num_units, input_dim=g_input_shape, activation='relu', kernel_regularizer=L1L2(1e-5, 1e-5)),
        Dense(units=hidden_2_num_units, activation='relu', kernel_regularizer=L1L2(1e-5, 1e-5)),
        Dense(units=g_output_num_units, activation='sigmoid', kernel_regularizer=L1L2(1e-5, 1e-5)),
        Reshape(d_input_shape),
    ])

    # discriminator: maps a 28x28 image to the probability that it is real
    model_2 = Sequential([
        InputLayer(input_shape=d_input_shape),
        Flatten(),
        Dense(units=hidden_1_num_units, activation='relu', kernel_regularizer=L1L2(1e-5, 1e-5)),
        Dense(units=hidden_2_num_units, activation='relu', kernel_regularizer=L1L2(1e-5, 1e-5)),
        Dense(units=d_output_num_units, activation='sigmoid', kernel_regularizer=L1L2(1e-5, 1e-5)),
    ])

Here is the architecture of our networks

We will then define our GAN; for that we will first import a few important modules

    from keras_adversarial import AdversarialModel, simple_gan, gan_targets
    from keras_adversarial import AdversarialOptimizerSimultaneous, normal_latent_sampling

Let us compile our GAN and start the training

    gan = simple_gan(model_1, model_2, normal_latent_sampling((100,)))
    model = AdversarialModel(base_model=gan, player_params=[model_1.trainable_weights, model_2.trainable_weights])
    model.adversarial_compile(adversarial_optimizer=AdversarialOptimizerSimultaneous(), player_optimizers=['adam', 'adam'], loss='binary_crossentropy')

    history = model.fit(x=train_x, y=gan_targets(train_x.shape[0]), epochs=10, batch_size=batch_size)

Here's what our GAN looks like. After training for 10 epochs, we get a graph like this:

    plt.plot(history.history['player_0_loss'])
    plt.plot(history.history['player_1_loss'])
    plt.plot(history.history['loss'])

After training for 100 epochs, I got the following generated images

    # sample 10 noise vectors and show the generated digits
    zsamples = np.random.normal(size=(10, 100))
    pred = model_1.predict(zsamples)
    for i in range(pred.shape[0]):
        plt.imshow(pred[i, :], cmap='gray')
        plt.show()

And voila! You have built your first generative model! Applications of GAN We saw an overview of how these things work and got to know the challenges of training them. 
We will now see the cutting-edge research that has been done using GANs. Predicting the next frame in a video: You train a GAN on video sequences and let it predict what would occur next. Paper: https://arxiv.org/pdf/1511.06380.pdf Increasing Resolution of an image: Generate a high resolution photo from a comparatively low resolution one. Paper: https://arxiv.org/pdf/1609.04802.pdf Interactive Image Generation: Draw simple strokes and let the GAN draw an impressive picture for you! Link: https://github.com/junyanz/iGAN Image to Image Translation: Generate an image from another image. For example, given on the left, you have labels of a street scene and you can generate a real-looking photo with a GAN. On the right, you give a simple drawing of a handbag and you get a real-looking drawing of a handbag. Paper: https://arxiv.org/pdf/1611.07004.pdf Text to Image Generation: Just tell your GAN what you want to see and get a realistic photo of the target. Paper: https://arxiv.org/pdf/1605.05396.pdf Resources Here are some resources which you might find helpful for going more in-depth on GANs: List of Papers published on GANs; A Brief Chapter on Deep Generative Modelling; Workshop on Generative Adversarial Network by Ian Goodfellow; NIPS 2016 Workshop on Adversarial Training. End Notes Phew! I hope you are now as excited about the future as I was when I first read about GANs. They are set to change what machines can do for us. Think of it: from preparing new food recipes to creating drawings, the possibilities are endless. In this article, I tried to give a general overview of GANs and their applications. GANs are a very exciting area, which is why researchers are so excited about building generative models, and new papers on GANs are coming out more and more frequently. If you have any questions on GANs, please feel free to share them with me through the comments.
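DualGAN: Unsupervised Dual Learning for Image-to-Image Translation 2017-06-12 21:29:06 Introduction: This paper proposes a GAN architecture based on dual learning for image-to-image translation. Most existing image translation methods require paired images, but in some real scenarios such pairs are hard to obtain. Translating between images by exploiting the relationship between multiple domains, without needing image pairs, would be a very cool piece of work, and this paper does it by combining GANs with dual learning. Judging from the results, it works reasonably well. About dual learning: the idea mainly follows a NIPS 2016 paper on machine translation. The aim is to make the translation from domain A to domain B form a closed loop, and to optimize the learning objective by minimizing the loss between the original image and its reconstruction. The same applies here: given an image A from one domain, a generator P produces the corresponding image in domain B. Since there is no image paired with A, there is no ground truth. How, then, do we measure the quality of the generated image P(A, z)? If the image is generated well, then another generator Q should be able to recover the original from it, i.e. Q(P(A, z), z') should be similar to A, which gives the reconstruction error || Q(P(A, z), z') - A ||. The same holds for an image B from the other domain, which gives a second reconstruction error. So, besides minimizing the losses of the two generators, the two reconstruction errors must also be taken into account, which keeps the final translation results reliable. ==>> Training Target: 1. use an L1 loss to keep the images as sharp as possible; 2. use two GANs to translate between the domains;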
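The two reconstruction errors can be sketched as follows. This is an illustration under loud assumptions: `P` and `Q` are hypothetical toy functions standing in for the two generator networks (chosen to be exact inverses of each other), the noise inputs z and z' are omitted, and the mean absolute difference is used as the L1 reconstruction loss.

```python
import numpy as np

def reconstruction_loss(original, reconstructed):
    """Mean absolute (L1) difference between an image and its reconstruction."""
    return np.mean(np.abs(reconstructed - original))

# Hypothetical stand-ins for the two generators; real ones would be networks.
def P(a):
    """Toy translation from domain A to domain B."""
    return 2.0 * a + 1.0

def Q(b):
    """Toy translation from domain B back to domain A."""
    return (b - 1.0) / 2.0

A = np.array([[0.0, 0.5], [1.0, 0.25]])   # toy "image" from domain A
B = np.array([[1.0, 3.0], [2.0, 1.5]])    # toy "image" from domain B

loss_A = reconstruction_loss(A, Q(P(A)))  # A -> B -> A
loss_B = reconstruction_loss(B, P(Q(B)))  # B -> A -> B
print(loss_A, loss_B)
```

In training, these two terms would be added to the adversarial losses of the two GANs, so that a generator cannot fool its discriminator with an output that has lost the content of the input.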
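SST: Single-Stream Temporal Action Proposals 2017-06-11 14:28:00 This paper proposes a method for generating proposals along the temporal dimension for action detection. The method has the following characteristics: 1. it can handle long video sequences, processing the whole video in a single forward pass; it can handle videos of arbitrary length without processing overlapping temporal windows; 2. it achieves state-of-the-art results on the proposal generation task; 3. SST proposals provide a strong baseline for temporal action localization, and plugging the method into existing classification pipelines improves classification performance. The pipeline of the proposed method is as follows: Technical Approach: The goal is to generate temporal action proposals on a long video. The important parts of the network: 1. Visual Encoder (C3D), which encodes the video frames and perceives the input video; 2. Seq. Encoder (GRU), whose input is the dimension-reduced C3D feature. The purpose of this module is to accumulate evidence across time as the video sequence progresses. To produce good proposals, the module should gather information until it is confident that an action has taken place, while discarding irrelevant background information. Training: Since action recognition is itself a multi-class classification problem, a cross-entropy loss is used as the final loss function, and the total loss is the sum of these losses: The dataset provides trimmed videos, so this is a supervised training task with given ground truth and can be trained entirely with backpropagation. Reference: 1. Paper: http://vision.stanford.edu/pdf/buch2017cvpr.pdf 2. Github: https://github.com/ranjaykrishna/SST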
Paper notes: Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning 2017-06-06 21:43:53 The motivation of this paper comes from MDNet: The framework proposed in the paper is:
Tutorial: Generate BBox or Rectangle to locate the target object

    clc; close all; clear all;
    Img = imread('/home/wangxiao/Documents/files/Visual_Tracking/MDNet-CVPR2016/MDNet-master/attentionMap/Basketball/0001.png');
    % convert to grayscale if the image has three channels
    if ndims(Img) == 3
        I = rgb2gray(Img);
    else
        I = Img;
    end
    I = im2bw(I, graythresh(I));   % binarize with Otsu's threshold
    [m, n] = size(I);
    imshow(I); title('binary image');
    txt = get(gca, 'Title');
    set(txt, 'fontsize', 16);
    L = bwlabel(I);                % label connected components
    stats = regionprops(L, 'all');
    set(gcf, 'color', 'w');
    set(gca, 'units', 'pixels', 'Visible', 'off');
    q = get(gca, 'position');
    q(1) = 0;   % set the left offset of the axes to zero
    q(2) = 0;   % set the bottom offset of the axes to zero
    set(gca, 'position', q);
    % draw the bounding box and centroid of every connected region
    for i = 1:length(stats)
        hold on;
        rectangle('position', stats(i).BoundingBox, 'edgecolor', 'y', 'linewidth', 2);
        temp = stats(i).Centroid;
        plot(temp(1), temp(2), 'r.');
        drawnow;
    end
    frame = getframe(gcf, [0, 0, n, m]);
    im = frame2im(frame);
    imwrite(im, 'a.jpg', 'jpg');   % the output format can be changed

Reference: 1. http://blog.sina.com.cn/s/blog_4d633dc70100o0r0.html
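About beam search. Beam search was not explained clearly in an earlier group meeting, so here is an example that illustrates this search algorithm. At test time in image captioning, one of two search strategies is usually used to obtain the output sentence. The first is greedy search: at each time step, the word with the highest output probability is chosen as the output for that step. The other commonly used strategy is beam search. Here is an example borrowed from Zhihu: suppose the vocabulary has size 3 and contains a, b, c, and the beam size is 2. During decoding: 1: When generating the first word, keep the 2 words with the highest probability, say a and c; the current sequences are then a and c. 2: When generating the second word, combine each of the current sequences a and c with every word in the vocabulary, giving 6 new sequences aa ab ac ca cb cc, then keep the 2 with the highest scores as the current sequences, say aa and cb. 3: Repeat this process until the end-of-sequence token is reached, and finally output the 2 highest-scoring sequences. In this way one image yields 2 candidate captions. Since each sequence carries the product of the probabilities of its words, the best caption can then be selected. Below is a screenshot from a YouTube lecture explaining the concept: References: https://www.zhihu.com/question/54356960 https://www.youtube.com/watch?v=UXW6Cs82UKo https://en.wikipedia.org/wiki/Beam_search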
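The a, b, c example above can be turned into a short runnable sketch. Everything model-related here is made up for illustration: `toy_model` is a hypothetical stand-in for a caption decoder, its probabilities are invented so that the beam keeps a and c first and then aa and cb, and `<eos>` is an assumed end-of-sequence token. Scores are sums of log-probabilities, which is equivalent to the product of probabilities mentioned in the text.

```python
import math

def beam_search(step_log_probs, beam_size=2, eos="<eos>", max_len=10):
    """Generic beam search.

    step_log_probs(prefix) must return a dict {token: log_prob} for the next token.
    Returns up to beam_size (sequence, score) pairs, best first.
    """
    beams = [((), 0.0)]          # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live sequence with every possible next token.
        candidates = []
        for seq, score in beams:
            for token, logp in step_log_probs(seq).items():
                candidates.append((seq + (token,), score + logp))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)       # sequences cut off by max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)[:beam_size]

def toy_model(prefix):
    """Hypothetical next-token distribution over the vocabulary {a, b, c}."""
    if len(prefix) == 0:
        probs = {"a": 0.5, "b": 0.1, "c": 0.4}
    elif len(prefix) == 1:
        if prefix[-1] == "c":
            probs = {"a": 0.3, "b": 0.6, "c": 0.1}
        else:
            probs = {"a": 0.4, "b": 0.3, "c": 0.3}
    else:
        probs = {"<eos>": 1.0}   # every sequence ends after two words
    return {token: math.log(p) for token, p in probs.items()}

results = beam_search(toy_model, beam_size=2)
for seq, score in results:
    print(" ".join(seq), round(math.exp(score), 4))
```

With these toy numbers, greedy search would commit to a at the first step and end with probability 0.2, while the beam also keeps c alive and finds c b with probability 0.24, which is the benefit of carrying more than one hypothesis.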
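End-to-End Learning of Action Detection from Frame Glimpses in Videos CVPR 2016 Motivation: This paper uses an attention model that selects which frames to look at in order to assist action detection. The authors regard action detection in long, real-world videos as a very challenging vision problem. An algorithm must infer whether an action occurs and must also determine when in time it occurs. Most existing work builds frame-level classifiers, runs them at multiple temporal scales, and applies post-processing such as duration priors and non-maximum suppression. Such indirect action localization leaves room for improvement in both computational cost and accuracy. The paper proposes an end-to-end approach to action detection that directly infers the temporal bounds of actions. The key intuition is that detecting an action is a process of continuous, iterative observation and refinement: we can sequentially decide where to look and how to refine our hypotheses, so as to localize actions accurately with as little search as possible. Based on this idea, the authors propose a single coherent model that takes a long video as input and outputs the temporal bounds of detected action instances. The model can be viewed as an agent that learns a policy for sequentially forming and refining hypotheses about action instances. The algorithm builds on the Recurrent Models of Visual Attention paper, but action detection poses a new challenge: how to handle a variable-sized set of structured detection outputs? To address this, the proposed model decides both which frame to observe next and when to emit a prediction, and a reward mechanism is designed so that this policy can be learned. The paper is the first to propose an end-to-end approach to learning action detection in videos. Methods: The network consists of two parts: an observation network and a recurrent network. The observation network encodes visual representations of video frames, while the RNN processes these observations and decides which frame to observe next and when to emit (i.e. when to output an action prediction). 1. Observation Network: As shown in Figure 2, at each time step the observation network observes one video frame, encodes it into a feature vector $o_n$, and passes it to the RNN as input. Importantly, $o_n$ encodes both where and what: where in the video an observation was taken and what was seen. The inputs to the observation network are the normalized temporal location of the observation, $l_n \in [0, 1]$, and the corresponding video frame $v_{l_n}$. 2. 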
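Recurrent Network: The RNN $f_h$ forms the core of the agent. As shown in Figure 2, its input at each time step n is the observation feature vector $o_n$. The hidden state $h_n$ of the network is a function of $o_n$ and the previous state $h_{n-1}$, and models temporal hypotheses about action instances. As the agent reasons over the video, it produces three outputs at each time step: a candidate detection $d_n$; a binary indicator $p_n$ signaling whether to emit $d_n$ as a prediction; and a temporal location $l_{n+1}$ indicating the video frame to observe next. Candidate detection: a candidate detection $d_n$ is obtained as $d_n = f_d(h_n; \theta_d)$, where $f_d$ is a fully connected layer. $d_n$ is a tuple $(s_n, e_n, c_n) \in [0, 1]^3$, where $s_n$ and $e_n$ are the normalized start and end locations of the detection and $c_n$ is the confidence of the candidate. The candidate detection represents the agent's hypothesis about the current action instance. Not every time step yields an emitted detection, however, because that would lead to a lot of noise and many false positives. Instead, the agent uses a separate prediction indicator output to signal that a candidate detection should be emitted as a prediction. Prediction indicator: the binary prediction indicator $p_n$ signals that the corresponding candidate detection $d_n$ should be emitted as a prediction. $p_n = f_p(h_n; \theta_p)$, where $f_p$ is a fully connected layer followed by a sigmoid nonlinearity. During training, $f_p$ is used to parameterize a Bernoulli distribution from which $p_n$ is sampled; at test time, the maximum a posteriori prediction is used. The combination of candidate detection and prediction indicator is crucial for the detection problem, because positive instances may occur anywhere in the video or may not occur at all. This ensures that the network can ... Location of next observation: the temporal location $l_{n+1} \in [0, 1]$ is the video frame the agent chooses to observe at the next time step. The location is unconstrained, so the agent can skip freely around the video. It is computed as $l_{n+1} = f_l(h_n; \theta_l)$, where $f_l$ is a fully connected layer, so the agent's decision is a function of its history of past observations and their temporal locations. During training, $l_{n+1}$ is sampled from a Gaussian distribution; at test time, the maximum a posteriori estimate is used. Training: The final goal is to output a set of detected actions. To achieve this, the three outputs produced at each time step must be trained: candidate detection, prediction indicator, and next observation location. Given supervision in the form of temporal action annotations in long videos, training involves the following challenges: 1. suitable loss; 2. reward function; 3. handling non-differentiable model components. REINFORCE is used to train $p_n$ and $l_{n+1}$, and supervised learning is used to train $d_n$. 1. 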
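Candidate detections: Candidate detections are trained with backpropagation so as to maximize the correctness of every candidate. We want to maximize correctness regardless of whether a candidate is ultimately emitted, because the candidates encode the agent's hypotheses about actions. This requires matching each candidate to a ground-truth instance during training. We rely on one observation: at each timestep, the agent should form a hypothesis around the action instance (if any) nearest its current location in the video. This lets us design a simple and effective matching algorithm. Matching to ground truth. Given a set of candidate detections D produced by the RNN over N time steps, and ground-truth action instances $g_{1, \dots, M}$, each candidate is matched to one ground-truth instance, or to none if M = 0. The matching function is defined as follows: In other words, Loss function: Once the candidate detections have been matched to ground-truth instances, we optimize a multi-task classification and localization loss function over the set D: Here the classification term $L_{cls}(d_n)$ is the standard cross-entropy loss on the detection confidence $c_n$: when $d_n$ is matched to a ground-truth instance, the confidence is encouraged to be close to 1, and otherwise close to 0. The matching function is not that easy to understand; pay attention to every detail, or it is easy to get lost. The loss function is optimized with backpropagation. 2. Observation and emission sequences: the observation location and prediction indicator outputs are non-differentiable and cannot be optimized with backpropagation. The authors then give a brief introduction to RL and point out why the objective is hard to optimize directly: this leads to a non-trivial optimization problem due to the high-dimensional space of possible action sequences. REINFORCE can handle this because it learns the network parameters through Monte Carlo sampling, which gives an estimate of the gradient: A baseline is usually subtracted to reduce the variance: REINFORCE learns model parameters according to this approximate gradient. The log-probability log πθ of actions leading to high future reward are increased, and those leading to low reward are decreased. Model parameters can then be updated using backpropagation. Reward Function: Training with REINFORCE requires a suitable reward function. The goal is to learn a policy for the location and prediction indicator outputs that yields action detections with high recall and high precision. The reward function is therefore designed to favor policies that maximize true positive detections while minimizing false positives: All reward is provided at the Nth time step and is 0 for n < N, because we want to learn policies that jointly lead to high overall detection performance. 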
M is the number of ground-truth action instances and $N_p$ is the number of predictions emitted by the agent. $N_+$ is the number of true positive predictions, $N_-$ is the number of false positive predictions, and $R_+$ and $R_-$ are the positive and negative rewards contributed by these predictions. A prediction is considered correct if its overlap with a ground truth is both greater than a threshold and higher than that of any other prediction.
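The REINFORCE update with a baseline and the terminal reward can be sketched for the Bernoulli prediction indicator alone. This is a simplified illustration, not the paper's implementation: the policy has a single hypothetical logit `theta` instead of an RNN, the reward is assumed to have the form $N_+ R_+ + N_- R_-$ with made-up values for $R_+$ and $R_-$, and the baseline is simply the mean return of the sampled episodes. The estimate being computed is $\nabla J \approx \frac{1}{K} \sum_k \sum_n \nabla_\theta \log \pi_\theta(a_n^k) (R^k - b)$.

```python
import numpy as np

def terminal_reward(n_true_pos, n_false_pos, r_pos=1.0, r_neg=-1.0):
    """Reward given at the last step; r_pos and r_neg are hypothetical values."""
    return n_true_pos * r_pos + n_false_pos * r_neg

def reinforce_gradient(theta, actions, returns):
    """Monte Carlo policy-gradient estimate for a Bernoulli emission policy.

    theta:   scalar logit; the emit probability is sigmoid(theta)
    actions: (K, N) array of 0/1 emit decisions from K sampled episodes
    returns: (K,) array with the terminal reward of each episode
    """
    p = 1.0 / (1.0 + np.exp(-theta))
    # For a Bernoulli policy with logit theta: d/dtheta log pi(a) = a - p.
    grad_log_pi = (actions - p).sum(axis=1)
    baseline = returns.mean()                 # subtracting it reduces variance
    return np.mean(grad_log_pi * (returns - baseline))

# Two sampled episodes of two steps each.
actions = np.array([[1, 1],    # emitted twice, both predictions were correct
                    [0, 0]])   # never emitted
returns = np.array([terminal_reward(2, 0), terminal_reward(0, 0)])

grad = reinforce_gradient(theta=0.0, actions=actions, returns=returns)
print(grad)
```

Here the episode that emitted collected more reward than average, so the gradient is positive and a gradient-ascent step on `theta` raises the probability of emitting.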
Awesome Courses Introduction There is a lot of hidden treasure lying within university pages scattered across the internet. This list is an attempt to bring to light those awesome courses which make their high-quality material i.e. assignments, lectures, notes, readings & examinations available online for free. Table of Contents Algorithms Artificial Intelligence Computer Graphics CS Theory Introduction to CS Machine Learning Misc Programming Languages / Compilers Security Systems Legend - Lecture Videos - Lecture Notes - Assignments / Labs - Readings Courses Systems CS 61C Great Ideas in Computer Architecture (Machine Structures) UC Berkeley The subjects covered in this course include: C and assembly language programming, translation of high-level programs into machine language, computer organization, caches, performance measurement, parallelism, CPU design, warehouse-scale computing, and related topics. Lecture Videos Lecture Notes Resources Old Exams CS 107 Computer Organization & Systems Stanford University CS107 is the third course in Stanford's introductory programming sequence. The course will work from the C programming language down to the microprocessor to de-mystify the machine. With a complete understanding of how computer systems execute programs and manipulate data, you will become a more effective programmer, especially in dealing with issues of debugging, performance, portability, and robustness. Lecture Videos Assignments CS 140 Operating Systems Stanford University This class introduces the basic facilities provided in modern operating systems. The course divides into three major sections. The first part of the course discusses concurrency. The second part of the course addresses the problem of memory management. The third major part of the course concerns file systems. Lecture Notes Assignments 6.004 Computation Structures MIT Introduces architecture of digital systems, emphasizing structural principles common to a wide range of technologies. 
Multilevel implementation strategies; definition of new primitives (e.g., gates, instructions, procedures, processes) and their mechanization using lower-level elements. Analysis of potential concurrency; precedence constraints and performance measures; pipelined and multidimensional systems. Instruction set design issues; architectural support for contemporary software structures. 4 Engineering Design Points. 6.004 offers an introduction to the engineering of digital systems. Starting with MOS transistors, the course develops a series of building blocks: logic gates, combinational and sequential circuits, finite-state machines, computers and finally complete systems. Both hardware and software mechanisms are explored through a series of design examples. Youtube Playlist Lecture Notes Labs-Assignments CS 162 Operating Systems and Systems Programming UC Berkeley The purpose of this course is to teach the design of operating systems and operating systems concepts that appear in other advanced systems. Topics we will cover include concepts of operating systems, systems programming, networked and distributed systems, and storage systems, including multiple-program systems (processes, interprocess communication, and synchronization), memory allocation (segmentation, paging), resource allocation and scheduling, file systems, basic networking (sockets, layering, APIs, reliability), transactions, security, and privacy. Operating Systems course by the Chair of EECS, UC Berkeley David Culler Youtube Playlist Spring 2015 lectures Lecture Notes Spring 2015 lectures CS 168 Introduction to the Internet: Architecture and Protocols UC Berkeley This course is an introduction to the Internet architecture. We will focus on the concepts and fundamental design principles that have contributed to the Internet's scalability and robustness and survey the various protocols and algorithms used within this architecture. 
Topics include layering, addressing, intradomain routing, interdomain routing, reliable delivery, congestion control, and the core protocols (e.g., TCP, UDP, IP, DNS, and HTTP) and network technologies (e.g., Ethernet, wireless). Lecture Notes & Assignments Discussion Notes CS 179 GPU Programming Caltech This course will cover programming techniques for the GPU. The course will introduce NVIDIA's parallel computing language, CUDA. Beyond covering the CUDA programming model and syntax, the course will also discuss GPU architecture, high performance computing on GPUs, parallel algorithms, CUDA libraries, and applications of GPU computing. Assignments Lecture Notes CS 186 Introduction to Database Systems UC Berkeley In the project assignments in CS186, you will write a basic database management system called SimpleDB. For this project, you will focus on implementing the core modules required to access stored data on disk; in future projects, you will add support for various query processing operators, as well as transactions, locking, and concurrent queries. Youtube Playlist Lecture Notes Projects CS 241 Systems Programming (Spring 2016) Univ of Illinois, Urbana-Champaign System programming refers to writing code that tasks advantage of operating system support for programmers. This course is designed to introduce you to system programming. By the end of this course, you should be proficient at writing programs that take full advantage of operating system support. To be concrete, we need to fix an operating system and we need to choose a programming language for writing programs. We chose the C language running on a Linux/UNIX operating system (which implements the POSIX standard interface between the programmer and the OS). 
Assignments Labs Github Page Crowd Sourced Book CS 425 Distributed Systems Univ of Illinois, Urbana-Champaign Brilliant set of lectures and reading material covering fundamental concepts in distributed systems such as Vector clocks, Consensus and Paxos. This is the 2016 version by Prof Indranil Gupta. Lectures Assignments CS 452 Real-Time Programming University of Waterloo Write a real-time OS microkernel in C, and application code to operate a model train set in response to real-time sensor information. The communication with the train set runs at 2400 baud so it takes about 61 milliseconds to ask all of the sensors for data about the train's possible location. This makes it particularly challenging because a train can move about 3 centimeters in that time. One of the most challenging and time-consuming courses at the University of Waterloo. Assignments Lecture notes CS 2043 Unix Tools & Scripting Cornell University UNIX-like systems are increasingly being used on personal computers, mobile phones, web servers, and many other systems. They represent a wonderful family of programming environments useful both to computer scientists and to people in many other fields, such as computational biology and computational linguistics, in which data is naturally represented by strings. This course provides an intensive training to develop skills in Unix command line tools and scripting that enable the accomplishment and automation of large and challenging computing tasks. The syllabus takes students from shell basics and piping, to regular-expression processing tools, to shell scripting and Python. Syllabus Lectures Assignments CS 3410 Computer System Organization and Programming Cornell University CS3410 provides an introduction to computer organization, systems programming and the hardware/software interface. 
Topics include instruction sets, computer arithmetic, datapath design, data formats, addressing modes, memory hierarchies including caches and virtual memory, I/O devices, bus-based I/O systems, and multicore architectures. Students learn assembly language programming and design a pipelined RISC processor. Lectures Assignments CS 4410 Operating Systems Cornell University CS 4410 covers systems programming and introductory operating system design and implementation. We will cover the basics of operating systems, namely structure, concurrency, scheduling, synchronization, memory management, filesystems, security and networking. The course is open to any undergraduate who has mastered the material in CS3410/ECE3140. Syllabus Lectures CS 4414 Operating Systems University of Virginia A course (that) covers topics including: Analysis process communication and synchronization; resource management; virtual memory management algorithms; file systems; and networking and distributed systems. The primary goal of this course is to improve your ability to build scalable, robust and secure computing systems. It focuses on doing that by understanding what underlies the core abstractions of modern computer systems. Syllabus Lectures CS 5412 Cloud Computing Cornell University Taught by one of the stalwarts of this field, Prof Ken Birman, this course has a fantastic set of slides that one can go through. The Prof's book is also a gem and recommended as a must read in Google's tutorial on Distributed System Design Slides CSCE 3613 Operating Systems University of Arkansas (Fayetteville) - An introduction to operating systems including topics in system structures, process management, storage management, files, distributed systems, and case studies. Syllabus Assignments Lecture Notes Readings CSCI-UA.0202: Operating Systems (Undergrad) Operating Systems NYU NYU's operating system course. 
It's a fundamental course focusing on the basic ideas of operating systems, including memory management, process scheduling, file systems, etc. It also includes some recommended reading materials. What's more, there is a series of hands-on lab materials to help you understand operating systems. Assignments Lectures Old Exams CSCI 360 Computer Architecture 3 CUNY Hunter College A course that covers cache design, buses, memory hierarchies, processor-peripheral interfaces, and multiprocessors, including GPUs. CSCI 493.66 UNIX System Programming (formerly UNIX Tools) CUNY Hunter College A course that is mostly about writing programs against the UNIX API, covering all of the basic parts of the kernel interface and libraries, including files, processes, terminal control, signals, and threading. CSCI 493.75 Parallel Computing CUNY Hunter College The course is an introduction to parallel algorithms and parallel programming in C and C++, using the Message Passing Interface (MPI) and the OpenMP application programming interface. It also includes a brief introduction to parallel architectures and interconnection networks. It is both theoretical and practical, including material on design methodology, performance analysis, and mathematical concepts, as well as details on programming using MPI and OpenMP. Hack the Kernel Introduction to Operating Systems SUNY University at Buffalo, NY This course is an introduction to operating system design and implementation. We study operating systems because they are examples of mature and elegant solutions to a difficult design problem: how to safely and efficiently share system resources and provide abstractions useful to applications. For the processor, memory, and disks, we discuss how the operating system allocates each resource and explore the design and implementation of related abstractions. We also establish techniques for testing and improving system performance and introduce the idea of hardware virtualization. 
Programming assignments provide hands-on experience with implementing core operating system components in a realistic development environment. Course by Dr.Geoffrey Challen Syllabus Slides Video lectures Assignments Old Exams ECE 459 Programming for Performance University of Waterloo Learn techniques for profiling, rearchitecting, and implementing software systems that can handle industrial-sized inputs, and to design and build critical software infrastructure. Learn performance optimization through parallelization, multithreading, async I/O, vectorization and GPU programming, and distributed computing. Lecture slides PODC Principles of Distributed Computing ETH-Zurich Explore essential algorithmic ideas and lower bound techniques, basically the "pearls" of distributed computing in an easy-to-read set of lecture notes, combined with complete exercises and solutions. Book Assignments and Solutions SPAC Parallelism and Concurrency Univ of Washington Technically not a course nevertheless an awesome collection of materials used by Prof Dan Grossman to teach parallelism and concurrency concepts to sophomores at UWash 6.824 Distributed Systems MIT MIT's graduate-level DS course with a focus on fault tolerance, replication, and consistency, all taught via awesome lab assignments in Golang! Assignments - Just do git clone git://g.csail.mit.edu/6.824-golabs-2014 6.824 Readings 6.828 Operating Systems MIT MIT's operating systems course focusing on the fundamentals of OS design including booting, memory management, environments, file systems, multitasking, and more. In a series of lab assignments, you will build JOS, an OS exokernel written in C. Assignments Lectures Videos Note: These are student recorded cam videos of the 2011 course. The videos explain a lot of concepts required for the labs and assignments. CSEP 552 Distributed Systems University of Washington CSEP552 is a graduate course on distributed systems. 
Distributed systems have become central to many aspects of how computers are used, from web applications to e-commerce to content distribution. This course will cover abstractions and implementation techniques for the construction of distributed systems, including client server computing, the web, cloud computing, peer-to-peer systems, and distributed storage systems. Topics will include remote procedure call, maintaining consistency of distributed state, fault tolerance, high availability, and other topics. As we believe the best way to learn the material is to build it, there will be a series of hands-on programming projects. Lectures of a previous session are available to watch. 15-213 Introduction to Computer Systems (ICS) Carnegie-Mellon University The ICS course provides a programmer's view of how computer systems execute programs, store information, and communicate. It enables students to become more effective programmers, especially in dealing with issues of performance, portability and robustness. It also serves as a foundation for courses on compilers, networks, operating systems, and computer architecture, where a deeper understanding of systems-level issues is required. Topics covered include: machine-level code and its generation by optimizing compilers, performance evaluation and optimization, computer arithmetic, memory organization and management, networking technology and protocols, and supporting concurrent computation. This is the must-have course for everyone at CMU who wants to learn some computer science, no matter what major you are in. Because it's CMU (the course number is the same as the zip code of CMU)! 
Lecture Notes Videos Assignments 15-418 Parallel Computer Architecture and Programming Carnegie-Mellon University The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems as well as to teach parallel programming techniques necessary to effectively utilize these machines. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course will cover both parallel hardware and software design. Assignments Lecture Notes Lecture Videos Readings 15-440 Distributed Systems Carnegie-Mellon University Introduction to distributed systems with a focus on teaching concepts via projects implemented in the Go programming language. Assignments 15-721 Database Systems Carnegie-Mellon University This course is a comprehensive study of the internals of modern database management systems. It will cover the core concepts and fundamentals of the components that are used in both high-performance transaction processing systems (OLTP) and large-scale analytical systems (OLAP). The class will stress both efficiency and correctness of the implementation of these ideas. All class projects will be in the context of a real in-memory, multi-core database system. The course is appropriate for graduate students in software systems and for advanced undergraduates with strong systems programming skills. Assignments Lecture Videos Readings 15-749 Engineering Distributed Systems Carnegie-Mellon University A project focused course on Distributed Systems with an awesome list of readings Readings 18-447 Introduction to Computer Architecture CMU Very comprehensive material on Computer Architecture - definitely more than just "introduction". Online material is very user-friendly, even the recitation videos available online. This is the Spring'15 version by Prof. 
Onur Mutlu Lectures and Recitation Homeworks 7 HWs with answer set as well Readings Programming Languages / Compilers CS 75 Principles of Compiler Design Swarthmore College Modelled after the influential paper on an incremental approach to compiler design, this course teaches how to build a compiler in OCaml Course on Github Notes CS 91 Introduction to Programming Languages Swarthmore College Uses the Pyret programming language & PAPL book to understand the fundamentals of programming languages. Labs CIS 194 Introduction to Haskell Penn Engineering Explore the joys of functional programming, using Haskell as a vehicle. The aim of the course will be to allow you to use Haskell to easily and conveniently write practical programs. Previous semester also available, with more exercises CIS 198 Rust Programming UPenn This course covers what makes Rust so unique and applies it to practical systems programming problems. Topics covered include traits and generics; memory safety (move semantics, borrowing, and lifetimes); Rust’s rich macro system; closures; and concurrency. Assignments Clojure Functional Programming with Clojure University of Helsinki The course is an introduction to functional programming with the dynamically typed language Clojure. We start with an introduction to Clojure, its syntax and development environment. Clojure has a good selection of data structures and we cover most of them. We also go through the basics of recursion and higher-order functions. The course material is in English. Github Page CMSC 430 Introduction to Compilers Univ of Maryland The goal of CMSC 430 is to arm students with the ability to design, implement, and extend a programming language. Throughout the course, students will design and implement several related languages, and will explore parsing, syntax querying, dataflow analysis, compilation to bytecode, type systems, and language interoperation. 
Lecture Notes Assignments COS 326 Functional Programming Princeton University Covers functional programming concepts like closures, tail-call recursion & parallelism using the OCaml programming language Lectures Assignments CS 143 Compiler construction Stanford University CS143 is Stanford's course on the practical and theoretical aspects of compiler construction. Home Syllabus Lectures Assignments CS143 - 2011 CS 164 Hack your language! UC Berkeley Introduction to programming languages by designing and implementing domain-specific languages. Lecture Videos Code for Assignments CS 173 Programming Languages Brown University Course by Prof. Krishnamurthi (author of HtDP and numerous other awesome books on programming languages). Uses the custom-designed Pyret programming language to teach the concepts. There was an online class hosted in 2012, which includes all lecture videos for you to enjoy. Videos Assignments CS 223 Purely Functional Data Structures In Elm University of Chicago This course teaches functional reactive programming and purely functional data structures based on Chris Okasaki's book and using the Elm programming language. Lectures Assignments CS 240h Functional Systems in Haskell Stanford University Building software systems in Haskell Lecture Slides 3 Assignments: Lab1, Lab2, Lab3 CS 421 Programming Languages and Compilers Univ of Illinois, Urbana-Champaign Course that uses OCaml to teach functional programming and programming language design. Lectures Videos Assignments Exams CS 3110 Data Structures and Functional Programming Cornell University Another course that uses OCaml to teach alternative programming paradigms, especially functional and concurrent programming. Lecture Slides Assignments CS 4120 Introduction to Compilers Cornell University An introduction to the specification and implementation of modern compilers. 
Topics covered include lexical scanning, parsing, type checking, code generation and translation, an introduction to optimization, and compile-time and run-time support for modern programming languages. As part of the course, students build a working compiler for an object-oriented language. Syllabus Lectures Assignments CS 4400 Programming Languages Northeastern University This is a course on the study, design, and implementation of programming languages. The course works at two simultaneous levels: first, we will use a programming language that can demonstrate a wide variety of programming paradigms. Second, using this language, we will learn about the mechanics behind programming languages by implementing our own language(s). The two level approach usually means that we will often see how to use a certain feature, and continue by implementing it. Syllabus Lecture Notes/Resources CS 4610 Programming Languages and Compilers University of Virginia Course that uses OCaml to teach functional programming and programming language design. Each assignment is a part of an interpreter and compiler for an object-oriented language similar to Java, and you are required to use a different language for each assignment (i.e., choose 4 from Python, JS, OCaml, Haskell, Ruby). Lecture Notes Assignments CS 5114 Network Programming Languages Cornell University This course provides an introduction to the languages used to program computer networks. It will examine recent proposals based on logic, functional, and distributed languages, as well as tools for establishing correctness using automatic solvers, model checkers, and proof assistants. Syllabus Lectures CS 5142 Scripting Languages Cornell University Perl, PHP, JavaScript, VisualBasic -- they are often-requested skills for employment, but most of us do not have the time to find out what they are all about. 
In this course, you learn how to use scripting languages for rapid prototyping, web programming, data processing, and application extension. Besides covering traditional programming languages concepts as they apply to scripting (e.g., dynamic typing and scoping), this course looks at new concepts rarely found in traditional languages (e.g., string interpolation, hashes, and polylingual code). Through a series of small projects, you use different languages to achieve programming tasks that highlight the strengths and weaknesses of scripting. As a side effect, you practice teaching yourself new languages. Syllabus Lectures Assignments CS 5470 Compilers University of Utah If you're a fan of Prof Matt's writing on his fantastic blog you ought to give this a shot. The course covers the design and implementation of compilers, and it explores related topics such as interpreters, virtual machines and runtime systems. Aside from the Prof's witty take on cheating, the page has tons of interesting links on programming languages, parsing and compilers. Lecture Notes Projects CS 6118 Types and Semantics Cornell University Types and Semantics is about designing and understanding programming languages, whether they be domain specific or general purpose. The goal of this class is to provide a variety of tools for designing custom (programming) languages for whatever task is at hand. Part of that will be a variety of insights on how languages work along with experiences from working with academics and industry on creating new languages such as Ceylon and Kotlin. The class focuses on types and semantics and the interplay between them. This means category theory and constructive type theory (e.g. Coq and richer variations) are ancillary topics of the class. 
The class also covers unconventional semantic domains such as classical linear type theory in order to both break students from conventional thinking and to provide powerful targets capable of formalizing things like networking protocols, resource-sensitive computation, and concurrency constructs. The class project is to design and formalize a (programming) language for a purpose of the student's choosing, and assignments are designed to ensure students have had a chance to practice applying the techniques learned in class before culminating these skills in the class project. Syllabus Lectures CSC 253 CPython internals: A ten-hour codewalk through the Python interpreter source code University of Rochester Nine lectures walking through the internals of CPython, the canonical Python interpreter implemented in C. They were from the Dynamic Languages and Software Development course taught in Fall 2014 at the University of Rochester. CSE 341 Programming Languages University of Washington Covers non-imperative paradigms and languages such as Ruby, Racket, and ML and the fundamentals of programming languages. Lectures and Videos Assignments and Tests CSE P 501 Compiler Construction University of Washington Teaches understanding of how a modern compiler is structured and the major algorithms that are used to translate code from high-level to machine language. The best way to do this is to actually build a working compiler, so there will be a significant project to implement one that translates programs written in a core subset of Java into executable x86 assembly language. The compilers themselves will use scanner and parser generator tools and the default implementation language is Java. 
Lectures Assignments, Tests, and Solutions DMFP Discrete Mathematics and Functional Programming Wheaton College A course that teaches discrete maths concepts with functional programming Lecture Videos Assignments PCPP Practical Concurrent and Parallel Programming IT University of Copenhagen In this MSc course you learn how to write correct and efficient concurrent and parallel software, primarily using Java, on standard shared-memory multicore hardware. The course covers basic mechanisms such as threads, locks and shared memory as well as more advanced mechanisms such as parallel streams for bulk data, transactional memory, message passing, and lock-free data structures with compare-and-swap. It covers concepts such as atomicity, safety, liveness and deadlock. It covers how to measure and understand performance and scalability of parallel programs. It covers tools and methods to find bugs in concurrent programs. 6.945 Adventures in Advanced Symbolic Programming MIT Taught by Gerald Sussman of SICP fame, this class deals with concepts and techniques for the design and implementation of large software systems that can be adapted to uses not anticipated by the designer. Applications include compilers, computer-algebra systems, deductive systems, and some artificial intelligence applications. Assignments: Extensive programming assignments, using MIT/GNU Scheme. Students should have significant programming experience in Scheme, Common Lisp, Haskell, CAML or other "functional" language. Readings CS 696 Functional Design and Programming San Diego State University Covers functional programming basics using Clojure. Topics include testing, functional programming, immutable collections and concurrency. Also includes assignments covering ClojureScript, Reagent, etc. L28 Advanced Functional Programming University of Cambridge This module aims to teach students how to use the features of modern typed functional programming languages (e.g. 
OCaml, Haskell) to design and implement libraries and DSLs. It aims to demonstrate how such techniques can improve both correctness and efficiency. Students wishing to take the module should have some experience of a typed functional programming language and an understanding of type inference. This particular session was taught by a prominent OCaml programmer, open source contributor & author of Real World OCaml, Dr Anil Madhavapeddy. Algorithms CS 61B Data Structures UC Berkeley In this course, you will study advanced programming techniques including data structures, encapsulation, abstract data types, interfaces, and algorithms for sorting and searching, and you will get a taste of “software engineering” (the design and implementation of large programs). Full Lecture Materials Lectures from Spring 2016. This website contains full materials including video links, labs, homework, and projects. Very good for self-learners. Also a good start for Java. And it includes some other useful resources for Java Documentation, Data Structure Resources, Git/GitHub and Java Development Resources. Resources Labs The link to labs and projects is included in the website. Lecture Videos on Youtube The link to videos is included in the website. CS 97SI Introduction to Competitive Programming Stanford University Fantastic repository of theory and practice problems across various topics for students who are interested in participating in ACM-ICPC. Lectures and Assignments CS 224 Advanced Algorithms Harvard University CS 224 is an advanced course in algorithm design, and topics we will cover include the word RAM model, data structures, amortization, online algorithms, linear programming, semidefinite programming, approximation algorithms, hashing, randomized algorithms, fast exponential time algorithms, graph algorithms, and computational geometry. 
Lecture Videos (Youtube) Assignments CS 261 A Second Course in Algorithms Stanford University Algorithms for network optimization: max-flow, min-cost flow, matching, assignment, and min-cut problems. Introduction to linear programming. Use of LP duality for design and analysis of algorithms. Approximation algorithms for NP-complete problems such as Steiner Trees, Traveling Salesman, and scheduling problems. Randomized algorithms. Introduction to online algorithms. Lecture Notes, Videos & Assignments (Youtube) CS 473/573 Fundamental Algorithms Univ of Illinois, Urbana-Champaign Algorithms class covering recursion, randomization, amortization, graph algorithms, network flows and hardness. The lecture notes by Prof. Erickson are comprehensive enough to be a book by themselves. Highly recommended! Lecture Notes Labs and Exams CS 2150 Program & Data Representation University of Virginia This data structures course introduces C++, linked-lists, stacks, queues, trees, numerical representation, hash tables, priority queues, heaps, Huffman coding, graphs, and x86 assembly. Lectures Assignments CS 4820 Introduction to Analysis of Algorithms Cornell University This course develops techniques used in the design and analysis of algorithms, with an emphasis on problems arising in computing applications. Example applications are drawn from systems and networks, artificial intelligence, computer vision, data mining, and computational biology. This course covers four major algorithm design techniques (greedy algorithms, divide and conquer, dynamic programming, and network flow), computability theory focusing on undecidability, computational complexity focusing on NP-completeness, and algorithmic techniques for intractable problems, including identification of structured special cases, approximation algorithms, and local search heuristics. 
Lectures Assignments Syllabus CSCI 104 Data Structures and Object Oriented Design University of Southern California (USC) Lectures Labs Assignments Additional Resources CSCI 135 Software Design and Analysis I CUNY Hunter College It is currently an intensive introduction to program development and problem solving. Its emphasis is on the process of designing, implementing, and evaluating small-scale programs. It is not supposed to be a C++ programming course, although much of the course is spent on the details of C++. C++ is an extremely large and complex programming language with many features that interact in unexpected ways. One does not need to know even half of the language to use it well. Lectures and Assignments CSCI 235 Software Design and Analysis II CUNY Hunter College Introduces algorithms for a few common problems such as sorting. Practically speaking, it furthers the students' programming skills with topics such as recursion, pointers, and exception handling, and provides a chance to improve software engineering skills and to give the students practical experience for more productive programming. Lectures and Assignments CSCI 335 Software Design and Analysis III CUNY Hunter College This includes the introduction of hashes, heaps, various forms of trees, and graphs. It also revisits recursion and the sorting problem from a higher perspective than was presented in the prequels. On top of this, it is intended to introduce methods of algorithmic analysis. Lectures and Assignments CSE 331 Software Design and Implementation University of Washington Explores concepts and techniques for design and construction of reliable and maintainable software systems in modern high-level languages; program structure and design; program-correctness approaches, including testing. Lectures, Assignments, and Exams CSE 373 Analysis of Algorithms Stony Brook University Prof Steven Skiena's no stranger to any student when it comes to algorithms. 
His seminal book has been touted by many to be best for getting that job in Google. In addition, he's also well-known for tutoring students in competitive programming competitions. If you're looking to brush up your knowledge on Algorithms, you can't go wrong with this course. Lecture Videos ECS 122A Algorithm Design and Analysis UC Davis Taught by Dan Gusfield in 2010, this course is an undergraduate introduction to algorithm design and analysis. It features traditional topics, such as Big Oh notation, as well as an importance on implementing specific algorithms. Also featured are sorting (in linear time), graph algorithms, depth-first search, string matching, dynamic programming, NP-completeness, approximation, and randomization. Syllabus Lecture Videos Assignments ECS 222A Graduate Level Algorithm Design and Analysis UC Davis This is the graduate level complement to the ECS 122A undergraduate algorithms course by Dan Gusfield in 2011. It assumes an undergrad course has already been taken in algorithms, and, while going over some undergraduate algorithms topics, focuses more on increasingly complex and advanced algorithms. Lecture Videos Syllabus Assignments 6.INT Hacking a Google Interview MIT This course taught in the MIT Independent Activities Period in 2009 goes over common solution to common interview questions for software engineer interviews at highly selective companies like Apple, Google, and Facebook. They cover time complexity, hash tables, binary search trees, and other common algorithm topics you should have already covered in a different course, but goes more in depth on things you wouldn't otherwise learn in class- like bitwise logic and problem solving tricks. Handouts Topics Covered 6.006 Introduction to Algorithms MIT This course provides an introduction to mathematical modeling of computational problems. It covers the common algorithms, algorithmic paradigms, and data structures used to solve these problems. 
The course emphasizes the relationship between algorithms and programming, and introduces basic performance measures and analysis techniques for these problems. This course provides an introduction to mathematical modeling of computational problems. It covers the common algorithms, algorithmic paradigms, and data structures used to solve these problems. The course emphasizes the relationship between algorithms and programming, and introduces basic performance measures and analysis techniques for these problems. Lecture Videos Assignments Readings Resources Old Exams 6.046J/18.410J Design and Analysis of Algorithms MIT This is an intermediate algorithms course with an emphasis on teaching techniques for the design and analysis of efficient algorithms, emphasizing methods of application. Topics include divide-and-conquer, randomization, dynamic programming, greedy algorithms, incremental improvement, complexity, and cryptography. This course assumes that students know how to analyze simple algorithms and data structures from having taken 6.006. It introduces students to the design of computer algorithms, as well as analysis of sophisticated algorithms. Lecture Videos Lecture Notes Assignments Resources Old Exams 6.851 Advanced Data Structures MIT This is an advanced DS course, you must be done with the Advanced Algorithms course before attempting this one. Lectures Contains videos from sp2012 version, but there isn't much difference. Assignments contains the calendar as well. 6.854/18.415J Advanced Algorithms MIT Advanced course in algorithms by Dr. David Karger covering topics such as amortization, randomization, fingerprinting, word-level parallelism, bit scaling, dynamic programming, network flow, linear programming, fixed-parameter algorithms, and approximation algorithms. Register on NB to access the problem set and lectures. 6.854J/18.415J Advanced Algorithms MIT This course is a first-year graduate course in algorithms. 
Emphasis is placed on fundamental algorithms and advanced methods of algorithmic design, analysis, and implementation. Techniques to be covered include amortization, randomization, fingerprinting, word-level parallelism, bit scaling, dynamic programming, network flow, linear programming, fixed-parameter algorithms, and approximation algorithms. Domains include string algorithms, network optimization, parallel algorithms, computational geometry, online algorithms, external memory, cache, and streaming algorithms, and data structures. The need for efficient algorithms arises in nearly every area of computer science. But the type of problem to be solved, the notion of what algorithms are "efficient," and even the model of computation can vary widely from area to area. In this second class in algorithms, we will survey many of the techniques that apply broadly in the design of efficient algorithms, and study their application in a wide range of application domains and computational models. The goal is for the class to be broad rather than deep. Our plan is to touch upon the following areas. This is a tentative list of topics that might be covered in the class; we will select material adaptively based on the background, interests, and rate of progress of the students. Lecture Videos - Spring 2016 Lecture Notes Assignments Readings Resources 15-451/651 Algorithms Carnegie Mellon University The required algorithms class, which goes in depth into all the basic algorithms and the proofs behind them. This is one of the heavier algorithms curricula on this page. Taught by Avrim Blum and Manuel Blum, who received a Turing Award for his contributions to algorithms. The course link includes a very comprehensive set of reference notes by Avrim Blum. 16s-4102 Algorithms University of Virginia Lecture Videos & Homeworks (Youtube) CS Theory CIS 500 Software Foundations University of Pennsylvania An introduction to formal verification of software using the Coq proof assistant. 
Topics include basic concepts of logic, computer-assisted theorem proving, functional programming, operational semantics, Hoare logic, and static type systems. Lectures and Assignments Textbook CS 103 Mathematical Foundations of Computing Stanford University CS103 is a first course in discrete math, computability theory, and complexity theory. In this course, we'll probe the limits of computer power, explore why some problems are harder to solve than others, and see how to reason with mathematical certainty. Links to all lectures notes and assignments are directly on the course page CS 173 Discrete Structures Univ of Illinois Urbana-Champaign This course is an introduction to the theoretical side of computer science. In it, you will learn how to construct proofs, read and write literate formal mathematics, get a quick introduction to key theory topics and become familiar with a range of standard mathematics concepts commonly used in computer science. Textbook Written by the professor. Includes Instructor's Guide. Assignments Exams CS 276 Foundations of Cryptography UC Berkeley This course discusses the complexity-theory foundations of modern cryptography, and looks at recent results in the field such as Fully Homomorphic Encryption, Indistinguishability Obfuscation, MPC and so on. CS 278 Complexity Theory UC Berkeley A graduate level course on complexity theory that introduces P vs NP, the power of randomness, average-case complexity, hardness of approximation, and so on. CS 374 Algorithms & Models of Computation (Fall 2014) University of Illinois Urbana-Champaign CS 498 section 374 (unofficially "CS 374") covers fundamental tools and techniques from theoretical computer science, including design and analysis of algorithms, formal languages and automata, computability, and complexity. 
Specific topics include regular and context-free languages, finite-state automata, recursive algorithms (including divide and conquer, backtracking, dynamic programming, and greedy algorithms), fundamental graph algorithms (including depth- and breadth-first search, topological sorting, minimum spanning trees, and shortest paths), undecidability, and NP-completeness. The course also has a strong focus on clear technical communication. Assignments/Exams Lecture Notes/Labs Lecture videos CS 3110 Data Structures and Functional Programming Cornell University CS 3110 (formerly CS 312) is the third programming course in the Computer Science curriculum, following CS 1110/1112 and CS 2110. The goal of the course is to help students become excellent programmers and software designers who can design and implement software that is elegant, efficient, and correct, and whose code can be maintained and reused. Syllabus Lectures Assignments CS 3220 Introduction to Scientific Computing Cornell University In this one-semester survey course, we introduce numerical methods for solving linear and nonlinear equations, interpolating data, computing integrals, and solving differential equations, and we describe how to use these tools wisely (we hope!) when solving scientific problems. Syllabus Lectures Assignments CS 4300 Information Retrieval Cornell University Studies the methods used to search for and discover information in large-scale systems. The emphasis is on information retrieval applied to textual materials, but there is some discussion of other formats. The course includes techniques for searching, browsing, and filtering information and the use of classification systems and thesauruses. The techniques are illustrated with examples from web searching and digital libraries. Syllabus Lectures Assignments CS 4810 Introduction to Theory of Computing Cornell University This undergraduate course provides a broad introduction to the mathematical foundations of computer science. 
We will examine basic computational models, especially Turing machines. The goal is to understand what problems can or cannot be solved in these models. Syllabus Lectures Assignments CS 6810 Theory of Computing Cornell University This graduate course gives a broad introduction to complexity theory, including classical results and recent developments. Complexity theory aims to understand the power of efficient computation (when computational resources like time and space are limited). Many compelling conceptual questions arise in this context. Most of these questions are (surprisingly?) difficult and far from being resolved. Nevertheless, a lot of progress has been made toward understanding them (and also why they are difficult). We will learn about these advances in this course. A theme will be combinatorial constructions with random-like properties, e.g., expander graphs and error-correcting codes. Some examples: Is finding a solution inherently more difficult than verifying it? Do more computational resources mean more computing power? Is it easier to find approximate solutions than exact ones? Are randomized algorithms more powerful than deterministic ones? Is it easier to solve problems in the average case than in the worst case? Are quantum computers more powerful than classical ones? Syllabus Lectures Assignments CSCE 3193 Programming Paradigms University of Arkansas (Fayetteville) Programming in different paradigms with emphasis on object oriented programming, network programming and functional programming. Survey of programming languages, event driven programming, concurrency, software validation. Syllabus Notes Assignments Practice Exams 6.045 Great Ideas in Theoretical Computer Science MIT This course provides a challenging introduction to some of the central ideas of theoretical computer science. 
Beginning in antiquity, the course will progress through finite automata, circuits and decision trees, Turing machines and computability, efficient algorithms and reducibility, the P versus NP problem, NP-completeness, the power of randomness, cryptography and one-way functions, computational learning theory, and quantum computing. It examines the classes of problems that can and cannot be solved by various kinds of machines. It tries to explain the key differences between computational models that affect their power. Syllabus Lecture Notes Lecture Videos Introduction to CS CS 10 The Beauty and Joy of Computing UC Berkeley CS10 is UCB's introductory computer science class, taught using a beginners' drag-and-drop language. Students learn about the history, social implications, great principles, and future of computing. They also learn the joy of programming a computer using a friendly, graphical language, and will complete a substantial team programming project related to their interests. Snap! (based on Scratch by MIT). Curriculum CS 50 Introduction to Computer Science Harvard University CS50x is Harvard College's introduction to the intellectual enterprises of computer science and the art of programming for majors and non-majors alike, with or without prior programming experience. An entry-level course taught by David J. Malan. Lectures Problem Sets The course can also be taken from edX. CS 61A Structure and Interpretation of Computer Programs [Python] UC Berkeley In CS 61A, we are interested in teaching you about programming, not about how to use one particular programming language. We consider a series of techniques for controlling program complexity, such as functional programming, data abstraction, and object-oriented programming. Mastery of a particular programming language is a very useful side effect of studying these general techniques. 
However, our hope is that once you have learned the essence of programming, you will find that picking up a new programming language is but a few days' work. Lecture Resources by Type Lecture Resources by Topic Additional Resources Practice Problems Extra Lectures CS 61AS Structure & Interpretation of Computer Programs [Racket] UC Berkeley A self-paced version of the CS 61A course, but in Racket / Scheme. 61AS is a great introductory course that will ease you into all the amazing concepts that future CS courses will cover, so remember to keep an open mind, have fun, and always respect the data abstraction. Lecture Videos Assignments and Notes CS 101 Computer Science 101 Stanford University CS101 teaches the essential ideas of Computer Science for a zero-prior-experience audience. Participants play and experiment with short bits of "computer code" to bring to life the power and limitations of computers. Lecture videos will be available for free after registration. CS 106A Programming Methodology Stanford University This course is the largest of the introductory programming courses and is one of the largest courses at Stanford. Topics focus on the introduction to the engineering of computer applications emphasizing modern software engineering principles: object-oriented design, decomposition, encapsulation, abstraction, and testing. Programming Methodology teaches the widely-used Java programming language along with good software engineering principles. Lecture Videos Assignments All materials in a zip file CS 106B Programming Abstractions Stanford University This course is the natural successor to Programming Methodology and covers such advanced programming topics as recursion, algorithmic analysis, and data abstraction using the C++ programming language, which is similar to both C and Java. 
Lectures Assignments All materials in a zip file CS 107 Programming Paradigms Stanford University Topics: Advanced memory management features of C and C++; the differences between imperative and object-oriented paradigms; the functional paradigm (using LISP) and concurrent programming (using C and C++). Lectures Assignments [CS 109](http://otfried.org/courses/cs109/index.html) Programming Practice Using Scala KAIST This course introduces basic concepts of programming and computer science, such as dynamic and static typing, dynamic memory allocation, objects and methods, binary representation of numbers, using an editor and compiler from the command line, running programs with arguments from the command line, using libraries, and the use of basic data structures such as arrays, lists, sets, and maps. We will use Scala for this course. [Lectures](http://otfried.org/courses/cs109/index.html) [Assignments](http://otfried.org/courses/cs109/index.html) CS 1109 Fundamental Programming Concepts Cornell University This course provides an introduction to programming and problem solving using a high-level programming language. It is designed to raise your knowledge to a level at which you can comfortably continue to the CS111x courses. Our focus will be on generic programming concepts: variables, expressions, control structures, loops, arrays, functions, pseudocode and algorithms. You will learn how to analyze problems and convert your ideas into solutions interpretable by computers. We will use MATLAB because it provides a productive environment and is widely used by all engineering communities. Syllabus Lectures Assignments CS 1110 Introduction to Computing Using Python Cornell University Programming and problem solving using Python. Emphasizes principles of software development, style, and testing. 
Topics include procedures and functions, iteration, recursion, arrays and vectors, strings, an operational model of procedure and function calls, algorithms, exceptions, object-oriented programming, and GUIs (graphical user interfaces). Weekly labs provide guided practice on the computer, with staff present to help. Assignments use graphics and GUIs to help develop fluency and understanding. Syllabus Lectures Assignments CS 1112 Introduction to Computing Using Matlab Cornell University Programming and problem solving using MATLAB. Emphasizes the systematic development of algorithms and programs. Topics include iteration, functions, arrays and vectors, strings, recursion, algorithms, object-oriented programming, and MATLAB graphics. Assignments are designed to build an appreciation for complexity, dimension, fuzzy data, inexact arithmetic, randomness, simulation, and the role of approximation. NO programming experience is necessary; some knowledge of Calculus is required. Syllabus Lectures Assignments Projects CS 1115 Introduction to Computational Science and Engineering Using Matlab Graphical User Interfaces Cornell University Programming and problem solving using MATLAB. Emphasizes the systematic development of algorithms and programs. Topics include iteration, functions, arrays and vectors, strings, recursion, algorithms, object-oriented programming, and MATLAB graphics. Assignments are designed to build an appreciation for complexity, dimension, fuzzy data, inexact arithmetic, randomness, simulation, and the role of approximation. NO programming experience is necessary; some knowledge of Calculus is required. Syllabus Lectures Projects CS 1130 Transition to OO Programming Cornell University Introduction to object-oriented concepts using Java. Assumes programming knowledge in a language like MATLAB, C, C++, or Fortran. Students who have learned Java but were not exposed heavily to OO programming are welcome. 
Syllabus Lectures Assignments CS 1133 Transition to Python Cornell University Introduction to the Python programming language. Covers the basic programming constructs of Python, including assignment, conditionals, iteration, functions, object-oriented design, arrays, and vectorized computation. Assumes programming knowledge in a language like Java, Matlab, C, C++, or Fortran. Syllabus Lectures Assignments CS 1410-2 and CS2420-20 Computer Science I and II for Hackers University of Utah An intro course in the spirit of SICP designed by Professor Matthew Flatt (one of the lead designers of Racket and author of HtDP). Mostly Racket and C, and a bit of Java, with explanations on how high level functional programming concepts relate to the design of OOP programs. Do this one before SICP if SICP is a bit too much... Lectures and Assignments 1 Lectures and Assignments 2 Textbook Racket Language CS 2110 Object-Oriented Programming and Data Structures Cornell University CS 2110 is an intermediate-level programming course and an introduction to computer science. Topics include program design and development, debugging and testing, object-oriented programming, proofs of correctness, complexity analysis, recursion, commonly used data structures, graph algorithms, and abstract data types. Java is the principal programming language. The course syllabus can easily be extracted by looking at the link to lectures. Syllabus Lectures Assignments CS 4302 Web Information Systems Cornell University This course will introduce you to technologies for building data-centric information systems on the World Wide Web, show the practical applications of such systems, and discuss their design and their social and policy context by examining cross-cutting issues such as citizen science, data journalism and open government. 
Course work involves lectures and readings as well as weekly homework assignments, and a semester-long project in which the students demonstrate their expertise in building data-centric Web information systems. Syllabus Lectures Assignments CSCE 2004 Programming Foundations I University of Arkansas (Fayetteville) Introductory course for students majoring in computer science or computer engineering. Software development process: problem specification, program design, implementation, testing and documentation. Programming topics: data representation, conditional and iterative statements, functions, arrays, strings, file I/O, and classes. Using C++ in a UNIX environment. Syllabus Notes Assignments Practice Exams CSCI E-1 Understanding Computers and the Internet Harvard University Extension College This course is all about understanding: understanding what's going on inside your computer when you flip on the switch, why tech support has you constantly rebooting your computer, how everything you do on the Internet can be watched by others, and how your computer can become infected with a worm just by being turned on. Designed for students who use computers and the Internet every day but don't fully understand how it all works, this course fills in the gaps. Through lectures on hardware, software, the Internet, multimedia, security, privacy, website development, programming, and more, this course "takes the hood off" of computers and the Internet so that students understand how it all works and why. Through discussions of current events, students are exposed also to the latest technologies. Lecture Videos Syllabus Notes / Recaps Assignments CS-for-all CS for All Harvey Mudd College This book (and course) takes a unique approach to “Intro CS.” In a nutshell, our objective is to provide an introduction to computer science as an intellectually rich and vibrant field rather than focusing exclusively on computer programming. 
While programming is certainly an important and pervasive element of our approach, we emphasize concepts and problem-solving over syntax and programming language features. Lectures and Other resources 6.001 Structure and Interpretation of Computer Programs MIT Teaches big-picture computing concepts using the Scheme programming language. Students will implement programs in a variety of different programming paradigms (functional, object-oriented, logical). Heavy emphasis on function composition, code-as-data, control abstraction with continuations, and syntactic abstraction through macros. An excellent course if you are looking to build a mental framework on which to hang your programming knowledge. Lectures Textbook (epub, pdf) IDE 6.005 Software Construction, Fall 2016 MIT This course introduces fundamental principles and techniques of software development. Students learn how to write software that is safe from bugs, easy to understand, and ready for change. Topics include specifications and invariants; testing, test-case generation, and coverage; state machines; abstract data types and representation independence; design patterns for object-oriented programming; concurrent programming, including message passing and shared concurrency, and defending against races and deadlock; and functional programming with immutable data and higher-order functions. Lectures Notes/Assignments Machine Learning DEEPNLP Deep Learning for Natural Language Processing University of Oxford This is an applied course focussing on recent advances in analysing and generating speech and text using recurrent neural networks. We introduce the mathematical definitions of the relevant machine learning models and derive their associated optimisation algorithms. The course covers a range of applications of neural networks in NLP including analysing latent dimensions in text, transcribing speech to text, translating between languages, and answering questions. 
This course is organised by Phil Blunsom and delivered in partnership with the DeepMind Natural Language Research Group. Lectures Assignments are available on the organisation page titled as "practicals" CS20si Tensorflow for Deep Learning Research Stanford University This course will cover the fundamentals and contemporary usage of the Tensorflow library for deep learning research. We aim to help students understand the graphical computational model of Tensorflow, explore the functions it has to offer, and learn how to build and structure models best suited for a deep learning project. Through the course, students will use Tensorflow to build models of different complexity, from simple linear/logistic regression to convolutional neural network and recurrent neural networks with LSTM to solve tasks such as word embeddings, translation, optical character recognition. Students will also learn best practices to structure a model and manage research experiments. Assignments available on Github. COMS 4771 Machine Learning Columbia University Course taught by Tony Jebara introduces topics in Machine Learning for both generative and discriminative estimation. Material will include least squares methods, Gaussian distributions, linear classification, linear regression, maximum likelihood, exponential family distributions, Bayesian networks, Bayesian inference, mixture models, the EM algorithm, graphical models, hidden Markov models, support vector machines, and kernel methods. Lectures and Assignments CS 109 Data Science Harvard University Learning from data in order to gain useful predictions and insights. 
This course introduces methods for five key facets of an investigation: data wrangling, cleaning, and sampling to get a suitable data set; data management to be able to access big data quickly and reliably; exploratory data analysis to generate hypotheses and intuition; prediction based on statistical methods such as regression and classification; and communication of results through visualization, stories, and interpretable summaries. Lectures Slides Labs and Assignments 2014 Lectures 2013 Lectures (slightly better) CS 156 Learning from Data Caltech This is an introductory course in machine learning (ML) that covers the basic theory, algorithms, and applications. ML is a key technology in Big Data, and in many financial, medical, commercial, and scientific applications. It enables computational systems to adaptively improve their performance with experience accumulated from the observed data. ML has become one of the hottest fields of study today, taken up by undergraduate and graduate students from 15 different majors at Caltech. This course balances theory and practice, and covers the mathematical as well as the heuristic aspects. Lectures Homework Textbook CS 224d Deep Learning for Natural Language Processing Stanford University Natural language processing (NLP) is one of the most important technologies of the information age. Understanding complex language utterances is also a crucial part of artificial intelligence. Applications of NLP are everywhere because people communicate most everything in language: web search, advertisement, emails, customer service, language translation, radiology reports, etc. There are a large variety of underlying tasks and machine learning models powering NLP applications. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. These models can often be trained with a single end-to-end model and do not require traditional, task-specific feature engineering. 
In this spring quarter course students will learn to implement, train, debug, visualize and invent their own neural network models. The course provides a deep excursion into cutting-edge research in deep learning applied to NLP. Syllabus Lectures and Assignments CS 229r Algorithms for Big Data Harvard University Big data is data so large that it does not fit in the main memory of a single machine, and the need to process big data by efficient algorithms arises in Internet search, network traffic monitoring, machine learning, scientific computing, signal processing, and several other areas. This course will cover mathematically rigorous models for developing such algorithms, as well as some provable limitations of algorithms operating in those models. Lectures (Youtube) Assignments CS 231n Convolutional Neural Networks for Visual Recognition Stanford University Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. This course is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. During the 10-week course, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. Lecture Notes Lecture Videos Github Page CS 287 Advanced Robotics UC Berkeley The course introduces the math and algorithms underneath state-of-the-art robotic systems. The majority of these techniques are heavily based on probabilistic reasoning and optimization---two areas with wide applicability in modern Artificial Intelligence. An intended side-effect of the course is to generally strengthen your expertise in these two areas. 
Lectures Notes Assignments CS 395T Statistical and Discrete Methods for Scientific Computing University of Texas Practical course in applying modern statistical techniques to real data, particularly bioinformatic data and large data sets. The emphasis is on efficient computation and concise coding, mostly in MATLAB and C++. Topics covered include probability theory and Bayesian inference; univariate distributions; Central Limit Theorem; generation of random deviates; tail (p-value) tests; multiple hypothesis correction; empirical distributions; model fitting; error estimation; contingency tables; multivariate normal distributions; phylogenetic clustering; Gaussian mixture models; EM methods; maximum likelihood estimation; Markov Chain Monte Carlo; principal component analysis; dynamic programming; hidden Markov models; performance measures for classifiers; support vector machines; Wiener filtering; wavelets; multidimensional interpolation; information theory. Lectures and Assignments CS 4780 Machine Learning Cornell University Syllabus Lectures CS 4786 Machine Learning for Data Science Cornell University An introductory course in machine learning, with a focus on data modeling and related methods and learning algorithms for data sciences. 
Tentative topic list: Dimensionality reduction, such as principal component analysis (PCA) and the singular value decomposition (SVD), canonical correlation analysis (CCA), independent component analysis (ICA), compressed sensing, random projection, the information bottleneck. (We expect to cover some, but probably not all, of these topics.) Clustering, such as k-means, Gaussian mixture models, the expectation-maximization (EM) algorithm, link-based clustering. (We do not expect to cover hierarchical or spectral clustering.) Probabilistic-modeling topics such as graphical models, latent-variable models, inference (e.g., belief propagation), parameter learning. Regression will be covered if time permits. Assignments Lectures CVX 101 Convex Optimization Stanford University The course concentrates on recognizing and solving convex optimization problems that arise in applications. Topics addressed include the following. Convex sets, functions, and optimization problems. Basics of convex analysis. Least-squares, linear and quadratic programs, semidefinite programming, minimax, extremal volume, and other problems. Optimality conditions, duality theory, theorems of alternative, and applications. Interior-point methods. Applications to signal processing, statistics and machine learning, control and mechanical engineering, digital and analog circuit design, and finance. Textbook Lectures and Assignments DS-GA 1008 Deep Learning New York University This increasingly popular course is taught through the Data Science Center at NYU. Originally introduced by Yann LeCun, it is now led by Zaid Harchaoui, although Prof. LeCun is rumored to still stop by from time to time. It covers the theory, technique, and tricks that are used to achieve very high accuracy for machine learning tasks in computer vision and natural language processing. The assignments are in Lua and hosted on Kaggle. 
Course Page Recorded Lectures EECS E6893 & EECS E6895 Big Data Analytics & Advanced Big Data Analytics Columbia University Students will gain knowledge of how to analyze Big Data. It serves as an introductory course for graduate students who expect to face Big Data storage, processing, analysis, visualization, and application issues in both workplace and research environments. Taught by Dr. Ching-Yung Lin. Course Site Assignments - Assignments are present in the Course Slides EECS E6894 Deep Learning for Computer Vision and Natural Language Processing Columbia University This graduate-level research class focuses on deep learning techniques for vision and natural language processing problems. It gives an overview of the various deep learning models and techniques, and surveys recent advances in the related fields. This course uses Theano as the main programming tool. GPU programming experience is preferred although not required. Frequent paper presentations and a heavy programming workload are expected. Readings Assignments Lecture Notes EE103 Introduction to Matrix Methods Stanford University The course covers the basics of matrices and vectors, solving linear equations, least-squares methods, and many applications. It'll cover the mathematics, but the focus will be on using matrix methods in applications such as tomography, image processing, data fitting, time series prediction, finance, and many others. EE103 is based on a book that Stephen Boyd and Lieven Vandenberghe are currently writing. Students will use a new language called Julia to do computations with matrices and vectors. Lectures Book Assignments Code Info 290 Analyzing Big Data with Twitter UC Berkeley School of Information In this course, UC Berkeley professors and Twitter engineers provide lectures on the most cutting-edge algorithms and software tools for data analytics as applied to Twitter's data. 
Topics include applied natural language processing algorithms such as sentiment analysis, large scale anomaly detection, real-time search, information diffusion and outbreak detection, trend detection in social streams, recommendation algorithms, and advanced frameworks for distributed computing. Lecture Videos Previous Years coursepage Machine Learning: 2014-2015 University of Oxford The course focusses on neural networks and uses the Torch deep learning library (implemented in Lua) for exercises and assignments. Topics include: logistic regression, back-propagation, convolutional neural networks, max-margin learning, siamese networks, recurrent neural networks, LSTMs, hand-writing with recurrent neural networks, variational autoencoders and image generation and reinforcement learning Lectures and Assignments Source code StatLearning Intro to Statistical Learning Stanford University This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes: linear and polynomial regression, logistic regression and linear discriminant analysis; cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso); nonlinear models, splines and generalized additive models; tree-based methods, random forests and boosting; support-vector machines. The lectures cover all the material in An Introduction to Statistical Learning, with Applications in R which is a more approachable version of the Elements of Statistical Learning (or ESL) book. 10-601 Machine Learning Carnegie Mellon University This course covers the theory and practical algorithms for machine learning from a variety of perspectives. It covers topics such as Bayesian networks, decision tree learning, Support Vector Machines, statistical learning methods, unsupervised learning and reinforcement learning. 
The course covers theoretical concepts such as inductive bias, the PAC learning framework, Bayesian learning methods, margin-based learning, and Occam's Razor. Short programming assignments include hands-on experiments with various learning algorithms. This course is designed to give a graduate-level student a thorough grounding in the methodologies, technologies, mathematics and algorithms currently needed by people who do research in machine learning. Taught by one of the leading experts on Machine Learning - Tom Mitchell Lectures Project Ideas and Datasets 10-708 Probabilistic Graphical Models Carnegie Mellon University Many of the problems in artificial intelligence, statistics, computer systems, computer vision, natural language processing, and computational biology, among many other fields, can be viewed as the search for a coherent global conclusion from local information. The probabilistic graphical models framework provides a unified view for this wide range of problems, enabling efficient inference, decision-making and learning in problems with a very large number of attributes and huge datasets. This graduate-level course will provide you with a strong foundation for both applying graphical models to complex problems and for addressing core research topics in graphical models. Lecture Videos Assignments Lecture notes Readings 11-785 Deep Learning Carnegie Mellon University The course presents the subject through a series of seminars and labs, which will explore it from its early beginnings, and work themselves to some of the state of the art. The seminars will cover the basics of deep learning and the underlying theory, as well as the breadth of application areas to which it has been applied, as well as the latest issues on learning from very large amounts of data. We will concentrate largely, although not entirely, on the connectionist architectures that are most commonly associated with it. Lectures and Reading Notes are available on the page. 
CS246 Mining Massive Data Sets Stanford University The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on Map Reduce as a tool for creating parallel algorithms that can process very large amounts of data. Lecture Videos Assignments Lecture notes Readings CS276 Information Retrieval and Web Search Stanford University Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean and vector space retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification; text mining. Lecture notes Readings Practical_RL Reinforcement Learning in the Wild Yandex SDA A course on reinforcement learning in the wild. Taught on-campus in HSE and Yandex SDA (russian) and maintained to be friendly to online students (both english and russian). Syllabus Security CIS 4930 / CIS 5930 Offensive Computer Security Florida State University Course taught by W. Owen Redwood and Xiuwen Liu. It covers a wide range of computer security topics, starting from Secure C Coding and Reverse Engineering to Penetration Testing, Exploitation and Web Application Hacking, both from the defensive and the offensive point of view. Lectures and Videos Assignments CS 155 Computer and Network Security Stanford Principles of computer systems security. Attack techniques and how to defend against them. Topics include: network attacks and defenses, operating system holes, application security (web, email, databases), viruses, social engineering attacks, privacy, and digital rights management. Course projects focus on building reliable code. Recommended: Basic Unix. Primarily intended for seniors and first-year graduate students. CS 161 Computer Security UC Berkeley Introduction to computer security. Cryptography, including encryption, authentication, hash functions, cryptographic protocols, and applications. 
Operating system security, access control. Network security, firewalls, viruses, and worms. Software security, defensive programming, and language-based security. Case studies from real-world systems. CS 259 Security Modeling and Analysis Stanford The course will cover a variety of contemporary network protocols and other systems with security properties. The course goal is to give students hands-on experience in using automated tools and related techniques to analyze and evaluate security mechanisms. To understand security properties and requirements, we will look at several network protocols and their properties, including secrecy, authentication, key establishment, and fairness. In parallel, the course will look at several models and tools used in security analysis and examine their advantages and limitations. In addition to fully automated finite-state model checking techniques, we will also study other approaches, such as constraint solving, process algebras, protocol logics, probabilistic model checking, game theory, and executable models based on logic programming. CS 261 Internet/Network Security UC Berkeley This class aims to provide a thorough grounding in network security suitable for those interested in conducting research in the area, as well as students more generally interested in either security or networking. We will also look at broader issues relating to Internet security for which networking plays a role. Topics include: denial-of-service; capabilities; network intrusion detection; worms; forensics; scanning; traffic analysis / inferring activity; architecture; protocol issues; legality and ethics; web attacks; anonymity; honeypots; botnets; spam; the underground economy; research pitfalls. The course is taught with an emphasis on seminal papers rather than bleeding-edge for a given topic. CS 5430 System Security Cornell University This course discusses security for computers and networked information systems. 
We focus on abstractions, principles, and defenses for implementing military as well as commercial-grade secure systems. Syllabus Lectures Assignments CSCI 4968 Modern Binary Exploitation Rensselaer Polytechnic Institute This repository contains the materials as developed and used by RPISEC to teach Modern Binary Exploitation at Rensselaer Polytechnic Institute in Spring 2015. This was a university course developed and run solely by students to teach skills in vulnerability research, reverse engineering, and binary exploitation. Lectures Notes Labs Projects CSCI 4976 Malware Analysis Rensselaer Polytechnic Institute This repository contains the materials as developed and used by RPISEC to teach Malware Analysis at Rensselaer Polytechnic Institute in Fall 2015. This was a university course developed and run solely by students, primarily using the Practical Malware Analysis book by Michael Sikorski and Andrew Honig, to teach skills in reverse engineering, malicious behaviour, malware, and anti-analysis techniques. Lectures Notes Labs Projects EECS 588 Computer & Network Security University of Michigan Taught by J. Alex Halderman, who has analyzed the security of electronic voting machines in the US and overseas. This intensive research seminar covers foundational work and current topics in computer systems security. Readings 6.857 Computer and Network Security MIT Emphasis on applied cryptography. Topics may include: basic notion of systems security, cryptographic hash functions, symmetric cryptography (one-time pad, stream ciphers, block ciphers), cryptanalysis, secret-sharing, authentication codes, public-key cryptography (encryption, digital signatures), public-key attacks, web browser security, biometrics, electronic cash, viruses, electronic voting. Assignments include a group final project. Topics may vary year to year. Lecture Notes References 6.858 Computer Systems Security MIT Design and implementation of secure computer systems.
Lectures cover threat models, attacks that compromise security, and techniques for achieving security, based on recent research papers. Topics include operating system (OS) security, capabilities, information flow control, language security, network protocols, hardware security, and security in web applications. Taught by James Mickens and Nickolai Zeldovich Video Lectures and Labs Quizzes Readings Final Projects 18-636 Browser Security Stanford The Web continues to grow in popularity as a platform for retail transactions, financial services, and rapidly evolving forms of communication. It is becoming an increasingly attractive target for attackers who wish to compromise users' systems or steal data from other sites. Browser vendors must stay ahead of these attacks by providing features that support secure web applications. This course will study vulnerabilities in existing web browsers and the applications they render, as well as new technologies that enable web applications that were never before possible. The material will be largely based on current research problems, and students will be expected to criticize and improve existing defenses. Topics of study include (but are not limited to) browser encryption, JavaScript security, plug-in security, sandboxing, web mashups, and authentication. Artificial Intelligence CS 188 Introduction to Artificial Intelligence UC Berkeley This course will introduce the basic ideas and techniques underlying the design of intelligent computer systems. A specific emphasis will be on the statistical and decision-theoretic modeling paradigm. By the end of this course, you will have built autonomous agents that efficiently make decisions in fully informed, partially observable and adversarial settings. Your agents will draw inferences in uncertain environments and optimize actions for arbitrary reward structures. Your machine learning algorithms will classify handwritten digits and photographs.
The techniques you learn in this course apply to a wide variety of artificial intelligence problems and will serve as the foundation for further study in any application area you choose to pursue. Lectures Projects Exams CS 4700 Foundations of Artificial Intelligence Cornell University This course will provide an introduction to computer vision, with topics including image formation, feature detection, motion estimation, image mosaics, 3D shape reconstruction, and object and face detection and recognition. Applications of these techniques include building 3D maps, creating virtual characters, organizing photo and video databases, human computer interaction, video surveillance, automatic vehicle navigation, and mobile computer vision. This is a project-based course, in which you will implement several computer vision algorithms throughout the semester. Assignments Lectures CS 6700 Advanced Artificial Intelligence Cornell University The design of systems that are among top 10 performers in the world (human, computer, or hybrid human-computer). Syllabus Lectures Readings 6.868J The Society of Mind MIT This course is an introduction, by Prof. Marvin Minsky, to the theory that tries to explain how minds are made from collections of simpler processes. It treats such aspects of thinking as vision, language, learning, reasoning, memory, consciousness, ideals, emotions, and personality. It incorporates ideas from psychology, artificial intelligence, and computer science to resolve theoretical issues such as wholes vs. parts, structural vs. functional descriptions, declarative vs. procedural representations, symbolic vs. connectionist models, and logical vs. common-sense theories of learning. Lectures Assignments Readings Computer Graphics CAP 5415 Computer Vision University of Central Florida An introductory level course covering the basic topics of computer vision, and introducing some fundamental approaches for computer vision research. 
Lectures and Videos Assignments CIS 581 Computer Vision and Computational Photography University of Pennsylvania An introductory course in computer vision and computational photography focusing on four topics: image features, image morphing, shape matching, and image search. Lectures Assignments CMU 462 Computer Graphics Carnegie Mellon University This course provides a comprehensive introduction to computer graphics. It focuses on fundamental concepts and techniques, and their cross-cutting relationship to multiple problem domains in graphics (rendering, animation, geometry, imaging). Topics include: sampling, aliasing, interpolation, rasterization, geometric transformations, parameterization, visibility, compositing, filtering, convolution, curves & surfaces, geometric data structures, subdivision, meshing, spatial hierarchies, ray tracing, radiometry, reflectance, light fields, geometric optics, Monte Carlo rendering, importance sampling, camera models, high-performance ray tracing, differential equations, time integration, numerical differentiation, physically-based animation, optimization, numerical linear algebra, inverse kinematics, Fourier methods, data fitting, example-based synthesis. Lectures and Readings Assignments and Quizzes CS 123 Introduction to Computer Graphics Brown University This course offers an in-depth exploration of fundamental concepts in 2D and 3D computer graphics. It introduces 2D raster graphics techniques, including scan conversion, simple image processing, interaction techniques and user interface design. The bulk of the course is devoted to 3D modeling, geometric transformations, and 3D viewing and rendering. Lectures Labs Demos CS 378 3D Reconstruction with Computer Vision UTexas In this lab-based class, we'll dive into practical applications of 3D reconstruction, combining hardware and software to build our own 3D environments from scratch.
We'll use open-source frameworks like OpenCV to do the heavy lifting, with the focus on understanding and applying state-of-the-art approaches to geometric computer vision. Lectures CS 4620 Introduction to Computer Graphics Cornell University The study of creating, manipulating, and using visual images in the computer. Assignments Exams CS 4670 Introduction to Computer Vision Cornell University This course will provide an introduction to computer vision, with topics including image formation, feature detection, motion estimation, image mosaics, 3D shape reconstruction, and object and face detection and recognition. Applications of these techniques include building 3D maps, creating virtual characters, organizing photo and video databases, human computer interaction, video surveillance, automatic vehicle navigation, and mobile computer vision. This is a project-based course, in which you will implement several computer vision algorithms throughout the semester. Assignments Lectures CS 6670 Computer Vision Cornell University Introduction to computer vision. Topics include edge detection, image segmentation, stereopsis, motion and optical flow, image mosaics, 3D shape reconstruction, and object recognition. Students are required to implement several of the algorithms covered in the course and complete a final project. Syllabus Lectures Assignments CSCI-GA.2270-001 Graduate Computer Graphics New York University Step-by-step study of computer graphics, with reading and homework at each lecture (Fall 2015) Lectures Misc AM 207 Monte Carlo Methods and Stochastic Optimization Harvard University This course introduces important principles of Monte Carlo techniques and demonstrates the power of these techniques with simple (but very useful) applications. All of this in Python! Lecture Videos Assignments Lecture Notes CS 75 Introduction to Game Development Tufts University The course taught by Ming Y.
Chow teaches game development initially in PyGame through Python, before moving on to addressing all facets of game development. Topics addressed include game physics, sprites, animation, game development methodology, sound, testing, MMORPGs and online games, and addressing mobile development in Android, HTML5, and iOS. Most to all of the development is focused on PyGame for learning principles Text Lectures Assignments Labs CS 100 Open Source Software Construction UC Riverside This is a course on how to be a hacker. Your first four homework assignments walk you through the process of building your own unix shell. You'll be developing it as an open source project, and you will collaborate with each other at various points. Github Page Assignments CS 108 Object Oriented System Design Stanford Software design and construction in the context of large OOP libraries. Taught in Java. Topics: OOP design, design patterns, testing, graphical user interface (GUI) OOP libraries, software engineering strategies, approaches to programming in teams. CS 168 Computer Networks UC Berkeley This is an undergraduate level course covering the fundamental concepts of networking as embodied in the Internet. The course will cover a wide range of topics; see the lecture schedule for more details. While the class has a textbook, we will not follow its order of presentation but will instead use the text as a reference when covering each individual topic. The course will also have several projects that involve programming (in Python). You should know programming, data structures, and software engineering. In terms of mathematics, your algebra should be very solid, you need to know basic probability, and you should be comfortable with thinking abstractly. The TAs will spend very little time reviewing material that is not specific to networking. We assume that you either know the material covered in those courses, or are willing to learn the material as necessary. 
We won't cover any of this material in lecture. CS 193a Android App Development, Spring 2016 Stanford University Course Description: This course provides an introduction to developing applications for the Android mobile platform. Prerequisite: CS 106B or equivalent. Java experience highly recommended. OOP highly recommended. Devices: Access to an Android phone and/or tablet recommended but not required. Videos: Videos list can be found here Other materials: Some code, handouts, homework ..... and lecture notes are not downloadable on the site due to a login requirement. Please head to my Github repo here to download them. CS 193p Developing Applications for iOS Stanford University Updated for iOS 7. Tools and APIs required to build applications for the iPhone and iPad platform using the iOS SDK. User interface designs for mobile devices and unique user interactions using multi-touch technologies. Object-oriented design using model-view-controller paradigm, memory management, Objective-C programming language. Other topics include: object-oriented database API, animation, multi-threading and performance considerations. Prerequisites: C language and object-oriented programming experience Recommended: Programming Abstractions Updated courses for iOS8 - Swift Updated courses for iOS9 - Swift CS 223A Introduction to Robotics Stanford University The purpose of this course is to introduce you to basics of modeling, design, planning, and control of robot systems. In essence, the material treated in this course is a brief survey of relevant results from geometry, kinematics, statics, dynamics, and control. Lectures Assignments CS 262a Advanced Topics in Computer Systems UC Berkeley CS262a is the first semester of a year-long sequence on computer systems research, including operating systems, database systems, and Internet infrastructure systems.
The goal of the course is to cover a broad array of research topics in computer systems, and to engage you in top-flight systems research. The first semester is devoted to basic thematic issues and underlying techniques in computer systems, while the second semester goes deeper into topics related to scalable, parallel and distributed systems. The class is based on a discussion of important research papers and a research project. Parts: Some Classics, Persistent Storage, Concurrency, Higher-Level Models, Virtual Machines, Cloud Computing, Parallel and Distributed Computing, Potpourri. Prerequisites: The historical prerequisite was to pass an entrance exam in class, which covered undergraduate operating systems material (similar to UCB's CS162). There is no longer an exam. However, if you have not already taken a decent undergrad OS class, you should talk with me before taking this class. The exam had the benefit of "paging in" the undergrad material, which may have been its primary value (since the pass rate was high). Readings & Lectures CS 294 Cutting-edge Web Technologies Berkeley Want to learn what makes future web technologies tick? Join us for the class where we will dive into the internals of many of the newest web technologies, analyze and dissect them. We will conduct survey lectures to provide the background and overview of the area as well as invite guest lecturers from various leading projects to present their technologies. CS 411 Software Architecture Design Bilkent University This course teaches the basic concepts, methods and techniques for designing software architectures. 
The topics include: rationale for software architecture design, modeling software architecture design, architectural styles/patterns, architectural requirements analysis, comparison and evaluation of architecture design methods, synthesis-based software architecture design, software product-line architectures, domain modeling, domain engineering and application engineering, software architecture implementation, evaluating software architecture designs. CS 3152 Introduction to Computer Game Development Cornell University A project-based course in which programmers and designers collaborate to make a computer game. This course investigates the theory and practice of developing computer games from a blend of technical, aesthetic, and cultural perspectives. Technical aspects of game architecture include software engineering, artificial intelligence, game physics, computer graphics, and networking. Aesthetic and cultural include art and modeling, sound and music, game balance, and player experience. Syllabus Lectures Assignments CS 4152 Advanced Topics in Computer Game Development Cornell University Project-based follow-up course to CS/INFO 3152. Students work in a multidisciplinary team to develop a game that incorporates innovative game technology. Advanced topics include 3D game development, mobile platforms, multiplayer gaming, and nontraditional input devices. There is a special emphasis on developing games that can be submitted to festivals and competitions, or that can be commercialized. Syllabus Lectures Assignments CS 4154 Analytics-driven Game Design Cornell University A project-based course in which programmers and designers collaborate to design, implement, and release a video game online through popular game portals. In this course, students will use the internet to gather data anonymously from players. Students will analyze this data in order to improve their game over multiple iterations. 
Technical aspects of this course include programming, database architecture, and statistical analysis. Syllabus Lectures Assignments CS 4812 Quantum Information Processing Cornell University Hardware that exploits quantum phenomena can dramatically alter the nature of computation. Though constructing a working quantum computer is a formidable technological challenge, there has been much recent experimental progress. In addition, the theory of quantum computation is of interest in itself, offering strikingly different perspectives on the nature of computation and information, as well as providing novel insights into the conceptual puzzles posed by the quantum theory. The course is intended both for physicists, unfamiliar with computational complexity theory or cryptography, and also for computer scientists and mathematicians, unfamiliar with quantum mechanics. The prerequisites are familiarity (and comfort) with finite dimensional vector spaces over the complex numbers, some standard group theory, and ability to count in binary. Syllabus Lectures CS 4860 Applied Logic Cornell University In addition to basic first-order logic, when taught by Computer Science this course involves elements of Formal Methods and Automated Reasoning. Formal Methods is concerned with proving properties of algorithms, specifying programming tasks and synthesizing programs from proofs. We will use formal methods tools such as interactive proof assistants (see www.nuprl.org). We will also spend two weeks on constructive type theory, the language used by the Coq and Nuprl proof assistants. Syllabus Lectures Assignments CS 5150 Software Engineering Cornell University Introduction to the practical problems of specifying, designing, building, testing, and delivering reliable software systems Lectures Assignments CS 5220 Applications of Parallel Computers Cornell University How do we solve the large-scale problems of science quickly on modern computers? 
How do we measure the performance of new or existing simulation codes, and what things can we do to make them run faster? How can we best take advantage of features like multicore processors, vector units, and graphics co-processors? These are the types of questions we will address in CS 5220, Applications of Parallel Computers. Topics include: Single-processor architecture, caches, and serial performance tuning Basics of parallel machine organization Distributed memory programming with MPI Shared memory programming with OpenMP Parallel patterns: data partitioning, synchronization, and load balancing Examples of parallel numerical algorithms Applications from science and engineering Lectures Assignments CS 5540 Computational Techniques for Analyzing Clinical Data Cornell University CS5540 is a masters-level course that covers a wide range of clinical problems and their associated computational challenges. The practice of medicine is filled with digitally accessible information about patients, ranging from EKG readings to MRI images to electronic health records. This poses a huge opportunity for computer tools that make sense out of this data. Computation tools can be used to answer seemingly straightforward questions about a single patient's test results (“Does this patient have a normal heart rhythm?”), or to address vital questions about large populations (“Is there any clinical condition that affects the risks of Alzheimer”). In CS5540 we will look at many of the most important sources of clinical data and discuss the basic computational techniques used for their analysis, ranging in sophistication from current clinical practice to state-of-the-art research projects. Syllabus Lectures Assignments CS 5724 Evolutionary Computation Cornell University This course will cover advanced topics in evolutionary algorithms and their application to open-ended computational design. 
The field of evolutionary computation tries to address large-scale optimization and planning problems through stochastic population-based methods. It draws inspiration from evolutionary processes in nature and in engineering, and also serves as abstract models for these phenomena. Evolutionary processes are generally weak methods that require little information about the problem domain and hence can be applied across a wide variety of applications. They are especially useful for open-ended problem domains for which little formal knowledge exists and the number of parameters is undefined, such as for the general engineering design process. This course will provide insight to a variety of evolutionary computation paradigms, such as genetic algorithms, genetic programming, and evolutionary strategies, as well as governing dynamics of co-evolution, arms races and mediocre stable states. New methods involving symbiosis models and pattern recognition will also be presented. The material will be intertwined with discussions of representations and results for design problems in a variety of problem domains including software, electronics, and mechanics. Syllabus Lectures Assignments CS 6452 Datacenter Networks and Services Cornell University CS6452 focuses on datacenter networks and services. The emerging demand for web services and cloud computing has created a need for large-scale data centers. The hardware and software infrastructure for datacenters critically determines the functionality, performance, cost and failure tolerance of applications running on that datacenter. This course will examine design alternatives for both the hardware (networking) infrastructure, and the software infrastructure for datacenters. Syllabus Lectures CS 6630 Realistic Image Synthesis Cornell University CS6630 is an introduction to physics-based rendering at the graduate level.
Starting from the fundamentals of light transport we will look at formulations of the Rendering Equation, and a series of Monte Carlo methods, from sequential sampling to multiple importance sampling to Markov Chains, for solving the equation to make pictures. We'll look at light reflection from surfaces and scattering in volumes, illumination from luminaires and environments, and diffusion models for translucent materials. We will build working implementations of many of the algorithms we study, and learn how to make sure they are actually working correctly. It's fun to watch integrals and probability distributions transform into photographs of a slightly too perfect synthetic world. Syllabus Lectures Assignments Readings CS 6640 Computational Photography Cornell University A course on the emerging applications of computation in photography. Likely topics include digital photography, unconventional cameras and optics, light field cameras, image processing for photography, techniques for combining multiple images, advanced image editing algorithms, and projector-camera systems. Lectures Assignments CS 6650 Computational Motion Cornell University Covers computational aspects of motion, broadly construed. Topics include the computer representation, modeling, analysis, and simulation of motion, and its relationship to various areas, including computational geometry, mesh generation, physical simulation, computer animation, robotics, biology, computer vision, acoustics, and spatio-temporal databases. Students implement several of the algorithms covered in the course and complete a final project. This offering will also explore the special role of motion processing in physically based sound rendering. CS 6840 Algorithmic Game Theory Cornell University Algorithmic Game Theory combines algorithmic thinking with game-theoretic, or, more generally, economic concepts.
The course will study a range of topics at this interface. Syllabus Lectures Assignments Readings CSE 154 Web Programming University of Washington This course is an introduction to programming for the World Wide Web. Covers use of HTML, CSS, PHP, JavaScript, AJAX, and SQL. Lectures Assignments ESM 296-4F GIS & Spatial Analysis UC Santa Barbara Taught by James Frew, Ben Best, and Lisa Wedding. Focuses on specific computational languages (e.g., Python, R, shell) and tools (e.g., GDAL/OGR, InVEST, MGET, ModelBuilder) applied to the spatial analysis of environmental problems. GitHub (includes lecture materials and labs) ICS 314 Software Engineering University of Hawaii Taught by Philip Johnson. Introduction to software engineering using the "Athletic Software Engineering" pedagogy. Readings Experiences Assessments IGME 582 Humanitarian Free & Open Source Software Development Rochester Institute of Technology This course provides students with exposure to the design, creation and production of Open Source Software projects. Students will be introduced to the historic intersections of technology and intellectual property rights and will become familiar with Open Source development processes, tools and practices. I485 / H400 Biologically Inspired Computation Indiana University Course taught by Luis Rocha about the multi-disciplinary field of algorithms inspired by naturally occurring phenomena. This course introduces the following areas: L-systems, Cellular Automata, Emergence, Genetic Algorithms, Swarm Intelligence and Artificial Immune Systems. Its aim is to cover the fundamentals and enable readers to build up a proficiency in applying various algorithms to real-world problems. Lectures Assignments Open Sourced Elective: Database and Rails Intro to Ruby on Rails University of Texas An introductory course in Ruby on Rails open sourced by University of Texas' CS Adjunct Professor, Richard Schneeman.
Lectures Assignments Videos
SCICOMP An Introduction to Efficient Scientific Computation Universität Bremen This is a graduate course in scientific computing created and taught by Oliver Serang in 2014, which covers topics in computer science and statistics with applications from biology. The course is designed top-down, starting with a problem and then deriving a variety of solutions from scratch. Topics include memoization, recurrence closed forms, string matching (sorting, hash tables, radix tries, and suffix tries), dynamic programming (e.g. Smith-Waterman and Needleman-Wunsch), Bayesian statistics (e.g. the envelope paradox), graphical models (HMMs, Viterbi, junction tree, belief propagation), FFT, and the probabilistic convolution tree. Lecture videos on Youtube and for direct download
14-740 Fundamentals of Computer Networks CMU This is an introductory course on Networking for graduate students. It follows a top-down approach to teaching Computer Networks, so it starts with the Application layer which most of the students are familiar with and as the course unravels we learn more about transport, network and link layers of the protocol stack. As far as prerequisites are concerned - basic computer, programming and probability theory background is required. The course site contains links to the lecture videos, reading material and assignments.
Sample Classification Code of CIFAR-10 in Torch
from: http://torch.ch/blog/2015/07/30/cifar.html

require 'xlua'
require 'optim'
require 'nn'
require 'image'
local c = require 'trepl.colorize'

opt = lapp[[
   -s,--save                  (default "logs")        subdirectory to save logs
   -b,--batchSize             (default 128)           batch size
   -r,--learningRate          (default 1)             learning rate
   --learningRateDecay        (default 1e-7)          learning rate decay
   --weightDecay              (default 0.0005)        weightDecay
   -m,--momentum              (default 0.9)           momentum
   --epoch_step               (default 25)            epoch step
   --model                    (default vgg_bn_drop)   model name
   --max_epoch                (default 300)           maximum number of iterations
   --backend                  (default nn)            backend
   --type                     (default cuda)          cuda/float/cl
]]

print(opt)

do -- data augmentation module
  local BatchFlip,parent = torch.class('nn.BatchFlip', 'nn.Module')

  function BatchFlip:__init()
    parent.__init(self)
    self.train = true
  end

  function BatchFlip:updateOutput(input)
    if self.train then
      local bs = input:size(1)
      local flip_mask = torch.randperm(bs):le(bs/2)
      for i=1,input:size(1) do
        if flip_mask[i] == 1 then image.hflip(input[i], input[i]) end
      end
    end
    self.output:set(input)
    return self.output
  end
end

local function cast(t)
  if opt.type == 'cuda' then
    require 'cunn'
    return t:cuda()
  elseif opt.type == 'float' then
    return t:float()
  elseif opt.type == 'cl' then
    require 'clnn'
    return t:cl()
  else
    error('Unknown type '..opt.type)
  end
end

print(c.blue '==>' ..' configuring model')
local model = nn.Sequential()
model:add(nn.BatchFlip():float())
model:add(cast(nn.Copy('torch.FloatTensor', torch.type(cast(torch.Tensor())))))
model:add(cast(dofile('models/'..opt.model..'.lua')))
model:get(2).updateGradInput = function(input) return end

if opt.backend == 'cudnn' then
  require 'cudnn'
  cudnn.benchmark=true
  cudnn.convert(model:get(3), cudnn)
end

print(model)

print(c.blue '==>' ..' loading data')

----------------------------------------------------------------------
-- Load the Train and Test data
----------------------------------------------------------------------
local trsize = 50000
local tesize = 10000

-- load dataset
trainData = {
  data = torch.Tensor(50000, 3072),
  labels = torch.Tensor(50000),
  size = function() return trsize end
}
local trainData = trainData
for i = 0,4 do
  local subset = torch.load('cifar-10-batches-t7/data_batch_' .. (i+1) .. '.t7', 'ascii')
  trainData.data[{ {i*10000+1, (i+1)*10000} }] = subset.data:t()
  trainData.labels[{ {i*10000+1, (i+1)*10000} }] = subset.labels
end
trainData.labels = trainData.labels + 1

local subset = torch.load('cifar-10-batches-t7/test_batch.t7', 'ascii')
testData = {
  data = subset.data:t():double(),
  labels = subset.labels[1]:double(),
  size = function() return tesize end
}
local testData = testData
testData.labels = testData.labels + 1

-- resize dataset (if using small version)
trainData.data = trainData.data[{ {1,trsize} }]
trainData.labels = trainData.labels[{ {1,trsize} }]

testData.data = testData.data[{ {1,tesize} }]
testData.labels = testData.labels[{ {1,tesize} }]

-- reshape data
trainData.data = trainData.data:reshape(trsize,3,32,32)
testData.data = testData.data:reshape(tesize,3,32,32)

----------------------------------------------------------------------
-- preprocessing data (color space + normalization)
----------------------------------------------------------------------
print '<trainer> preprocessing data (color space + normalization)'
collectgarbage()

-- preprocess trainSet
local normalization = nn.SpatialContrastiveNormalization(1, image.gaussian1D(7))
for i = 1,trainData:size() do
  xlua.progress(i, trainData:size())
  -- rgb -> yuv
  local rgb = trainData.data[i]
  local yuv = image.rgb2yuv(rgb)
  -- normalize y locally:
  yuv[1] = normalization(yuv[{{1}}])
  trainData.data[i] = yuv
end
-- normalize u globally:
local mean_u = trainData.data:select(2,2):mean()
local std_u = trainData.data:select(2,2):std()
trainData.data:select(2,2):add(-mean_u)
trainData.data:select(2,2):div(std_u)
-- normalize v globally:
local mean_v = trainData.data:select(2,3):mean()
local std_v = trainData.data:select(2,3):std()
trainData.data:select(2,3):add(-mean_v)
trainData.data:select(2,3):div(std_v)

trainData.mean_u = mean_u
trainData.std_u = std_u
trainData.mean_v = mean_v
trainData.std_v = std_v

-- preprocess testSet
for i = 1,testData:size() do
  xlua.progress(i, testData:size())
  -- rgb -> yuv
  local rgb = testData.data[i]
  local yuv = image.rgb2yuv(rgb)
  -- normalize y locally:
  yuv[{1}] = normalization(yuv[{{1}}])
  testData.data[i] = yuv
end
-- normalize u globally:
testData.data:select(2,2):add(-mean_u)
testData.data:select(2,2):div(std_u)
-- normalize v globally:
testData.data:select(2,3):add(-mean_v)
testData.data:select(2,3):div(std_v)

----------------------------------------------------------------------
-- END
----------------------------------------------------------------------

trainData.data = trainData.data:float()
testData.data = testData.data:float()

confusion = optim.ConfusionMatrix(10)

print('Will save at '..opt.save)
paths.mkdir(opt.save)
testLogger = optim.Logger(paths.concat(opt.save, 'test.log'))
testLogger:setNames{'% mean class accuracy (train set)', '% mean class accuracy (test set)'}
testLogger.showPlot = false

parameters,gradParameters = model:getParameters()

print(c.blue'==>' ..' setting criterion')
criterion = cast(nn.CrossEntropyCriterion())

print(c.blue'==>' ..' configuring optimizer')
optimState = {
  learningRate = opt.learningRate,
  weightDecay = opt.weightDecay,
  momentum = opt.momentum,
  learningRateDecay = opt.learningRateDecay,
}

function train()
  model:training()
  epoch = epoch or 1

  -- drop learning rate every "epoch_step" epochs
  if epoch % opt.epoch_step == 0 then optimState.learningRate = optimState.learningRate/2 end

  print(c.blue '==>'.." online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')

  local targets = cast(torch.FloatTensor(opt.batchSize))
  local indices = torch.randperm(trainData.data:size(1)):long():split(opt.batchSize)
  -- remove last element so that all the batches have equal size
  indices[#indices] = nil

  local tic = torch.tic()
  for t,v in ipairs(indices) do
    xlua.progress(t, #indices)

    local inputs = trainData.data:index(1,v)
    targets:copy(trainData.labels:index(1,v))

    local feval = function(x)
      if x ~= parameters then parameters:copy(x) end
      gradParameters:zero()

      local outputs = model:forward(inputs)
      local f = criterion:forward(outputs, targets)
      local df_do = criterion:backward(outputs, targets)
      model:backward(inputs, df_do)

      confusion:batchAdd(outputs, targets)

      return f,gradParameters
    end
    optim.sgd(feval, parameters, optimState)
  end

  confusion:updateValids()
  print(('Train accuracy: '..c.cyan'%.2f'..' %%\t time: %.2f s'):format(
        confusion.totalValid * 100, torch.toc(tic)))

  train_acc = confusion.totalValid * 100

  confusion:zero()
  epoch = epoch + 1
end

function test()
  -- disable flips, dropouts and batch normalization
  model:evaluate()
  print(c.blue '==>'.." testing")
  local bs = 125
  for i=1,testData.data:size(1),bs do
    local outputs = model:forward(testData.data:narrow(1,i,bs))
    confusion:batchAdd(outputs, testData.labels:narrow(1,i,bs))
  end

  confusion:updateValids()
  print('Test accuracy:', confusion.totalValid * 100)

  if testLogger then
    paths.mkdir(opt.save)
    testLogger:add{train_acc, confusion.totalValid * 100}
    testLogger:style{'-','-'}
    testLogger:plot()

    if paths.filep(opt.save..'/test.log.eps') then
      local base64im
      do
        os.execute(('convert -density 200 %s/test.log.eps %s/test.png'):format(opt.save,opt.save))
        os.execute(('openssl base64 -in %s/test.png -out %s/test.base64'):format(opt.save,opt.save))
        local f = io.open(opt.save..'/test.base64')
        if f then base64im = f:read'*all' end
      end

      local file = io.open(opt.save..'/report.html','w')
      file:write(([[
      <!DOCTYPE html>
      <html>
      <body>
      <title>%s - %s</title>
      <img src="data:image/png;base64,%s">
      <h4>optimState:</h4>
      <table>
      ]]):format(opt.save,epoch,base64im))
      for k,v in pairs(optimState) do
        if torch.type(v) == 'number' then
          file:write('<tr><td>'..k..'</td><td>'..v..'</td></tr>\n')
        end
      end
      file:write'</table><pre>\n'
      file:write(tostring(confusion)..'\n')
      file:write(tostring(model)..'\n')
      file:write'</pre></body></html>'
      file:close()
    end
  end

  -- save model every 50 epochs
  if epoch % 50 == 0 then
    local filename = paths.concat(opt.save, 'model.net')
    print('==> saving model to '..filename)
    torch.save(filename, model:get(3):clearState())
  end

  confusion:zero()
end

for i=1,opt.max_epoch do
  train()
  test()
end

The original version of the code: why was it written like this? It cannot run ...
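The comment "drop learning rate every epoch_step epochs" in train() describes a simple step schedule. As a rough illustration only, here is a minimal Python sketch of that rule, assuming the script defaults (learning rate 1, epoch_step 25). The function name learning_rate is hypothetical, and the sketch deliberately ignores the separate learningRateDecay option that the script also passes to optim.sgd.

```python
def learning_rate(epoch, base_lr=1.0, epoch_step=25):
    """Learning rate in effect during a given 1-based epoch under the halving rule."""
    # The Lua loop halves the rate whenever epoch % epoch_step == 0,
    # so by epoch e it has been halved e // epoch_step times.
    return base_lr / 2 ** (epoch // epoch_step)


# Print the rate at a few epochs around the halving points.
for epoch in (1, 24, 25, 49, 50, 300):
    print(epoch, learning_rate(epoch))
```

With these defaults the rate stays at 1 for the first 24 epochs and is halved at epochs 25, 50, 75 and so on, which is why long runs end with a very small step size.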
Learning Deep Learning with Keras
30 Apr 2017 • Piotr Migdał • [machine-learning] [deep-learning] [overview]
I teach deep learning both for a living (as the main deepsense.io instructor, in a Kaggle-winning team[1]) and as a part of my volunteering with the Polish Children’s Fund giving workshops to gifted high-school students[2]. I want to share a few things I’ve learnt about teaching (and learning) deep learning. Whether you want to start learning deep learning for your career, to have a nice adventure (e.g. with detecting huggable objects) or to get insight into machines before they take over[3], this post is for you! Its goal is not to teach neural networks by itself, but to provide an overview and to point to didactically useful resources. Don’t be afraid of artificial neural networks - it is easy to start! In fact, my biggest regret is delaying learning it, because of the perceived difficulty. To start, all you need is really basic programming, very simple mathematics and knowledge of a few machine learning concepts. I will explain where to start with these requirements. In my opinion, the best way to start is from a high-level interactive approach (see also: Quantum mechanics for high-school students and my Quantum Game with Photons). For that reason, I suggest starting with image recognition tasks in Keras, a popular neural network library in Python. If you like to train neural networks with less code than in Keras, the only viable option is to use pigeons. Yes, seriously: pigeons spot cancer as well as human experts!
What is deep learning and why is it cool?
Deep learning is a name for machine learning techniques using many-layered artificial neural networks.
Occasionally people use the term artificial intelligence, but unless you want to sound sci-fi, it is reserved for problems that are currently considered “too hard for machines” - a frontier that keeps moving rapidly. This is a field that exploded in the last few years, reaching human-level accuracy in visual recognition tasks (among many other tasks). Unlike quantum computing, or nuclear fusion - it is a technology that is being applied right now, not some possibility for the future. There is a rule of thumb: Pretty much anything that a normal person can do in <1 sec, we can now automate with AI. - Andrew Ng’s tweet
Some people go even further, extrapolating that statement to experts. It’s not a surprise that companies like Google and Facebook are at the cutting edge of progress. In fact, every few months I am blown away by something exceeding my expectations, e.g.:
The Unreasonable Effectiveness of Recurrent Neural Networks[4] for generating fake Shakespeare, Wikipedia entries and LaTeX articles
A Neural Algorithm of Artistic Style style transfer (and for videos!)
Real-time Face Capture and Reenactment
Colorful Image Colorization
Plug & Play Generative Networks for photorealistic image generation
Dermatologist-level classification of skin cancer along with other medical diagnostic tools
Image-to-Image Translation (pix2pix) - sketch to photo
Teaching Machines to Draw sketches of cats, dogs etc
It looks like some sorcery. If you are curious what neural networks are, take a look at this series of videos for a smooth introduction:
Neural Networks Demystified by Stephen Welch - video series
A Visual and Interactive Guide to the Basics of Neural Networks by J Alammar
These techniques are data-hungry.
See a plot of AUC score for logistic regression, random forest and deep learning on Higgs dataset (data points are in millions): In general there is no guarantee that, even with a lot of data, deep learning does better than other techniques, for example tree-based ones such as random forest or boosted trees.
Let’s play!
Do I need some Skynet to run it? Actually not - it’s a piece of software, like any other. And you can even play with it in your browser:
TensorFlow Playground for point separation, with a visual interface
ConvNetJS for digit and image recognition
Keras.js Demo - to visualize and use real networks in your browser (e.g. ResNet-50)
Or… if you want to use Keras in Python, see this minimal example - just to get convinced you can use it on your own computer.
Python and machine learning
I mentioned basic Python and machine learning as a requirement. They are already covered in my introduction to data science in Python and statistics and machine learning sections, respectively. For Python, if you already have the Anaconda distribution (covering most data science packages), the only thing you need is to install TensorFlow and Keras. When it comes to machine learning, you don’t need to learn many techniques before jumping into deep learning. Though, later it would be a good practice to see if a given problem can be solved with much simpler methods. For example, random forest is often a lockpick, working out-of-the-box for many problems. You need to understand why we need to train and then test a classifier (to validate its predictive power). To get the gist of it, start with this beautiful tree-based animation: Visual introduction to machine learning by Stephanie Yee and Tony Chu
Also, it is good to understand logistic regression, which is a building block of almost any neural network for classification.
Mathematics
Deep learning (that is - neural networks with many layers) uses mostly very simple mathematical operations - just many of them.
Here are a few, which you can find in almost any network (look at this list, but don’t get intimidated):
vectors, matrices, multi-dimensional arrays,
addition, multiplication,
convolutions to extract and process local patterns,
activation functions: sigmoid, tanh or ReLU to add non-linearity,
softmax to convert vectors into probabilities,
log-loss (cross-entropy) to penalize wrong guesses in a smart way,
gradients and chain-rule (backpropagation) for optimizing network parameters,
stochastic gradient descent and its variants (e.g. momentum).
If your background is in mathematics, statistics, physics[5] or signal processing - most likely you already know more than enough to start! If your last contact with mathematics was in high-school, don’t worry. Its mathematics is simple to the point that a convolutional neural network for digit recognition can be implemented in a spreadsheet (with no macros), see: Deep Spreadsheets with ExcelNet. It is only a proof-of-principle solution - not only inefficient, but also lacking the most crucial part - the ability to train new networks. The basics of vector calculus are crucial not only for deep learning, but also for many other machine learning techniques (e.g. in word2vec I wrote about). To learn it, I recommend starting from one of the following:
J. Ström, K. Åström, and T. Akenine-Möller, Immersive Linear Algebra - a linear algebra book with fully interactive figures
Applied Math and Machine Learning Basics: Linear Algebra from the Deep Learning book
Linear algebra cheat sheet for deep learning by Brendan Fortuner
Since there are many references to NumPy, it may be useful to learn its basics:
From Python to Numpy by Nicolas P. Rougier
SciPy lectures: The NumPy array object
At the same time - look back at the meme, at the What mathematicians think I do part. It’s totally fine to start from a magically working code, treating neural network layers like LEGO blocks.
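To make two items from the list above concrete, here is a minimal sketch of softmax and log-loss in plain Python, with no framework involved. The function names and the example scores are illustrative assumptions, not code from any library; real frameworks apply the same idea to whole batches of arrays.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_loss(probs, true_index):
    """Cross-entropy for one example: minus the log of the probability of the true class."""
    return -math.log(probs[true_index])


scores = [2.0, 1.0, 0.1]  # hypothetical network outputs for three classes
probs = softmax(scores)
print(probs)
print(log_loss(probs, 0))  # the true class received the highest probability
print(log_loss(probs, 2))  # the true class received the lowest probability
```

The loss is small when the network puts most of the probability on the correct class and grows quickly as that probability approaches zero, which is the "smart" penalty mentioned in the list.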
Frameworks
There is a handful of popular deep learning libraries, including TensorFlow, Theano, Torch and Caffe. Each of them has a Python interface (now also for Torch: PyTorch). So, which to choose? First, as always, screw all subtle performance benchmarks, as premature optimization is the root of all evil. What is crucial is to start with one which is easy to write (and read!), one with many online resources, and one that you can actually install on your computer without too much pain. Bear in mind that core frameworks are multidimensional array expression compilers with GPU support. Current neural networks can be expressed as such. However, if you just want to work with neural networks, by rule of least power, I recommend starting with a framework just for neural networks. For example…
Keras
If you like the philosophy of Python (brevity, readability, one preferred way to do things), Keras is for you. It is a high-level library for neural networks, using TensorFlow or Theano as its backend. Also, if you want to have a propaganda picture, there is a possibly biased (or overfitted?) popularity ranking: The state of deep learning frameworks (from GitHub metrics), April 2017. - François Chollet (Keras creator)
If you want to consult a different source, based on arXiv papers rather than GitHub activity, see A Peek at Trends in Machine Learning by Andrej Karpathy. Popularity is important - it means that if you want to search for a network architecture, googling for it (e.g. UNet Keras) is likely to return an example. Where to start learning it? Documentation on Keras is nice, and its blog is a valuable resource.
For a complete, interactive introduction to deep learning with Keras in Jupyter Notebook, I really recommend: Deep Learning with Keras and TensorFlow by Valerio Maggio
For shorter ones, try one of these:
Visualizing parts of Convolutional Neural Networks using Keras and Cats by Erik Reppel
Deep learning for complete beginners: convolutional neural networks with Keras by Petar Veličković
Handwritten Digit Recognition using Convolutional Neural Networks in Python with Keras by Jason Brownlee (Theano tensor dimension order[6])
There are a few add-ons to Keras, which are especially useful for learning it. I created ASCII summary for sequential models to show data flow inside networks (in a nicer way than model.summary()). It shows layers, dimensions of data (x, y, channels) and the number of free parameters (to be optimized). For example, for a network for digit recognition it might look like:

           OPERATION           DATA DIMENSIONS   WEIGHTS(N)   WEIGHTS(%)

               Input   #####     32   32    3
              Conv2D    \|/  -------------------       896     0.1%
                relu   #####     32   32   32
              Conv2D    \|/  -------------------      9248     0.7%
                relu   #####     30   30   32
        MaxPooling2D   Y max -------------------         0     0.0%
                       #####     15   15   32
             Dropout    | || -------------------         0     0.0%
                       #####     15   15   32
              Conv2D    \|/  -------------------     18496     1.5%
                relu   #####     15   15   64
              Conv2D    \|/  -------------------     36928     3.0%
                relu   #####     13   13   64
        MaxPooling2D   Y max -------------------         0     0.0%
                       #####      6    6   64
             Dropout    | || -------------------         0     0.0%
                       #####      6    6   64
             Flatten   ||||| -------------------         0     0.0%
                       #####        2304
               Dense   XXXXX -------------------   1180160    94.3%
                relu   #####         512
             Dropout    | || -------------------         0     0.0%
                       #####         512
               Dense   XXXXX -------------------      5130     0.4%
             softmax   #####          10

You might be also interested in nicer progress bars with keras-tqdm, exploration of activations at each layer with quiver or converting Keras models to JavaScript, runnable in a browser with Keras.js.
TensorFlow
If not Keras, then I recommend starting with bare TensorFlow.
It is a bit more low-level and verbose, but makes it straightforward to optimize various multidimensional array (or, well, tensor) operations. A few good resources:
the official TensorFlow Tutorial is very good
Learn TensorFlow and deep learning, without a Ph.D. by Martin Görner
TensorFlow Tutorial and Examples for beginners by Aymeric Damien (with Python 2.7)
Simple tutorials using Google’s TensorFlow Framework by Nathan Lintz
In any case, TensorBoard makes it easy to keep track of the training process. It can also be used with Keras, via callbacks.
Other
Theano is similar to TensorFlow, but a bit older and harder to start with. For example, you need to manually write updates of variables. Typical neural network layers are not included, so one often uses libraries such as Lasagne. If you’re looking for a place to start, I like this introduction: Theano Tutorial by Marek Rei
At the same time, if you see some nice code in Torch or PyTorch, don’t be afraid to install and run it!
Datasets
Every machine learning problem needs data. You cannot just tell it “detect if there is a cat in this picture” and expect the computer to tell you the answer. You need to show many instances of cats, and pictures not containing cats, and (hopefully) it will learn to generalize it to other cases. So, you need some data to start. And it is not a drawback of machine learning or just deep learning - it is a fundamental property of any learning! Before you dive into uncharted waters, it is good to take a look at some popular datasets. The key part about them is that they are… popular. It means that you can find a lot of examples of what works. And have a guarantee that these problems can be solved with neural networks.
MNIST
Many good ideas will not work well on MNIST (e.g. batch norm). Inversely many bad ideas may work on MNIST and no[t] transfer to real [computer vision].
- François Chollet’s tweet
Still, I recommend starting with the MNIST digit recognition dataset (60k grayscale 28x28 images), included in keras.datasets. It is not necessary to master it, but just to get a sense that it works at all (or to test the basics of Keras on your local machine).
notMNIST
Indeed, I once even proposed that the toughest challenge facing AI workers is to answer the question: “What are the letters ‘A’ and ‘I’?” - Douglas R. Hofstadter (1995)
A more interesting dataset, and harder for classical machine learning algorithms, is notMNIST (letters A-J from strange fonts). If you want to start with it, here is my code for notMNIST loading and logistic regression in Keras.
CIFAR
If you want to play with image recognition, there is the CIFAR dataset, a dataset of 32x32 photos (also in keras.datasets). It comes in two versions: 10 simple classes (including cats, dogs, frogs and airplanes) and 100 harder and more nuanced classes (including beaver, dolphin, otter, seal and whale). I strongly suggest starting with CIFAR-10, the simpler version. Beware, more complicated networks may take quite some time (~12h on the CPU of my 7-year-old MacBook Pro).
More
Deep learning requires a lot of data. If you want to train your network from scratch, it may require as many as ~10k images even if low-resolution (32x32). Especially if data is scarce, there is no guarantee that a network will learn anything. So, what are the ways to go?
use really low res (if your eye can see it, no need to use higher resolution)
get a lot of data (for images like 256x256 it may be: millions of instances)
re-train a network that already saw a lot
generate much more data (with rotations, shifts, distortions)
Often, it’s a combination of everything mentioned here.
Standing on the shoulders of giants
Creating a new neural network has a lot in common with cooking - there are typical ingredients (layers) and recipes (popular network architectures).
The most important cooking contest is the ImageNet Large Scale Visual Recognition Challenge, with recognition of 1000 classes from a dataset of over a million photos. Look at these Neural Network Architectures, typically using 224x224x3 input (chart by Eugenio Culurciello): Circle size represents the number of parameters (a lot!). It doesn’t mention SqueezeNet though, an architecture vastly reducing the number of parameters (e.g. 50x fewer). A few key networks for image classification can be readily loaded from the keras.applications module: Xception, VGG16, VGG19, ResNet50, InceptionV3. Some others are not as plug & play, but still easy to find online - yes, there is SqueezeNet in Keras. These networks serve two purposes:
they give insight into useful building blocks and architectures
they are great candidates for retraining (so-called transfer learning, when using an architecture along with pre-trained weights)
Some other important network architectures for images:
U-Net: Convolutional Networks for Biomedical Image Segmentation
Retina blood vessel segmentation with a convolution neural network - Keras implementation
Deep Learning Tutorial for Kaggle Ultrasound Nerve Segmentation competition, using Keras
A Neural Algorithm of Artistic Style
A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN by Dhruv Parthasarathy
Neural Style Transfer & Neural Doodles implemented in Keras by Somshubra Majumdar
Another set of insights:
The Neural Network Zoo by Fjodor van Veen
How to train your Deep Neural Network - how many layers, parameters, etc
Infrastructure
For very small problems (e.g. MNIST, notMNIST), you can use your personal computer - even if it is a laptop and computations are on CPU. For small problems (e.g. CIFAR, the unreasonable RNN), you might be still able to use a PC, but it requires much more patience and trade-offs. For medium and larger problems, essentially the only way to go is to use a machine with a strong graphics card (GPU).
For example, it took us 2 days to train a model for satellite image processing for a Kaggle competition, see our: Deep learning for satellite imagery via image segmentation by Arkadiusz Nowaczyński
On a strong CPU it would have taken weeks, see: Benchmarks for popular convolutional neural network models by Justin Johnson
The easiest, and the cheapest, way to use a strong GPU is to rent a remote machine on a per-hour basis. You can use Amazon (it is not only a bookstore!), here are some guides:
Keras with GPU on Amazon EC2 - a step-by-step instruction by Mateusz Sieniawski, my mentee
Running Jupyter notebooks on GPU on AWS: a starter guide by Francois Chollet
Further learning
I encourage you to interact with code. For example, notMNIST or CIFAR-10 can be great starting points. Sometimes the best start is to take someone else’s code and run it, then see what happens when you modify parameters. For learning how it works, this one is a masterpiece: CS231n: Convolutional Neural Networks for Visual Recognition by Andrej Karpathy and the lecture videos
When it comes to books, there is a wonderful one, starting from an introduction to mathematics and the machine learning context (it even covers log-loss and entropy in a way I like!): Deep Learning, An MIT Press book by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Alternatively, you can use (it may be good for an introduction with interactive materials, but I’ve found the style a bit long-winded): Neural Networks and Deep Learning by Michael Nielsen
Other materials
There are many applications of deep learning (it’s not only image recognition!). I collected some introductory materials to cover its various aspects (beware: they are of various difficulty). Don’t try to read them all - I list them for inspiration, not intimidation!
General
The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy
How convolutional neural networks see the world - Keras Blog
What convolutional neural networks look at when they see nudity - Clarifai Blog (NSFW)
Convolutional neural networks for artistic style transfer by Harish Narayanan
Dreams, Drugs and ConvNets - my slides (NSFW); I am considering turning it into a longer post on machine learning vs human learning, based on common mistakes
Technical
Yes you should understand backprop by Andrej Karpathy
Transfer Learning using Keras by Prakash Vanapalli
Generative Adversarial Networks (GANs) in 50 lines of code (PyTorch)
Minimal and Clean Reinforcement Learning Examples
An overview of gradient descent optimization algorithms by Sebastian Ruder
Picking an optimizer for Style Transfer by Slav Ivanov
Building Autoencoders in Keras by Francois Chollet
Understanding LSTM Networks by Chris Olah
Recurrent Neural Networks & LSTMs by Rohan Kapur
Oxford Deep NLP 2017 course
List of resources
How to Start Learning Deep Learning by Ofir Press
A Guide to Deep Learning by YN^2
Staying up-to-date:
r/MachineLearning Reddit channel covering most of new stuff
distill.pub - an interactive, visual, open-access journal for machine learning research, with expository articles
my links at pinboard.in/u:pmigdal/t:deep-learning - though, just saving, not an automatic recommendation
@fastml_extra Twitter channel
GitXiv for papers with code
don’t be afraid to read academic papers; some are well-written and insightful (if you own Kindle or another e-reader, I recommend Dontprint)
Data (usually from challenges)
Kaggle
AF Classification from a short single lead ECG recording: the PhysioNet/Computing in Cardiology Challenge 2017
iNaturalist 2017 Competition (675k images with 5k species), vide Mushroom AI
Thanks
I would like to thank Kasia Kulma, Martina Pugliese, Paweł Subko, Monika Pawłowska and Łukasz Kidziński for helpful feedback on the content and to Sarah Martin for
polishing my English. If you recommend a source that helped you with your adventure with deep learning - feel invited to contact me! (@pmigdal for short links, an email for longer remarks.) The deep learning meme is not mine - I’ve just rewritten it from Theano to Keras (with TensorFlow backend).
[1] NOAA Right Whale Recognition, Winners’ Interview (1st place, Jan 2016), and a fresh one: Deep learning for satellite imagery via image segmentation (4th place, Apr 2017).
[2] This January during a 5-day workshop 6 high-school students participated in a rather NSFL project - constructing a neural network for detecting trypophobia triggers, see e.g. grzegorz225/trypophobia-detector and cytadela8/trypophobia_detector.
[3] It made a few episodes of webcomics obsolete: xkcd: Tasks (totally, by Park or Bird?), xkcd: Game AI (partially, by AlphaGo), PHD Comics: If TV Science was more like REAL Science (not exactly, but still it’s cool, by LapSRN).
[4] The title alludes to The Unreasonable Effectiveness of Mathematics in the Natural Sciences by Eugene Wigner (1960), one of my favourite texts in philosophy of science. Along with More is Different by P. W. Anderson (1972) and Genesis and development of a scientific fact (pdf here) by Ludwik Fleck (1935).
[5] If your background is in quantum information, the only thing you need to change is ℂ to ℝ. Just expect less tensor structure, but more convolutions.
[6] Is it only me, or does Theano tensor dimension order sound like some secret convent? Before you start searching how to join it: it is about the shape of multi-dimensional arrays: (samples, channels, x, y) rather than TensorFlow’s (samples, x, y, channels).
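The two dimension orders mentioned in the last footnote differ only in where the channel axis sits. As a rough illustration, here is a plain-Python sketch that moves one image from (channels, x, y) to (x, y, channels) using nested lists. The function name and the tiny example image are assumptions made for this sketch.

```python
def channels_first_to_last(image):
    """Reorder one image from (channels, x, y) to (x, y, channels) nested lists."""
    channels = len(image)
    size_x = len(image[0])
    size_y = len(image[0][0])
    return [[[image[c][x][y] for c in range(channels)]
             for y in range(size_y)]
            for x in range(size_x)]


# A hypothetical 2-channel image of size 2 x 3, stored channels-first.
image = [
    [[1, 2, 3], [4, 5, 6]],        # channel 0
    [[10, 20, 30], [40, 50, 60]],  # channel 1
]
converted = channels_first_to_last(image)
# The dimensions are now (x, y, channels).
print(len(converted), len(converted[0]), len(converted[0][0]))
# Both channel values stored at position x=1, y=2.
print(converted[1][2])
```

For a whole batch held in a NumPy array, the same reordering would normally be a single call such as np.transpose(batch, (0, 2, 3, 1)), which keeps the sample axis first and moves the channel axis to the end.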
Recurrent Neural Network 2016年07月01日 Deep learning Deep learning 字数:24235 this blog from: http://jxgu.cc/blog/recent-advances-in-RNN.html References Robert Dionne Neural Network Paper Notes Baisc Improvements 20170326 Learning Simpler Language Models with the Delta Recurrent Neural Network Framework 20170316 Machine Learning on Sequential Data Using a Recurrent Weighted Average 20161029 Phased LSTM Accelerating Recurrent Network Training for Long or Event based Sequences 20161020 Using Fast Weights to Attend to the Recent Past 20161017 Interactive Attention for Neural Machine Translation 20160908 LSTM GRU Highway and a Bit of Attention An Empirical Overview for Language Modeling in Speech Recognition 20160811 Recurrent Highway Networks 20160721 Layer Normalization 20160713 Recurrent Memory Array Structures 20160524 Sequential Neural Models with Stochastic Layers 20160513 LSTM with Working Memory 20160412 Recurrent Batch Normalization 20160209 Associative Long Short-Term Memory 20151214 Memory-based control with recurrent neura networks 20151105 Quasi-Recurrent Neural Networks 20150503 ReNet A Recurrent Neural Network Based Alternative to Convolutional Networks 20150331 End-To-End Memory Networks 20150209 Gated Feedback Recurrent Networks 20150204 Spatial Transformer Networks 20140908 Recurrent Neural Network Regularization Image captioning 20170330 Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training 20170218 MAT A Multimodal Attentive Translator for Image Captioning 20161214 Optimization of image description metrics using policy gradient methods 20161212 Text guided Attention Model for Image Captioning 20161205 Recurrent Image Captioner Describing Images with Spatial-Invariant Transformation and Attention Filtering 20161203 Areas of Attention for Image Captioning 20160809 Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders 20160706 Sort Story Sorting Jumbled Images and 
Captions into Stories 20160701 Domain Adaptation for Neural Networks by Parameter Augmentation 20160620 Variational Autoencoder for Deep Learning of Images Labels and Captions 20160615 Image Caption Generation with Text-Conditional Semantic Attention 20160607 Encode Review and Decode Reviewer Module for Caption Generation 20160605 Multimodal Residual Learning for Visual QA 20160531 Attention Correctness in Neural Image Captioning 20160512 Movie Description 20160503 Improving Image Captioning by Concept-based Sentence Reranking 20160501 Delving Deeper into Convolutional Networks for Learning Video Representations 20160419 Show Attend and Tell Neural Image Caption Generation with Visual Attention 20160406 Improving LSTM-based Video Descriptionwith Linguistic Knowledge Mined from Text 20160404 Image Captioning with Deep Bidirectional LSTMs 20160330 Rich Image Captioning in the Wild 20160330 Dense Image Representation with Spatial Pyramid VLAD Coding of CNN for Locally Robust Captioning 20160328 Generating Visual Explanations 20160328 Attend Infer Repeat:Fast Scene Understanding with Generative Models 20160324 A Diagram Is Worth A Dozen Images 20160301 ORDER-EMBEDDINGS OFIMAGES AND LANGUAGE 20160228 Generating Visual Explanations 20151117 Deep Compositional Captioning Describing Novel Object Categories without Paired Training Data 20151111 Deep Multimodal Semantic Embeddings for Speech and Images 20151109 Generating Images From Captions With Attention 20151027 Learning Deep Representations of Fine-Grained Visual-Descriptions 20151026 Video Paragraph Captioning using Hierarchical RecurrentNeuralNetworks 20151013 Summarization based Video Caption via DeepNeuralNetworks 20150916 Guiding Long-Short Term Memory for Image Caption Generation 20150829 Multimodal Convolutional Neural Networks for Matching Image and Sentence 20150604 The Long Short Story of Movie Description 20150604 Jointly Modeling Embedding and Translation to Bridge Video and Language 20150420 Show and Tell A 
Neural Image Caption Generator 20150414 Deep Visual-Semantic Alignments for Generating Image Descriptions 20150303 Sequence to Sequence Video to Text 20150227 Describing Videos by Exploiting Temporal Structure 20150212 Phrase based Image Captioning 20150210 Show Attend and Tell Neural Image Caption Generation with Visual Attention Image generation 20170411 A neural representation of sketch drawings 20161214 VAE vs GAN 20160819 Pixel Recurrent Neural Networks 20160726 Semantic Image Inpainting with Perceptual and Contextual Losses 20160629 Towards Conceptual Compression 20160619 Generating Images Part by Part with Composite Generative Adversarial Networks 20160616 Conditional Image Generation with PixelCNN Decoders 20160610 Improved Techniques for Training GANs 20160610 Deep Directed Generative Models with Energy-Based Probability Estimation 20160605 Generative Adversarial Text to Image Synthesis 20160529 Generating images with recurrent adversarial networks 20160526 Domain-Adversarial Training of Neural Networks 20160526 Adversarial Autoencoders 20160320 Segmentation from Natural Language Expressions 20160229 Generating Images from Captions with Attention 20151119 Unsupervised Learning of Visual Structure using Predictive Generative Networks 20151109 Generating Images From Captions With Attention 20150216 DRAW A Recurrent Neural Network For Image Generation 20140610 Generative Adversarial Networks Visual question answering 20160926 The Color of the Cat is Gray 1 Million Full Sentences Visual Question Answering 20160620 DualNet Domain-Invariant Network for Visual Question Answering 20160612 Training Recurrent Answering Units with Joint Loss Minimization for VQA 20160605 Multimodal Residual Learning for Visual QA 20160509 Ask Your Neurons A Deep Learning Approach to Visual Question Answering 20160504 Leveraging Visual Question Answering for Image-Caption Ranking 20160420 Question Answering via Integer Programming over Semi-Structured Knowledge 20160406 A Focused 
Dynamic Attention Model for VQA 20160319 Generating Natural Questions About an Image 20160309 Image Captioning and Visual QuestionAnswering Based on Attributes and TheirRelated External Knowledge 20160304 Dynamic Memory Networks for Visual and Textual Question Answering 20160208 Visualizing and Understanding Neural Models in NLP 20160202 Where To Look Focus Regions for Visual Question Answering 20151209 MovieQA Understanding Stories in Movies through Question-Answering 20151123 Where to look Focus regions for visual question answering 20151118 Learning to Answer Questions From Image Using Convolutional Neural Network 20151118 Compositional Memory for Visual Question Answering 20151118 An attention based convolutional neural network for visual question answering 20151117 Ask, Attend and Answer Exploring question-guided spatial attention for visual question answering 20151112 LSTM-based Deep Learning Models for Non-factoid Answer Selection 20151111 Visual7W Grounded Question Answering in Images 20151109 Explicit Knowledge-based Reasoning for Visual Question Answering 20151107 Stacked attention networks for image question answering 20151107 Simple Baseline for Visual Question Answering 20150521 Are you talking to a machine? 
dataset and methods for multilingual image question answering 20150508 Exploring Models and Data for Image Question－Answering 20150508 Exploring Models and Data for Image Question Answering 20150505 Ask Your Neurons A Neural-based Approach to Answering Questions about Images 20150503 VQA Visual Question Answering Natural language processing 20170330 Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training 20170324 Sequence-to-Sequence Models Can Directly Transcribe Foreign Speech 20170302 Controllable Text generation 20170208 A Hybrid Convolutional Variational Autoencoder for Text Generation 20161220 Hierarchical Softmax 20160818 Full Resolution Image Compression with Recurrent Neural Networks 20160803 Learning Online Alignments with Continuous Rewards Policy Gradient 20160803 Dependency-based Convolutional Neural Networks 20160726 An Actor-Critic Algorithm for Sequence Prediction 20160722 Syntax-based Attention Model for Natural Language Inference 20160718 Neural Machine Translation with Recurrent Attention Modeling 20160715 Neural Tree Indexers for Text Understanding 20160715 Neural Machine Translation with Recurrent Attention Modeling 20160715 Attention-over-Attention Neural Networks for Reading Comprehension 20160710 CHARAGRAM Embedding Words and Sentences via Character n-grams 20160705 Chains of Reasoning over Entities Relations and Text using Recurrent Neural Networks 20160703 快进连接改进Attention机制 深度学习提升机器翻译效果 20160621 Topic Augmented Neural Response Generation with a Joint Attention Mechanism 20160615 The Enemy in Your Own Camp How Well Can We Detect Statistically-Generated Fake Reviews–An Adversarial Study 20160613 Attention-based Multimodal Neural Machine Translation 20160607 Memory-enhanced Decoder for Neural Machine Translation 20160606 Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification 20160606 A Decomposable Attention Model for Natural Language Inference 20160605 Deep Reinforcement Learning for 
Dialogue Generation 20160524 Hierarchical Memory Networks 20160524 Combining Recurrent and Convolutional Neural Networks for Relation Classification 20160509 Parse tree 20160428 Crafting Adversarial Input Sequences for Recurrent Neural Networks 20160407 Sentence Level Recurrent Topic Model Letting Topics Speak for Themselves 20160406 A Recurrent Latent Variable Model for Sequential Data 20160404 Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models 20160401 Building Machines That Learn and Think Like People 20160323 Latent Predictor Networks for Code Generation 20160322 Fully Convolutional Attention Localization Networks Efficient AttentionLocalization for Fine-Grained Recognition 20160221 Learning Semantic Representations using Relations 20160219 Contextual LSTM (CLSTM) models for Large scale NLP tasks 20160208 Efficient Algorithms for Adversarial Contextual Learning 20160207 Exploring the Limits of Language Modeling 20160206 WebNav A New Large-Scale Task for Natural Language based Sequential Decision Making 20160206 Recurrent Memory Network for Language Modeling 20160206 Multi-Way Multilingual Neural Machine Translation with a Shared Attention Mechanism 20160201 Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers 20160110 Strategies for Training Large Vocabulary Neural Language Models 20151227 Learning Document Embeddings by Predicting N-grams for Sentiment Classification of Long Movie Reviews 20151215 Increasing the Action Gap New Operators for Reinforcement Learning 20151203 Target-Dependent Sentiment Classification with Long Short Term Memory 20151203 Neural Enquirer Learning to Query Tables in Natural Language 20151201 Multilingual Language Processing From Bytes 20151119 Multi-task Sequence to Sequence Learning 20151119 Alternative structures for character-level RNNs 20151111 Larger-Context Language Modeling 20151104 Semi-supervised Sequence Learning 20151101 A Unified Tagging 
Solution Bidirectional LSTM Recurrent Neural Network with Word Embedding 20151031 Top down Tree LSTM Networks 20151029 Attention with Intention for a Neural Network Conversation Model 20151026 Thinking on your Feet Reinforcement Learning for Incremental Language Tasks 20151013 A Sensitivity Analysis of Convolutional Neural Networks for Sentence Classification 20151011 A Diversity-Promoting Objective Function for Neural Conversation Models 20150902 A Neural Attention Model for Abstractive Sentence Summarization 20150822 Towards Neural Network-based Reasoning 20150726 Improved Semantic Representations From Tree-Structured Long Short TermMemory Networks 20150629 Document Embedding with Paragraph Vectors 20150626 On Using Very Large Target Vocabulary for Neural Machine Translation 20150622 Skip-Thought Vectors 20150619 Deep Knowledge Tracing 20150619 A Neural Conversational Model 20150617 Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models 20150610 Teaching Machines to Read and Comprehend 20150609 Scheduled sampling for sequence prediction with recurrent neural networks 20150531 A Neural Network Approach to Context Sensitive Generation of Conversational Responses 20150427 Neural Responding Machine for Short-Text Conversation 20150205 Character-level Convolutional Networks for Text Classification 20141223 Grammar as a Foreign Langauage 20141017 Learning to Execute 20140910 Sequence to Sequence Learning with Neural Networks 20140903 On the Properties of Neural Machine Translation Encoder-Decoder Approaches 20140901 Neural Machine Translation by Jointly Learning to Align and Translate 20140624 Recurrent Models of Visual Attention 20140603 Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation 20140516 Distributed Representations of Sentences and Documents 20050909 METEOR An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments 20010917 Bleu a method for automatic 
evaluation of machine translation
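Paper reading: Prominent Object Detection and Recognition: A Saliency-based Pipeline
As the figure above shows, this paper addresses one question: given an image, which region deserves the most attention, how should it be segmented, and what is the object? The three sub-problems are treated as one.
Problem formulation: Given an image, determine the most influential item in the scene in terms of region of interest, pixel-level extent (segmentation), and object type.
The framework proposed by the authors is:
As the flowchart shows, the pipeline is simply a cascade and combination of the individual tasks. The network is also trained stage by stage, one level after another, so there is not much more to say about it.
Some of the experimental results look rather poor:
Others look acceptable, though still not great: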
Training and testing the SSD object detection algorithm
GitHub: https://github.com/stoneyang/caffe_ssd
Paper: https://arxiv.org/abs/1512.02325
1. Install caffe_SSD:
    git clone https://github.com/weiliu89/caffe.git
    cd caffe
    git checkout ssd
2. Build Caffe from the repository root:
    # Modify Makefile.config according to your Caffe installation.
    cp Makefile.config.example Makefile.config
    make -j24
    # Make sure to include $CAFFE_ROOT/python in your PYTHONPATH.
    make pycaffe
    # Then export your Python path into the environment. This step is important; skipping it may produce errors.
    export PYTHONPATH=/home/wangxiao/Documents/caffe/python:$PYTHONPATH
Things rarely go that smoothly, though, otherwise you would not be reading this. During the build you may hit this bug:
    json_parser_read.hpp:257:264: error: ‘type name’ declared as function returning an array escape
To keep going with SSD, edit json_parser_read.hpp so that the build can continue. Locate the file (for example, open the file manager, select Computer, and search for json_parser_read.hpp), then open it with:
    sudo gedit /usr/include/boost/property_tree/detail/json_parser_read.hpp
Comment out the escape block that starts at line 257, as follows:
    /*escape = chset_p(detail::widen<Ch>("\"\\/bfnrt").c_str()) [typename Context::a_escape(self.c)] | 'u' >> uint_parser<unsigned long, 16, 4, 4>() [typename Context::a_unicode(self.c)] ;*/
3. Once the build finishes, download the VGG-16 model pre-trained on ImageNet that the authors use: Download fully convolutional reduced (atrous) VGGNet. By default, we assume the model is stored in $CAFFE_ROOT/models/VGGNet/
4. Download the datasets used for training and testing, Pascal VOC 2007 and 2012:
    # Download the data.
    cd $HOME/data
    wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
    wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
    wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
    # Extract the data.
    tar -xvf VOCtrainval_11-May-2012.tar
    tar -xvf VOCtrainval_06-Nov-2007.tar
    tar -xvf VOCtest_06-Nov-2007.tar
5. Package these files into lmdb databases:
    ./data/VOC0712/create_list.sh
    ./data/VOC0712/create_data.sh
6. With the datasets prepared, adjust the relevant parameters and paths so that training runs on your own machine:
    vim $CAFFE_ROOT/examples/ssd/ssd_pascal.py
The main items are:
    1. the path to the training lmdb;
    2. the path to the test lmdb;
    3. gpus="0,1,2,3" ===> change to "0";
    4. batch_size = 32 ==>> 20 works better, because 32 may exhaust GPU memory.
7. With all of the above taken care of there should be no further problems; at the time of writing my run had reached iteration 360.
That covers the training part of SSD.
Reference:
1. http://blog.csdn.net/tfy1028/article/details/53289106
2. http://blog.csdn.net/zhang_shuai12/article/details/52346878
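The effect of the gpus and batch-size changes in step 6 can be sketched in a few lines of Python. This is a hypothetical helper and not code taken from the SSD repository: the name plan_batches and the accum_batch_size parameter are assumptions made for illustration. It splits a batch across the listed GPUs and computes how many passes would have to be accumulated to reach a target effective batch size.

```python
import math


def plan_batches(gpus: str, batch_size: int, accum_batch_size: int) -> dict:
    """Split a batch across GPUs and derive a gradient-accumulation count.

    Hypothetical helper for illustration only; the names are assumptions.
    """
    gpu_ids = [g.strip() for g in gpus.split(",") if g.strip()]
    num_gpus = max(len(gpu_ids), 1)  # treat an empty list as one device
    # Each device gets an equal share of the batch, rounded up.
    per_device = math.ceil(batch_size / num_gpus)
    # Number of passes to accumulate before one weight update.
    iter_size = math.ceil(accum_batch_size / (per_device * num_gpus))
    return {
        "num_gpus": num_gpus,
        "batch_size_per_device": per_device,
        "iter_size": iter_size,
    }


if __name__ == "__main__":
    # Four GPUs with the default batch size, then one GPU with the smaller batch.
    print(plan_batches("0,1,2,3", 32, 32))
    print(plan_batches("0", 20, 32))
```

With a single GPU the whole batch lands on one device, which is why lowering the batch size is the usual way to stay within GPU memory.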
Deep Learning framework --- MXNet: installation, testing, and a summary of related issues
I. Installation: see this post: http://www.open-open.com/lib/view/open1448030000650.html
Note: gcc and g++ version 4.8 are required.
II.
Install and Compile MatConvNet: CNNs for MATLAB --- Deep Learning framework 2017-04-18 10:19:35
To use MatConvNet, install it as follows:
1. Download and unzip the source from: http://www.vlfeat.org/matconvnet/
2. Compile it from within MATLAB:
    > cd <MatConvNet>
    > addpath matlab
    > vl_compilenn
3. To use the GPU, compile with:
    > vl_compilenn('enableGpu', true, 'cudaRoot', '/usr/local/cuda-8.0', 'cudaMethod', 'nvcc')
4. Then test whether the installation succeeded:
    > vl_testnn('gpu', true)
Useful Tips:
1. Avoid 1.0-beta20, which may contain a bug: I ran into an error with that version. After switching to 1.0-beta24, everything worked.
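Learning Cross-Modal Deep Representations for Robust Pedestrian Detection 2017-04-11 19:40:22
Motivation: This paper considers how to make full use of thermal data, under extremely poor illumination, to help learn a stronger feature representation for visible-light (RGB) images. It borrows the supervision transfer method of Gupta et al. (reference 2, CVPR 2016): features from one modality serve as labels, so that unsupervised learning is carried out in a supervised fashion. That may sound confusing, so what does it mean? It is supervised because the RGB branch is trained with a supervised objective whose labels are the features extracted from the thermal data (called target labels). It is unsupervised because no human-annotated data is used, only the thermal features extracted by a network, and that is the attractive part. This is also the main selling point of the supervision transfer paper; here the authors apply it to the multi-modal setting.
The overall network design follows naturally:
1. First comes the cross-modal feature transfer, which is the first stage:
The figure above shows the feature transfer by supervised learning just described.
2. With the enhanced features, pedestrian detection in dark environments becomes possible:
The network design mainly combines the original features with the enhanced features, followed by the final bounding box regression and softmax classification, which completes the pedestrian detection algorithm. The improvement comes mainly from the second network, which supplies better features for dark environments, learned from the thermal data.
That is the main idea of the paper. The authors ran experiments on two datasets; see the paper for the detailed results.
Reference:
1. Learning Cross-Modal Deep Representations for Robust Pedestrian Detection. In CVPR, 2017.
2. S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
3. J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In CVPR, 2016.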
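The first stage described above, regressing RGB features onto thermal features that act as target labels, can be sketched with a toy example. This is a minimal sketch under strong assumptions and not the paper's architecture or loss: the "RGB network" is a single scalar linear map, the thermal features are fixed numbers, the loss is a plain mean squared error, and the name fit_rgb_to_thermal is made up for illustration.

```python
def mse(pred, target):
    """Mean squared error between two equally long lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def fit_rgb_to_thermal(rgb_inputs, thermal_features, lr=0.1, steps=2000):
    """Fit f(x) = w * x + b so that f(rgb) matches the thermal features.

    The thermal features play the role of labels; no human annotation is used.
    Toy stand-in for a feature-regression stage (assumption, not the paper's model).
    """
    w, b = 0.0, 0.0
    n = len(rgb_inputs)
    for _ in range(steps):
        preds = [w * x + b for x in rgb_inputs]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (p - t) * x
                     for p, t, x in zip(preds, thermal_features, rgb_inputs)) / n
        grad_b = sum(2 * (p - t) for p, t in zip(preds, thermal_features)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    rgb = [0.0, 0.5, 1.0, 1.5]        # stand-ins for RGB inputs
    thermal = [1.0, 2.0, 3.0, 4.0]    # stand-ins for frozen thermal features
    w, b = fit_rgb_to_thermal(rgb, thermal)
    print(w, b, mse([w * x + b for x in rgb], thermal))
```

The point of the sketch is only the training signal: the regression target comes from another modality's features rather than from annotated boxes.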
Common LaTeX problems and their solutions 2017-04-10 22:05:48
1. With the IEEE template, the corresponding-author note does not display properly (the footnote is not shown). IEEE disables this feature by default, so add the following line to the .tex file you are editing:
    \IEEEoverridecommandlockouts
Then:
The generated PDF now displays correctly:
2. When citing references, the & character often causes problems, for example:
    @article{Wang2016Person, title={Person Re-Identification by Discriminative Selection in Video Ranking}, author={Wang, Taiqing and Gong, Shaogang and Zhu, Xiatian and Wang, Shengjin}, journal={IEEE Transactions on Pattern Analysis & Machine Intelligence}, volume={38}, number={12}, pages={1-1}, year={2016}, }
The & here cannot be compiled by LaTeX as it stands; escape it as \& or simply delete it. Another point to note: when LaTeX pulls references from a .bib file, four steps are needed: ==>> pdflatex, bibtex, pdflatex, pdflatex. Only then do the references display correctly in the generated PDF.
3. TeX Live on Linux: you can install TeX Live 2017 and Texmaker from the software center.
    $ sudo mount -a /path/to/your/textlive/ /home/wangxiao/textlive/
    $ cd /home/wangxiao/textlive/
    $ sudo ./install-tl -gui
    ## setting the environment: run sudo gedit ~/.bashrc and add the following lines to the file (the year in the paths must match the installed TeX Live release).
    export MANPATH=${MANPATH}:/usr/local/texlive/2016/texmf-dist/doc/man
    export INFOPATH=${INFOPATH}:/usr/local/texlive/2016/texmf-dist/doc/info
    export PATH=${PATH}:/usr/local/texlive/2016/bin/x86_64-linux
Configure Texmaker according to the figures above.
4. Typesetting formulas with a large brace { in LaTeX:
    $$ f(x)=\left\{ \begin{aligned} x & = & \cos(t) \\ y & = & \sin(t) \\ z & = & \frac xy \end{aligned} \right. $$
    $$ F^{HLLC}=\left\{ \begin{array}{rcl} F_L & & {0 < S_L}\\ F^*_L & & {S_L \leq 0 < S_M}\\ F^*_R & & {S_M \leq 0 < S_R}\\ F_R & & {S_R \leq 0} \end{array} \right. $$
    $$f(x)= \begin{cases} 0& \text{x=0}\\ 1& \text{x!=0} \end{cases}$$
The results are, respectively:
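The & fix from item 2 can be automated with a small script. The sketch below is a hypothetical helper, not part of any LaTeX tool, and the name escape_ampersands is an assumption. It escapes every & that is not already preceded by a backslash, which also touches fields such as URLs, so review the result before using it on a real .bib file.

```python
import re

# An ampersand that is not already escaped with a backslash.
_UNESCAPED_AMP = re.compile(r"(?<!\\)&")


def escape_ampersands(bib_text: str) -> str:
    """Return bib_text with every bare '&' replaced by '\\&'.

    Hypothetical helper: it does not parse BibTeX, it only rewrites characters.
    """
    return _UNESCAPED_AMP.sub(lambda match: "\\&", bib_text)


if __name__ == "__main__":
    entry = "journal={IEEE Transactions on Pattern Analysis & Machine Intelligence},"
    print(escape_ampersands(entry))
```

Running it twice is harmless, because an already escaped \& is left alone.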
Frequently used operations on a Linux system 2017-04-08 12:48:09
1. Mount a hard disk:
    #: fdisk -l    %% list the disks and partitions the computer has detected
    #: mkdir yiDongYingPan    %% create a directory to use as the mount point; the files on the disk will appear under it
    #: mount /dev/sdc1 /home/wx/data/yiDongYingPan/
You can then open the yiDongYingPan directory and work with your files.
2. Clear the RAM (drop the page cache, dentries and inodes):
    #: free -m    %% check the current memory usage
    #: sync
    #: echo 3 > /proc/sys/vm/drop_caches
3. On a newly installed system, Yakuake often shows the following error: Yakuake was unable to load the Konsole component. A Konsole installation is required to use Yakuake. ==>> Solution: ???
4. Unable to access “605 GB Volume”
Error mounting /dev/sda5 at /media/wangxiao/E1F171026416B63F: Command-line `mount -t "ntfs" -o "uhelper=udisks2,nodev,nosuid,uid=1000,gid=1000,dmask=0077,fmask=0177" "/dev/sda5" "/media/wangxiao/E1F171026416B63F"' exited with non-zero exit status 14: The disk contains an unclean file system (0, 0). Metadata kept in Windows cache, refused to mount. Failed to mount '/dev/sda5': Operation not permitted. The NTFS partition is in an unsafe state. Please resume and shutdown Windows fully (no hibernation or fast restarting), or mount the volume read-only with the 'ro' mount option.
==>> sudo ntfsfix /dev/sda5
5.
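To see what the cache-dropping commands in item 2 change, it helps to compare the cache figures before and after. The sketch below reads the same kernel file that free reports from, /proc/meminfo, and converts its kB fields to MiB. The function names parse_meminfo and cache_mib are assumptions for illustration, and Buffers + Cached is only a rough stand-in for the buff/cache column of free.

```python
def parse_meminfo(text: str) -> dict:
    """Return {field: MiB} for lines such as 'MemTotal:  16315420 kB'.

    Fields without a kB unit (for example HugePages_Total) are skipped.
    """
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1] == "kB":
            result[key.strip()] = int(parts[0]) // 1024
    return result


def cache_mib(meminfo: dict) -> int:
    """Approximate cache size in MiB as Buffers + Cached."""
    return meminfo.get("Buffers", 0) + meminfo.get("Cached", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
        print("free MiB:", info.get("MemFree"), "cache MiB:", cache_mib(info))
    except FileNotFoundError:
        print("/proc/meminfo is not available on this system")
```

Run it once before and once after the sync and drop_caches commands; the cache figure should fall and the free figure should rise.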
Speech and Natural Language Processing
Obtained from: https://github.com/edobashira/speech-language-processing
A curated list of speech and natural language processing resources. Other lists can be found in this list. If you want to contribute to this list (please do), send me a pull request. All sub-categories are listed in alphabetical order.

Finite State Toolkits and Regular Expressions
AT&T FSM Library - The AT&T FSM library (TM) is a set of general-purpose software tools available for Unix, for building, combining, optimizing, and searching weighted finite-state acceptors and transducers.
Carmel - Finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests/
Categorial semiring - Categorial semiring as described in Sproat et al. 2014
dk.brics.automaton - Java toolkit for FSAs and regular expressions.
Fare - Fare is a finite state and regular expression library for the .NET framework written in C#.
Foma - Finite-state compiler and C library
fsa - Toolkit used in RWTH ASR engine
fsm2.0 - Thomas Hanneforth's fsm 2.0 library, written in C++, has a few nice operations such as three-way composition
fstrain - A toolkit for training finite-state models
jopenfst - Java port of the C++ OpenFst library; originally forked from the CMU Sphinx project
Kleene programming language - High level finite state programming language built on top of OpenFst.
MIT FST Toolkit - WFST toolkit, no longer maintained, but features a few commands not found in other toolkits
MoMs-for-StochasticLanguages - Spectral and other training algorithms for WFSAs.
n Shortest Path for PDT - n Shortest Path for PDT
Noam - "Noam is a JavaScript library for working with automata and formal grammars for regular and context-free languages".
Also has pretty cool examples using viz.js
OpenFst - OpenFst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs).
openfst-utils - Nice set of utilities for OpenFst; includes an implementation of Categorial semirings.
openlat - Toolkit for manipulating word lattices, built on top of OpenFst. Includes support for reading and writing HTK compatible lattices.
PyFst - Python interface to OpenFst
SFST - Stuttgart Finite State Transducer Tools - "SFST is a toolbox for the implementation of morphological analysers and other tools which are based on finite state transducer technology."
Treba - "Treba is a basic command-line tool for training, decoding, and calculating with weighted (probabilistic) finite state automata (PFSA) and Hidden Markov Models (HMMs)."
Many of the tools in the machine translation section also implement interesting graph and semiring operations.

Language Modelling Toolkits
Bayesian Recurrent Neural Network for Language Modeling - This is a C/C++ implementation for Bayesian recurrent neural network for language modeling (BRNNLM)
Berkeley LM
Bigfatlm - Provides Hadoop training of Kneser-Ney language models, written in Java.
CSLM - "Continuous Space Language Model toolkit. CSLM toolkit is open-source software which implements the so-called continuous space language model."
DALM - Double array language model.
KenLM - Kenneth Heafield's language model toolkit, uses a very fast and low memory representation.
lwlm - lwlm is an exact, full Bayesian implementation of the Latent Words Language Model (Deschacht and Moens, 2009).
Maximum Entropy Modeling - Le Zhang has a comprehensive set of links related to MaxEnt models.
Maximum entropy language models: SRILM extension - "This patch adds the functionality to train and apply maximum entropy (MaxEnt) language models to the SRILM toolkit. Currently, only N-gram features are supported"
mitlm - My personal favourite LM toolkit, super fast and seems to get slightly higher accuracy.
MSRLM - "This scalable language-model tool is used to build language models from large amounts of data. It supports modified absolute discounting and Kneser-Ney smoothing."
OpenGrm - Language modelling toolkit for use with OpenFst.
cpyp - C++ library for modeling with Pitman-Yor processes
RandLM - Bloom filter based random language models
RNNLM - Recurrent neural network language model toolkit.
Refr - Re-ranking framework from the Johns-Hopkins workshop on confusion language modelling.
rwthlm - A toolkit for training neural network language models (feedforward, recurrent, and long short-term memory neural networks). The software was written by Martin Sundermeyer.
SRILM - Very popular toolkit; source code is available, but it is not free for commercial use.

Speech Recognition
AaltoASR - Aalto Automatic Speech Recognition tools
Barista - Barista is an open-source framework for concurrent speech processing.
Bavieca - New open source toolkit featuring static and dynamic decoders.
kaldi-nnet-dur-model - Neural network phone duration model on top of the Kaldi speech recognition framework (Interspeech paper)
CMU Sphinx - Open Source Toolkit For Speech Recognition Project by Carnegie Mellon University
HTK - "The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models."
Juicer - Juicer is a Weighted Finite State Transducer (WFST) based decoder for Automatic Speech Recognition (ASR).
Julius - "Julius is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers."
Kaldi - Modern open source toolkit led by Dan Povey featuring many state-of-the-art techniques.
OpenDcd - An Open Source WFST based Speech Recognition Decoder.
Phonetisaurus - Josef Novak's super fast WFST based Phoneticizer; the site also has some really nice tutorial slides.
Sail Align - SailAlign is an open-source software toolkit for robust long speech-text alignment implementing an adaptive, iterative speech recognition and text alignment scheme that allows for the processing of very long (and possibly noisy) audio and is robust to transcription errors. It is mainly written as a perl library but its functionality also depends…
SCARF: A Segmental CRF Toolkit for Speech Recognition - "SCARF is a toolkit for doing speech recognition with segmental conditional random fields."
trainc - David Rybach and Michael Riley's tool for direct construction of context-dependency transducers (Interspeech best paper).
RASR - RWTH ASR - The RWTH Aachen University Speech Recognition System

Signal Processing
An Interactive Source Separation Editor - "ISSE is an open-source, freely available, cross-platform audio editing tool that allows a user to perform source separation by painting on time-frequency visualisations of sound."
Bob - Bob is a free signal-processing and machine learning toolbox originally developed by the Biometrics group at Idiap Research Institute, in Switzerland.
Matlab Audio Processing Examples
SAcC - Subband Autocorrelation Classification Pitch Tracker - "SAcC is a (compiled) Matlab script that performs noise-robust pitch tracking by classifying the autocorrelations of a set of subbands using an MLP neural network."

Text-to-Speech
HTS - HMM-based speech synthesis
RusPhonetizer - Grammar rules and dictionaries for the phonetic transcription of Russian sentences

Speech Data
cmudict - CMUdict (the Carnegie Mellon Pronouncing Dictionary) is a free pronouncing dictionary of English.
LibriSpeech ASR corpus - LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.
TED-LIUM Corpus - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website.

Machine Translation
Berkeley Aligner - "...a word alignment software package that implements recent innovations in unsupervised word alignment."
cdec - "Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms"
Jane - "Jane is RWTH's open source statistical machine translation toolkit. Jane supports state-of-the-art techniques for phrase-based and hierarchical phrase-based machine translation."
Joshua - Hierarchical and syntax based machine translation decoder written in Java.
Moses - Standard open source machine translation toolkit.
alignment-with-openfst
zmert - Nice Java MERT implementation by Omar F. Zaidan

Machine Learning
BIDData - BIDMat is a matrix library intended to support large-scale exploratory data analysis. Its sister library BIDMach implements the machine learning layer.
libFM: Factorization Machine Library
sofia-ml - Fast incremental learning algorithms for classification, regression, ranking from Google.
Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms, Jasper Snoek, Hugo Larochelle and Ryan P. Adams, Advances in Neural Information Processing Systems, 2012

Deep Learning
Benchmarks - Comparison of different convolution network implementations.
Caffe - Really active deep learning toolkit with support for cuDNN and lots of other backends.
cuDNN - Deep neural network library from Nvidia with paper here. Torch 7 has support for cuDNN and here are some Python wrappers.
CURRENNT - Munich Open-Source CUDA RecurREnt Neural Network Toolkit described in this paper
gensim - Python topic modeling toolkit with word2vec implementation. Extremely easy to use and to install.
Glove - Global vectors for word representation.
GroundHog - Neural network based machine translation toolkit.
KALDI LSTM - C++ implementation of LSTM (Long Short Term Memory), in Kaldi's nnet1 framework. Used for automatic speech recognition, possibly language modeling etc.
OxLM: Oxford Neural Language Modelling Toolkit - Neural network toolkit for machine translation described in the paper here
Neural Probabilistic Language Model Toolkit - "NPLM is a toolkit for training and using feedforward neural language models (Bengio, 2003). It is fast even for large vocabularies (100k or more): a model can be trained on a billion words of data in about a week, and can be queried in about 40 μs, which is usable inside a decoder for machine translation."
RNNLM2WFST - Tool to convert RNNLMs to WFSTs
ViennaCL <http://viennacl.sourceforge.net/> - ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs.

Natural Language Processing
BLLIP reranking parser - "BLLIP Parser is a statistical natural language parser including a generative constituent parser (first-stage) and discriminative maximum entropy reranker (second-stage)."
OpenNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
SEAL - Set expander for any language described in this paper
Stanford CoreNLP - "Stanford CoreNLP provides a set of natural language analysis tools written in Java"

Applications
Cloud ASR using PyKaldi - "CloudASR is a software platform and a public ASR webservice."

Other Tools
GraphViz.sty - Really handy tool for adding the dot language directly to a LaTeX document, useful for tweaking the small colorized WFST figure in papers and presentations.

Blogs
Between One and Zero by William Hartmann
cmusphinx - CMU Sphinx related blog
Language Log
LingPipe Blog - Natural Language Processing and Text Analytics
Natural Language Processing Blog by Hal Daumé III
Spoken Language Processing - "Some thoughts on Spoken Language Processing, with tangents on Natural Language Processing, Machine Learning, and Signal Processing thrown in for good measure."

Books
DEEP LEARNING: Methods and Applications by Li Deng and Dong Yu
Foundations of Data Science, draft by John Hopcroft and Ravindran Kannan
Introduction to Matrix Methods and Applications (Working Title) by S. Boyd and L. Vandenberghe
Optical Flow related Tutorials 2017-04-01 10:50:55 Reference: 1. http://blog.csdn.net/carson2005/article/details/7581642 2.
The issues in Age Progression/Regression by Conditional Adversarial Autoencoder (CAAE)
Today I tried a new project named Face-Aging-CAAE.
Paper Name: Age Progression/Regression by Conditional Adversarial Autoencoder (CAAE)
Github: https://github.com/ZZUTK/Face-Aging-CAAE
I ran into several issues before I could run the code successfully. They are probably caused by the version of TensorFlow.
1. TypeError: Expected int32, got list containing Tensors of type '_Message' instead.
2. ValueError: Only call 'sigmoid_cross_entropy_with_logits' with named arguments (labels=..., logits=..., ...)
3. ValueError: Variable E_conv0/w/Adam/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope ?
The following changes to the code are needed to resolve the issues above. After that, the training process starts and its progress is displayed.
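The first two errors typically come from API changes introduced in TensorFlow 1.0: `tf.concat` takes the list of tensors first and the axis second (older code passed the axis first), and `tf.nn.sigmoid_cross_entropy_with_logits` only accepts keyword arguments. The sketch below is not the repository's actual patch. It only illustrates the call pattern, using two hypothetical wrapper functions (`concat_tensors`, `sigmoid_xent`) that receive the imported TensorFlow 1.x module as their `tf` argument. The third error, which concerns variable-scope reuse when the Adam optimizer creates its variables, is not covered here.

```python
# Hypothetical helpers, assuming `tf` is the TensorFlow 1.x module
# (e.g. obtained with `import tensorflow as tf`) passed in by the caller.


def concat_tensors(tf, tensors, axis):
    """Concatenate tensors using the TensorFlow >= 1.0 argument order.

    Pre-1.0 code wrote tf.concat(axis, tensors); on 1.x that call fails
    with "Expected int32, got list containing Tensors ...".
    """
    return tf.concat(values=tensors, axis=axis)


def sigmoid_xent(tf, labels, logits):
    """Sigmoid cross-entropy called with explicit keyword arguments.

    TensorFlow >= 1.0 rejects positional arguments for this function,
    so labels and logits are always passed by name.
    """
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
```

In the project code, the equivalent change would be to swap the argument order at each `tf.concat` call site and to name the `labels` and `logits` arguments at each cross-entropy call site.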
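Hands-on | LSTM Neural Network Architectures and 11 Variants in Diagrams (with Papers) 2016-10-02 机器之心 From FastML Author: Zygmunt Z. Compiled by 机器之心 Contributors: 老红, 李亚洲

Like the wild streams of the African savanna that split into many lakes and puddles after the rainy season, deep learning has split into a variety of specialized architectures. Each architecture has a diagram, and this article presents some of them.

Neural networks are conceptually simple, and that is their charm: a set of homogeneous, uniform units arranged in layers, with weighted connections between them. That is all there is to a neural network, at least in theory. Practice has turned out differently. Instead of feature engineering, we now have architecture engineering, as Stephen Merity describes:

The romanticized description of deep learning usually promises that the days of hand-crafted feature engineering are gone, and that the models themselves are advanced enough to solve the problem. Like most advertising, this is both true and misleading. Although deep learning has simplified feature engineering in many cases, it certainly has not eliminated it. As feature engineering has decreased, the structure of the machine learning models themselves has become more and more complex. Most of the time, these model architectures are specific to a given task, just as feature engineering used to be. To clarify, this is still an important step. Architecture engineering is more general than feature engineering and offers many new opportunities. That said, we cannot ignore the fact that we are still far from where we want to be.

LSTM diagrams
How should these architectures be explained? Naturally, with diagrams, which tend to make an explanation clearer. Let's first look at the two most popular kinds of network today, CNN and LSTM: Simple, right? Let's look more closely: As the saying goes, there may be a lot of mathematics you do not understand, but you gradually get used to it. Fortunately, there are many very good explanations. Still think LSTM is too complicated? Then let's try a simpler version, the GRU (Gated Recurrent Unit), which is quite trivial. Especially this one, known as the minimal GRU:

More diagrams
All kinds of LSTM variants are now common. Here is one, called the deep bidirectional LSTM:
DB-LSTM (see the paper: End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks)
The rest need little further explanation. Let's start with a combination of CNN and LSTM:
Convolutional residual memory network (see the paper: Convolutional Residual Memory Networks)
Dynamic NTM (see the paper: Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes)
Evolvable Neural Turing Machine (see the paper: Evolving Neural Turing Machines for Reward-based Learning)
Recurrent model of visual attention (see the paper: Recurrent Models of Visual Attention)
Unsupervised domain adaptation by backpropagation (see the paper: Unsupervised Domain Adaptation by Backpropagation)
Deeply recursive CNN for image super-resolution (see the paper: Deeply-Recursive Convolutional Network for Image Super-Resolution)
The diagram of a multilayer perceptron with synthetic gradients scores high on clarity:
MLP with synthetic gradients (see the paper: Decoupled Neural Interfaces using Synthetic Gradients)
New results appear every day. The following one is fresh, Google's neural machine translation system:

Something completely different
The depictions in the Neural Network ZOO (an article describing neural network architectures, also compiled by 机器之心) are very simple, but many of them are more show than substance, for example LSM, ESN and ELM. They look like perceptrons that are not fully connected, but they are supposed to represent a liquid state machine, an echo state network and an extreme learning machine. How does an LSM differ from an ESN? Easy: the LSM has green neurons with triangles. And how does an ESN differ from an ELM? Both have blue neurons. Seriously, although they are similar, an ESN is a recurrent network and an ELM is not, and a difference like this should be visible in an architecture diagram.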
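The closing distinction, that an ESN is recurrent while an ELM is not, can be made concrete with a small sketch. This is an illustration under simplifying assumptions, not code from the article: both models are reduced to a fixed random hidden layer with `tanh` units, the trained output layer is left out, and the names `elm_hidden`, `esn_step`, `w_in` and `w_res` are made up for the example. The only difference shown is that the ESN state also depends on its own previous value.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.5):
    """Fixed (untrained) random weight matrix, stored as a list of rows."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def elm_hidden(w_in, x):
    """ELM hidden layer: feedforward, depends on the current input only."""
    return [math.tanh(a) for a in matvec(w_in, x)]


def esn_step(w_in, w_res, h_prev, x):
    """ESN reservoir update: the previous state feeds back into the new state."""
    from_input = matvec(w_in, x)
    from_state = matvec(w_res, h_prev)
    return [math.tanh(a + b) for a, b in zip(from_input, from_state)]


rng = random.Random(0)
n_in, n_hidden = 2, 4
w_in = make_matrix(n_hidden, n_in, rng)       # input weights, shared by both models
w_res = make_matrix(n_hidden, n_hidden, rng)  # recurrent weights, ESN only

# The first and the last input are identical on purpose.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

elm_states = [elm_hidden(w_in, x) for x in inputs]

h = [0.0] * n_hidden
esn_states = []
for x in inputs:
    h = esn_step(w_in, w_res, h, x)
    esn_states.append(h)

# ELM: identical inputs give identical hidden activations.
print("ELM repeats:", elm_states[0] == elm_states[2])
# ESN: the state for the repeated input differs because it carries history.
print("ESN repeats:", esn_states[0] == esn_states[2])
```

In both model families only the output (readout) weights are normally trained, so the memory of past inputs in the recurrent state is what separates the two.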