4 Handwriting Prediction
To test whether the prediction network could also be used to generate convincing real-valued sequences, we applied it to online handwriting data (online in this context means that the writing is recorded as a sequence of pen-tip locations, as opposed to offline handwriting, where only the page images are available). Online handwriting is an attractive choice for sequence generation due to its low dimensionality (two real numbers per data point) and ease of visualisation.
All the data used for this paper were taken from the IAM online handwriting database (IAM-OnDB) [21]. IAM-OnDB consists of handwritten lines collected from 221 different writers using a ‘smart whiteboard’. The writers were asked to write forms from the Lancaster-Oslo-Bergen text corpus [19], and the position of their pen was tracked using an infra-red device in the corner of the board. Samples from the training data are shown in Fig. 9. The original input data consists of the x and y pen co-ordinates and the points in the sequence when the pen is lifted off the whiteboard. Recording errors in the x, y data were corrected by interpolating to fill in missing readings and by removing steps whose length exceeded a certain threshold. Beyond that, no preprocessing was used, and the network was trained to predict the x, y co-ordinates and the end-of-stroke markers one point at a time. This contrasts with most approaches to handwriting recognition and synthesis, which rely on sophisticated preprocessing and feature-extraction techniques. We eschewed such techniques because they tend to reduce the variation in the data (e.g. by normalising the character size, slant, skew and so on) which we wanted the network to model. Predicting the pen traces one point at a time gives the network maximum flexibility to invent novel handwriting, but also requires a lot of memory, with the average letter occupying more than 25 timesteps and the average line occupying around 700. Predicting delayed strokes (such as dots for ‘i’s or crosses for ‘t’s that are added after the rest of the word has been written) is especially demanding.
IAM-OnDB is divided into a training set, two validation sets and a test set, containing respectively 5364, 1438, 1518 and 3859 handwritten lines taken from 775, 192, 216 and 544 forms. For our experiments, each line was treated as a separate sequence (meaning that possible dependencies between successive lines were ignored). In order to maximise the amount of training data, we used the training set, test set and the larger of the validation sets for training and the smaller validation set for early-stopping.
Figure 9: Training samples from the IAM online handwriting database. Notice the wide range of writing styles, the variation in line angle and character sizes, and the writing and recording errors, such as the scribbled out letters in the first line and the repeated word in the final line.
The lack of an independent test set means that the recorded results may be somewhat overfit on the validation set; however, the validation results are of secondary importance, since no benchmark results exist and the main goal was to generate convincing-looking handwriting. The principal challenge in applying the prediction network to online handwriting data was determining a predictive distribution suitable for real-valued inputs. The following section describes how this was done.
4.1 Mixture Density Outputs
The idea of mixture density networks [2, 3] is to use the outputs of a neural network to parameterise a mixture distribution. A subset of the outputs is used to define the mixture weights, while the remaining outputs are used to parameterise the individual mixture components. The mixture weight outputs are normalised with a softmax function to ensure they form a valid discrete distribution, and the other outputs are passed through suitable functions to keep their values within a meaningful range (for example the exponential function is typically applied to outputs used as scale parameters, which must be positive).
Mixture density networks are trained by maximising the log probability density of the targets under the induced distributions. Note that the densities are normalised (up to a fixed constant) and are therefore straightforward to differentiate and to draw unbiased samples from, in contrast with restricted Boltzmann machines [14] and other undirected models. Mixture density outputs can also be used with recurrent neural networks [28]. In this case the output distribution is conditioned not only on the current input, but on the history of previous inputs. Intuitively, the number of components is the number of choices the network has for the next output given the inputs so far.
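As a concrete illustration, the following minimal numpy sketch maps raw network outputs to the parameters of a univariate Gaussian mixture and draws an unbiased sample from it. The output layout and names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch of a mixture density output layer with M univariate Gaussian
# components. Layout assumption: y_hat = [pi_hat (M), mu (M), sigma_hat (M)].
def mixture_params(y_hat, M):
    pi_hat, mu, sigma_hat = np.split(y_hat, [M, 2 * M])
    pi = np.exp(pi_hat - pi_hat.max())
    pi /= pi.sum()              # softmax: a valid discrete distribution over components
    sigma = np.exp(sigma_hat)   # exponential keeps the scale parameters positive
    return pi, mu, sigma

def sample(y_hat, M, rng=np.random.default_rng()):
    pi, mu, sigma = mixture_params(y_hat, M)
    j = rng.choice(M, p=pi)             # pick a component from the mixture weights...
    return rng.normal(mu[j], sigma[j])  # ...then draw an unbiased sample from it
```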
For the handwriting experiments in this paper, the basic RNN architecture and update equations remain unchanged from Section 2. Each input vector $x_t$ consists of a real-valued pair $x_1, x_2$ that defines the pen offset from the previous input, along with a binary $x_3$ that has value 1 if the vector ends a stroke (that is, if the pen was lifted off the board before the next vector was recorded) and value 0 otherwise. A mixture of bivariate Gaussians was used to predict $x_1$ and $x_2$, while a Bernoulli distribution was used for $x_3$. Each output vector $y_t$ therefore consists of the end-of-stroke probability $e$, along with a set of means $\mu^j$, standard deviations $\sigma^j$, correlations $\rho^j$ and mixture weights $\pi^j$ for the $M$ mixture components. That is,
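(The equations below are a reconstruction from the definitions above; hats denote the raw outputs of the network.)

$$y_t = \left(e_t,\ \{\pi_t^j, \mu_t^j, \sigma_t^j, \rho_t^j\}_{j=1}^{M}\right)$$

$$e_t = \frac{1}{1 + \exp(\hat{e}_t)}, \qquad \pi_t^j = \frac{\exp(\hat{\pi}_t^j)}{\sum_{j'=1}^{M} \exp(\hat{\pi}_t^{j'})}, \qquad \mu_t^j = \hat{\mu}_t^j, \qquad \sigma_t^j = \exp(\hat{\sigma}_t^j), \qquad \rho_t^j = \tanh(\hat{\rho}_t^j)$$

so that the mixture weights form a valid discrete distribution, the standard deviations are positive and the correlations lie in $(-1, 1)$. The predictive distribution for the next input is then

$$\Pr(x_{t+1} \mid y_t) = \sum_{j=1}^{M} \pi_t^j\, \mathcal{N}(x_{t+1} \mid \mu_t^j, \sigma_t^j, \rho_t^j) \times \begin{cases} e_t & \text{if } (x_{t+1})_3 = 1 \\ 1 - e_t & \text{otherwise.} \end{cases}$$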
This can be substituted into Eq. (6) to determine the sequence loss (up to a constant that depends only on the quantisation of the data and does not influence network training):
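$$\mathcal{L}(\mathbf{x}) = -\sum_{t=1}^{T} \log\!\left(\sum_{j=1}^{M} \pi_t^j\, \mathcal{N}(x_{t+1} \mid \mu_t^j, \sigma_t^j, \rho_t^j)\right) - \sum_{t=1}^{T} \begin{cases} \log e_t & \text{if } (x_{t+1})_3 = 1 \\ \log(1 - e_t) & \text{otherwise.} \end{cases}$$

(This equation is a reconstruction consistent with the mixture outputs defined above.)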
Figure 10: Mixture density outputs for handwriting prediction. The top heatmap shows the sequence of probability distributions for the predicted pen locations as the word ‘under’ is written. The densities for successive predictions are added together, giving high values where the distributions overlap.
Two types of prediction are visible from the density map: the small blobs that spell out the letters are the predictions as the strokes are being written, the three large blobs are the predictions at the ends of the strokes for the first point in the next stroke. The end-of-stroke predictions have much higher variance because the pen position was not recorded when it was off the whiteboard, and hence there may be a large distance between the end of one stroke and the start of the next.
The bottom heatmap shows the mixture component weights during the same sequence. The stroke ends are also visible here, with the most active components switching off in three places, and other components switching on: evidently end-of-stroke predictions use a different set of mixture components from in-stroke predictions.
4.2 Experiments
Each point in the data sequences consisted of three numbers: the x and y offset from the previous point, and the binary end-of-stroke feature. The network input layer was therefore size 3. The co-ordinate offsets were normalised to mean 0, std. dev. 1 over the training set. 20 mixture components were used to model the offsets, giving a total of 120 mixture parameters per timestep (20 weights, 40 means, 40 standard deviations and 20 correlations). A further parameter was used to model the end-of-stroke probability, giving an output layer of size 121. Two network architectures were compared for the hidden layers: one with three hidden layers, each consisting of 400 LSTM cells, and one with a single hidden layer of 900 LSTM cells. Both networks had around 3.4M weights. The three layer network was retrained with adaptive weight noise [8], with all std. devs. initialised to 0.075. Training with fixed variance weight noise proved ineffective, probably because it prevented the mixture density layer from using precisely specified weights.
The networks were trained with rmsprop, a form of stochastic gradient descent where the gradients are divided by a running average of their recent magnitude [32]. Define $\epsilon_i = \frac{\partial \mathcal{L}(\mathbf{x})}{\partial w_i}$, where $w_i$ is network weight $i$. The weight update equations were:
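(A reconstruction; $n_i$ and $g_i$ are running averages of the squared gradient and the gradient respectively, and the decay, momentum, learning-rate and damping constants are assumed here to be $0.95$, $0.9$, $10^{-4}$ and $10^{-4}$.)

$$n_i = 0.95\, n_i + 0.05\, \epsilon_i^2$$
$$g_i = 0.95\, g_i + 0.05\, \epsilon_i$$
$$\Delta_i = 0.9\, \Delta_i - 10^{-4}\, \frac{\epsilon_i}{\sqrt{n_i - g_i^2 + 10^{-4}}}$$
$$w_i = w_i + \Delta_i$$

Dividing each gradient by $\sqrt{n_i - g_i^2 + 10^{-4}}$ normalises it by a running estimate of its standard deviation, making the step sizes insensitive to the raw gradient magnitudes.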
The output derivatives $\partial \mathcal{L}(\mathbf{x}) / \partial \hat{y}_t$ were clipped in the range $[-100, 100]$, and the LSTM derivatives were clipped in the range $[-10, 10]$. Clipping the output gradients proved vital for numerical stability; even so, the networks sometimes had numerical problems late on in training, after they had started overfitting on the training data.
Table 3 shows that the three layer network had an average per-sequence loss 15.3 nats lower than the one layer net. However, the sum-squared error was slightly lower for the single layer network. The use of adaptive weight noise reduced the loss by another 16.7 nats relative to the unregularised three layer network, but did not significantly change the sum-squared error. The adaptive weight noise network appeared to generate the best samples.
4.3 Samples
Fig. 11 shows handwriting samples generated by the prediction network. The network has clearly learned to model strokes, letters and even short words (especially common ones such as ‘of’ and ‘the’). It also appears to have learned a basic character-level language model, since the words it invents (‘eald’, ‘bryoes’, ‘lenrest’) look somewhat plausible in English. Given that the average character occupies more than 25 timesteps, this again demonstrates the network’s ability to generate coherent long-range structures.
5 Handwriting Synthesis
Handwriting synthesis is the generation of handwriting for a given text. Clearly the prediction networks we have described so far are unable to do this, since there is no way to constrain which letters the network writes. This section describes an augmentation that allows a prediction network to generate data sequences conditioned on some high-level annotation sequence (a character string, in the case of handwriting synthesis). The resulting sequences are sufficiently convincing that they often cannot be distinguished from real handwriting. Furthermore, this realism is achieved without sacrificing the diversity in writing style demonstrated in the previous section.
The main challenge in conditioning the predictions on the text is that the two sequences are of very different lengths (the pen trace being on average twenty-five times as long as the text), and the alignment between them is unknown until the data is generated. This is because the number of co-ordinates used to write each character varies greatly according to style, size, pen speed etc. One neural network model able to make sequential predictions based on two sequences of different length and unknown alignment is the RNN transducer [9]. However, preliminary experiments on handwriting synthesis with RNN transducers were not encouraging. A possible explanation is that the transducer uses two separate RNNs to process the two sequences, then combines their outputs to make decisions, when it is usually more desirable to make all the information available to a single network. This work proposes an alternative model, where a ‘soft window’ is convolved with the text string and fed in as an extra input to the prediction network. The parameters of the window are output by the network at the same time as it makes the predictions, so that it dynamically determines an alignment between the text and the pen locations. Put simply, it learns to decide which character to write next.
5.1 Synthesis Network
Fig. 12 illustrates the network architecture used for handwriting synthesis. As with the prediction network, the hidden layers are stacked on top of each other, each feeding up to the layer above, and there are skip connections from the inputs to all hidden layers and from all hidden layers to the outputs. The difference is the added input from the character sequence, mediated by the window layer.
Given a length $U$ character sequence $\mathbf{c}$ and a length $T$ data sequence $\mathbf{x}$, the soft window $w_t$ into $\mathbf{c}$ at timestep $t$ ($1 \le t \le T$) is defined by the following discrete convolution with a mixture of $K$ Gaussian functions:
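$$\phi(t, u) = \sum_{k=1}^{K} \alpha_t^k \exp\left(-\beta_t^k \left(\kappa_t^k - u\right)^2\right), \qquad w_t = \sum_{u=1}^{U} \phi(t, u)\, c_u$$

(Reconstructed from the description above.) Here $\phi(t, u)$ is the window weight of character $c_u$ at timestep $t$: intuitively, $\kappa_t$ controls the location of the window, $\beta_t$ its width and $\alpha_t$ its importance. The window parameters are computed from the first hidden layer as

$$(\hat{\alpha}_t, \hat{\beta}_t, \hat{\kappa}_t) = W_{h^1 p}\, h_t^1 + b_p, \qquad \alpha_t = \exp(\hat{\alpha}_t), \quad \beta_t = \exp(\hat{\beta}_t), \quad \kappa_t = \kappa_{t-1} + \exp(\hat{\kappa}_t).$$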
Note that the location parameters $\kappa_t$ are defined as offsets from the previous locations $\kappa_{t-1}$, and that the size of the offset is constrained to be greater than zero. Intuitively, this means that the network learns how far to slide each window at each step, rather than an absolute location. Using offsets was essential to getting the network to align the text with the pen trace.
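For concreteness, a minimal numpy sketch of one window step under the parameterisation above; the variable names and the one-hot encoding of the character sequence are illustrative assumptions.

```python
import numpy as np

# One step of the soft window. c_onehot: (U, V) one-hot character sequence;
# raw: (3K,) raw window outputs from the first hidden layer, laid out as
# [alpha_hat, beta_hat, kappa_hat]; kappa_prev: (K,) previous window locations.
def soft_window(c_onehot, raw, kappa_prev):
    K = kappa_prev.shape[0]
    alpha = np.exp(raw[:K])                   # importance of each Gaussian, > 0
    beta = np.exp(raw[K:2 * K])               # width parameter, > 0
    kappa = kappa_prev + np.exp(raw[2 * K:])  # location: positive offset from previous
    u = np.arange(1, c_onehot.shape[0] + 1)   # character positions 1..U
    phi = (alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(0)
    w_t = phi @ c_onehot                      # soft window vector, size V
    return w_t, kappa, phi
```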
The $w_t$ vectors are passed to the second and third hidden layers at time $t$, and the first hidden layer at time $t+1$ (to avoid creating a cycle in the processing graph). The update equations for the hidden layers are
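(a reconstruction consistent with the skip connections described above, with $\mathcal{H}$ the LSTM function of Section 2):

$$h_t^1 = \mathcal{H}\left(W_{ih^1} x_t + W_{h^1 h^1} h_{t-1}^1 + W_{wh^1} w_{t-1} + b_h^1\right)$$
$$h_t^n = \mathcal{H}\left(W_{ih^n} x_t + W_{h^{n-1} h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + W_{wh^n} w_t + b_h^n\right), \qquad n = 2, 3$$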
5.2 Experiments
The synthesis network was applied to the same input data as the handwriting prediction network in the previous section. The character-level transcriptions from the IAM-OnDB were now used to define the character sequences $\mathbf{c}$. The full transcriptions contain 80 distinct characters (capital letters, lower case letters, digits, and punctuation). However, we used only a subset of 57, with all digits and most of the punctuation characters replaced with a generic ‘non-letter’ label.
The network architecture was as similar as possible to the best prediction network: three hidden layers of 400 LSTM cells each, 20 bivariate Gaussian mixture components at the output layer and a size 3 input layer. The character sequence was encoded with one-hot vectors, and hence the window vectors were size 57. A mixture of 10 Gaussian functions was used for the window parameters, requiring a size 30 parameter vector. The total number of weights was increased to approximately 3.7M.
The network was trained with rmsprop, using the same parameters as in the previous section. The network was retrained with adaptive weight noise, initial standard deviation 0.075, and the output and LSTM gradients were again clipped in the range [−100, 100] and [−10, 10] respectively.
Table 4 shows that adaptive weight noise gave a considerable improvement in log-loss (around 31.3 nats) but no significant change in sum-squared error. The regularised network appears to generate slightly more realistic sequences, although the difference is hard to discern by eye. Both networks performed considerably better than the best prediction network. In particular the sum-squared error was reduced by 44%. This is likely due in large part to the improved predictions at the ends of strokes, where the error is largest.
5.3 Unbiased Sampling
Given $\mathbf{c}$, an unbiased sample can be picked from $\Pr(\mathbf{x}|\mathbf{c})$ by iteratively drawing $x_{t+1}$ from $\Pr(x_{t+1}|y_t)$, just as for the prediction network. The only difference is that we must also decide when the synthesis network has finished writing the text and should stop making any future decisions. To do this, we use the following heuristic: as soon as $\phi(t, U+1) > \phi(t, u)\ \forall\, 1 \le u \le U$, the current input $x_t$ is defined as the end of the sequence and sampling ends. Examples of unbiased synthesis samples are shown in Fig. 15. These and all subsequent figures were generated using the synthesis network retrained with adaptive weight noise. Notice how stylistic traits, such as character size, slant, cursiveness etc. vary widely between the samples, but remain more-or-less consistent within them. This suggests that the network identifies the traits early on in the sequence, then remembers them until the end. By looking through enough samples for a given text, it appears to be possible to find virtually any combination of stylistic traits, which suggests that the network models them independently both from each other and from the text.
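A minimal sketch of the termination check, assuming `phi` holds the window weights $\phi(t, u)$ evaluated at positions $1 \le u \le U+1$ (the name and layout are assumptions):

```python
# Sampling stops once the window has moved past the final character, i.e. when
# phi(t, U+1) exceeds phi(t, u) for every real character position u.
def finished_writing(phi):
    return phi[-1] > phi[:-1].max()
```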
‘Blind taste tests’ carried out by the author during presentations suggest that at least some unbiased samples cannot be distinguished from real handwriting by the human eye. Nonetheless the network does make mistakes we would not expect a human writer to make, often involving missing, confused or garbled letters; this suggests that the network sometimes has trouble determining the alignment between the characters and the trace. The number of mistakes increases markedly when less common words or phrases are included in the character sequence. Presumably this is because the network learns an implicit character-level language model from the training set that gets confused when rare or unknown transitions occur.
5.4 Biased Sampling
One problem with unbiased samples is that they tend to be difficult to read (partly because real handwriting is difficult to read, and partly because the network is an imperfect model). Intuitively, we would expect the network to give higher probability to good handwriting because it tends to be smoother and more predictable than bad handwriting. If this is true, we should aim to output the more probable elements of $\Pr(\mathbf{x}|\mathbf{c})$ if we want the samples to be easier to read. A principled search for high probability samples could lead to a difficult inference problem, as the probability of every output depends on all previous outputs. However a simple heuristic, where the sampler is biased towards more probable predictions at each step independently, generally gives good results. Define the probability bias $b$ as a real number greater than or equal to zero. Before drawing a sample from $\Pr(x_{t+1}|y_t)$, each standard deviation $\sigma_t^j$ in the Gaussian mixture is recalculated from Eq. (21) to
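$$\sigma_t^j = \exp\left(\hat{\sigma}_t^j - b\right)$$

and, to sharpen the choice of component as well, each mixture weight is recalculated to

$$\pi_t^j = \frac{\exp\left(\hat{\pi}_t^j (1 + b)\right)}{\sum_{j'=1}^{M} \exp\left(\hat{\pi}_t^{j'} (1 + b)\right)}.$$

(Both equations are reconstructions consistent with the surrounding text.)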
This artificially reduces the variance in both the choice of component from the mixture, and in the distribution of the component itself. When $b = 0$ unbiased sampling is recovered, and as $b \to \infty$ the variance in the sampling disappears and the network always outputs the mode of the most probable component in the mixture (which is not necessarily the mode of the mixture, but at least a reasonable approximation). Fig. 16 shows the effect of progressively increasing the bias, and Fig. 17 shows samples generated with a low bias for the same texts as Fig. 15.
5.5 Primed Sampling
Another reason to constrain the sampling would be to generate handwriting in the style of a particular writer (rather than in a randomly selected style). The easiest way to do this would be to retrain the network on that writer’s data only. But even without retraining, it is possible to mimic a particular style by ‘priming’ the network with a real sequence, then generating an extension with the real sequence still in the network’s memory. This can be achieved for a real $\mathbf{x}$, $\mathbf{c}$ and a synthesis character string $\mathbf{s}$ by setting the character sequence to $\mathbf{c}' = \mathbf{c} + \mathbf{s}$ and clamping the data inputs to $\mathbf{x}$ for the first $T$ timesteps, then sampling as usual until the sequence ends. Examples of primed samples are shown in Figs. 18 and 19. The fact that priming works proves that the network is able to remember stylistic features identified earlier on in the sequence. This technique appears to work better for sequences in the training data than those the network has never seen.
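A hedged sketch of the priming procedure; the network interface (`reset_state`, `step`, `sample_step`, `finished`) and the `encode` helper are hypothetical names, not an API from the paper.

```python
# Primed sampling: clamp the inputs to a real trace, then sample an extension.
def primed_sample(net, prime_x, prime_text, synth_text):
    net.reset_state()
    c = encode(prime_text + synth_text)  # character sequence c' = c + s
    for x_t in prime_x:                  # clamp data inputs to the real trace x
        net.step(x_t, c)                 # for the first T timesteps
    x_t = prime_x[-1]
    out = []
    while not net.finished(c):           # then sample as usual until the
        x_t = net.sample_step(x_t, c)    # termination heuristic fires
        out.append(x_t)
    return out
```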
Primed sampling and reduced variance sampling can also be combined. As shown in Figs. 20 and 21 this tends to produce samples in a ‘cleaned up’ version of the priming style, with overall stylistic traits such as slant and cursiveness retained, but the strokes appearing smoother and more regular. A possible application would be the artificial enhancement of poor handwriting.
6 Conclusions and Future Work
This paper has demonstrated the ability of Long Short-Term Memory recurrent neural networks to generate both discrete and real-valued sequences with complex, long-range structure using next-step prediction. It has also introduced a novel convolutional mechanism that allows a recurrent network to condition its predictions on an auxiliary annotation sequence, and used this approach to synthesise diverse and realistic samples of online handwriting. Furthermore, it has shown how these samples can be biased towards greater legibility, and how they can be modelled on the style of a particular writer.
Several directions for future work suggest themselves. One is the application of the network to speech synthesis, which is likely to be more challenging than handwriting synthesis due to the greater dimensionality of the data points. Another is to gain a better insight into the internal representation of the data, and to use this to manipulate the sample distribution directly. It would also be interesting to develop a mechanism to automatically extract high-level annotations from sequence data. In the case of handwriting, this could allow for more nuanced annotations than just text, for example stylistic features, different forms of the same letter, information about stroke order and so on.
Acknowledgements
Thanks to Yichuan Tang, Ilya Sutskever, Navdeep Jaitly, Geoffrey Hinton and other colleagues at the University of Toronto for numerous useful comments and suggestions. This work was supported by a Global Scholarship from the Canadian Institute for Advanced Research.
References
[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
[2] C. Bishop. Mixture density networks. Technical report, 1994.
[3] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.
[4] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML’12), 2012.
[5] J. G. Cleary, Ian, and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32:396–402, 1984.
[6] D. Eck and J. Schmidhuber. A first look at music composition using LSTM recurrent neural networks. Technical report, IDSIA USI-SUPSI Istituto Dalle Molle.
[7] F. Gers, N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115–143, 2002.
[8] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, volume 24, pages 2348–2356. 2011.
[9] A. Graves. Sequence transduction with recurrent neural networks. In ICML Representation Learning Workshop, 2012.
[10] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, 2013.
[11] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:602–610, 2005.
[12] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, volume 21, 2008.
[13] P. D. Grünwald. The Minimum Description Length Principle (Adaptive Computation and Machine Learning). The MIT Press, 2007.
[14] G. Hinton. A Practical Guide to Training Restricted Boltzmann Machines. Technical report, 2010.
[15] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-term Dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. 2001.
[16] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[17] M. Hutter. The Human Knowledge Compression Contest, 2012.
[18] K.-C. Jim, C. Giles, and B. Horne. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6):1424–1438, 1996.
[19] S. Johansson, R. Atwell, R. Garside, and G. Leech. The tagged LOB corpus: user’s manual. Norwegian Computing Centre for the Humanities, 1986.
[20] B. Knoll and N. de Freitas. A machine learning perspective on predictive coding with paq. CoRR, abs/1108.3298, 2011.
[21] M. Liwicki and H. Bunke. IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. In Proc. 8th Int. Conf. on Document Analysis and Recognition, volume 2, pages 956–961, 2005.
[22] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[23] T. Mikolov. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
[24] T. Mikolov, I. Sutskever, A. Deoras, H. Le, S. Kombrink, and J. Cernocky. Subword language modeling with neural networks. Technical report, Unpublished Manuscript, 2012.
[25] A. Mnih and G. Hinton. A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems, volume 21, 2008.
[26] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758, 2012.
[27] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proc. ICASSP, 2013.
[28] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. pages 589–595. The MIT Press, 1999.
[29] I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. pages 1601–1608, 2008.
[30] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[31] G. W. Taylor and G. E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In Proc. 26th Annual International Conference on Machine Learning, pages 1025–1032, 2009.
[32] T. Tieleman and G. Hinton. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude, 2012.
[33] R. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications, pages 433–486. 1995.