Interspeech 2017 | Self-adaptive Speech Recognition Technology

Summary: Self-adaptive speech recognition improves the performance of ordinary speech recognition systems. In this article, let's learn about the most recent developments in this field presented at Interspeech 2017.


Introduction

Interspeech 2017 was held from August 20 to 24, 2017 in Stockholm, Sweden. Participants from research institutes, universities, and renowned companies used this platform to share their newest technologies, systems, and products. A high-profile team from Alibaba Group, a Diamond Sponsor of the conference, also joined the event. Starting October 25, the Alibaba iDST Voice Team and the Alibaba Cloud Community will jointly host a series of information-sharing sessions on voice technology to share the technological progress reported at Interspeech 2017. This article covers the session on self-adaptive speech recognition technology.

1. Self-adaptive Speech Recognition Technology

Self-adaptive speech recognition technology aims to improve the performance of a speech recognition system for a specific speaker or domain. Its purpose is to eliminate the drop in recognition performance caused by mismatches between the speakers or domains of the training and test sets. These variations mainly include phonetic differences and differences arising from pronunciation habits. Self-adaptive technology is used in products built on speech recognition and in speech recognition tailored to VIP customers.

[Image 1]

This variability can degrade the performance of speaker- or domain-independent recognition systems. Training a dedicated recognition system for a specific speaker or domain requires collecting a large amount of data, which is costly. Self-adaptive speech recognition technology offers a good trade-off: it requires less data while still delivering good performance.

Self-adaptive speech recognition technologies can be divided into two categories according to the space in which adaptation takes place: feature-space adaptation and model-space adaptation. Feature-space adaptation transforms the speaker- or domain-dependent features so that they match a speaker- or domain-independent model. Model-space adaptation transforms the speaker- or domain-independent model so that it matches the speaker- or domain-dependent features. In short, both types of algorithms aim to make the features and the model match each other.

2. Interspeech 2017 Paper Reading

2.1 Paper 1

[Image 2]
This paper, titled "Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition", was presented by researchers from the Université de Montréal. Its main idea is to turn the context-independent scaling and shifting parameters of layer normalization into context-dependent ones, dynamically generating them from contextual information. This is a form of model-space adaptation. The main innovation is that it requires neither a separate adaptation stage (i.e., adapting on target-domain data to learn knowledge of the target domain) nor an auxiliary speaker-dependent feature such as an i-vector.

[Image 3]

The formula for the proposed DLN is shown in the diagram on the upper right. First, the hidden-layer (or input) vectors h_t^(l-1) from the previous layer are summarized over the T frames of the sequence (the minibatch) to obtain a summarization vector a^l. A linear transformation (weight matrix plus bias) of a^l then dynamically generates the scaling (α_g^l) and shifting (β_g^l) parameters.
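To make the mechanism concrete, here is a minimal sketch of a DLN layer, assuming PyTorch; the module name, layer sizes, and initialization are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch of Dynamic Layer Normalization (DLN), assuming PyTorch.
# The summarization a^l and the dynamically generated scale/shift follow
# the description above; all names and sizes are illustrative.
import torch
import torch.nn as nn

class DynamicLayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-5):
        super().__init__()
        self.eps = eps
        # Linear maps that turn the utterance summarization a^l into
        # the scaling (alpha_g^l) and shifting (beta_g^l) parameters.
        self.to_alpha = nn.Linear(hidden_size, hidden_size)
        self.to_beta = nn.Linear(hidden_size, hidden_size)

    def forward(self, h):
        # h: (T, hidden_size) -- hidden activations of one utterance.
        a = h.mean(dim=0, keepdim=True)        # summarization a^l over the T frames
        alpha = self.to_alpha(a)               # dynamic scaling alpha_g^l
        beta = self.to_beta(a)                 # dynamic shifting beta_g^l
        mean = h.mean(dim=-1, keepdim=True)    # standard layer-norm statistics
        std = h.std(dim=-1, keepdim=True)
        return alpha * (h - mean) / (std + self.eps) + beta
```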

Meanwhile, on top of the usual cross-entropy (CE) training, a penalty term (L_var in the diagram above) is added to increase the variance of the summarization across sentences so that a more discriminative summarization is extracted.
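A hedged sketch of such a variance penalty is shown below, again assuming PyTorch; the function name and the weight `lam` are illustrative rather than the paper's exact formulation.

```python
# Sketch of a variance penalty L_var: encourage per-utterance summarizations
# to differ from each other by penalizing low variance across the minibatch.
def variance_penalty(summaries, lam=1.0):
    # summaries: (batch, hidden_size) -- one summarization vector per utterance
    var = summaries.var(dim=0)        # variance of each dimension across utterances
    return -lam * var.mean()          # negative sign: larger variance lowers the loss
```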

[Image 4]

The paper then reports experiments on the 81-hour WSJ dataset and the 212-hour TED dataset, which contain 283 and 5,076 speakers, respectively.

First, LN and DLN are compared on the WSJ dataset in terms of frame error rate (FER) and word error rate (WER) on the development and test sets. DLN outperforms LN on every metric except the WER of the test set. The paper argues that the number of speakers in WSJ is small, so the variability among sentences is not pronounced, and that WSJ was recorded in a quiet environment, so the sentences are smooth and steady and DLN has little to exploit.

As illustrated in the second diagram, on the TED dataset DLN outperforms LN on all four metrics. Comparing the WSJ and TED results, the TED dataset shows a larger gain because it has more speakers and sentences than WSJ, so the variability is more pronounced. From this we can see that the benefit of dynamic layer normalization is tied to the variability of the sentences, and that DLN is better than LN overall.

2.2 Paper 2

[Image 5]

This paper, titled "Large-Scale Domain Adaptation via Teacher-Student Learning", is from Microsoft. Its main idea is to apply a teacher/student structure to domain adaptation. The method does not require labeled data from the target domain, but it does require parallel source- and target-domain data for training. Its innovation and value lie in using a very large amount of unlabeled data, together with the output of the teacher network, to further improve the student model.

[Image 6]

In the diagram above, teacher/student is abbreviated as T/S. The flow chart of T/S training is shown on the upper right. In Figure 1, the teacher network is on the left and the student network is on the right; their output posterior probabilities are denoted P_T and P_S, respectively.

The training process of the student network is as follows (a minimal sketch in code follows the list):

  • First, clone the initial student network from the teacher network.
  • Then, feed the student-domain data and the teacher-domain data through the corresponding networks to compute their posterior probabilities (P_T and P_S).
  • Finally, use these two posterior probabilities to compute the error signal and backpropagate through the student network.
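Below is a minimal sketch of one T/S training step under these assumptions, written in PyTorch; the function and variable names (`ts_step`, `clean_feats`, `noisy_feats`) are illustrative and not taken from the paper.

```python
# Sketch of teacher/student (T/S) domain adaptation, assuming PyTorch.
# `teacher` and `student` are acoustic models with identical output layers;
# `clean_feats` / `noisy_feats` are parallel source- and target-domain features.
import torch
import torch.nn.functional as F

def ts_step(teacher, student, clean_feats, noisy_feats, optimizer):
    with torch.no_grad():
        p_teacher = F.softmax(teacher(clean_feats), dim=-1)       # soft labels P_T
    log_p_student = F.log_softmax(student(noisy_feats), dim=-1)   # log P_S
    # Cross entropy between teacher and student posteriors (equivalent to
    # KL divergence up to a constant that does not depend on the student).
    loss = -(p_teacher * log_p_student).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```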

[Image 7]

The experiments in this paper are based on 375 hours of English Cortana data. Different test sets are used for the different domains.

For the clean/noisy-environment experiment, the Cortana test set is used. First, the teacher network is tested, and the results show that performance on noisy speech is much worse than on clean speech (18.8% vs. 15.62%). If the teacher network is retrained on simulated noisy data, performance on noisy speech improves (17.34%), which is equivalent to training the student network with hard labels. Lines 4 and 5 use the T/S algorithm; with the same amount of data, soft labels outperform hard labels (16.66% vs. 17.34%). Increasing the student training data to 3,400 hours further improves performance (16.11%).

[Image 8]

For the adult/child experiment, the female and child data are first removed from the 375 hours to train an adult-male teacher model. The results show that recognition performance on children's speech is much worse (39.05 for children vs. 34.16 for adults). As in the clean/noisy experiment, applying the T/S algorithm further improves performance, and expanding the data is also beneficial.

2.3 Paper 3

[Image 9]

This paper, titled "Learning Factorized Transforms for Unsupervised Adaptation of LSTM-RNN Acoustic Models", is from the Hong Kong University of Science and Technology and Google. Its main idea and innovation is to apply the Factorized Hidden Layer (FHL) adaptation approach to LSTM-RNN acoustic models.

[Image 10]

In the FHL adaptation algorithm, a speaker-dependent (SD) component is added to the speaker-independent network weight W to produce the speaker-dependent weight W_s. In formula (7), B(1), B(2), ..., B(i) are the basis matrices whose linear interpolation forms the SD transformation. Similarly, the network bias b can also be adapted per speaker.

However, because the basis matrices introduce a large number of parameters in practice, they are constrained to be rank-1, and formula (7) is rewritten as shown on the upper right. Since each basis matrix is rank-1, it can be expressed as a column vector γ(i) multiplied by a row vector ψ(i)ᵀ. Meanwhile, the interpolation vector is expressed as a diagonal matrix D_s. The SD weight then becomes the product of Γ, D_s, and Ψᵀ, which makes the model easier to train.
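A hedged PyTorch sketch of such a rank-1 factorized layer is given below; the class and parameter names, the initialization, and the way v_s is supplied are assumptions made for illustration, not the paper's implementation.

```python
# Sketch of a rank-1 factorized hidden layer (FHL) weight, assuming PyTorch:
# W_s = W + Gamma @ diag(v_s) @ Psi^T, where v_s is the speaker-dependent
# interpolation vector. Dimensions and names are illustrative.
import torch
import torch.nn as nn

class FactorizedHiddenLayer(nn.Module):
    def __init__(self, in_dim, out_dim, n_bases):
        super().__init__()
        self.W = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)       # speaker-independent weight
        self.gamma = nn.Parameter(torch.randn(out_dim, n_bases) * 0.01)  # column vectors gamma(i)
        self.psi = nn.Parameter(torch.randn(in_dim, n_bases) * 0.01)     # row vectors psi(i)^T
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, v_s):
        # v_s: (n_bases,) speaker-dependent interpolation vector (D_s = diag(v_s))
        W_s = self.W + self.gamma @ torch.diag(v_s) @ self.psi.t()
        return x @ W_s.t() + self.bias
```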

[Image 11]

[Image 12]

This paper also introduces speaker-dependent scaling, applied to the activation values of the LSTM memory cells. As shown in the formula, speaker-dependent scaling is achieved by learning a scaling vector z_s for each speaker. However, this algorithm has a problem: the dimension of z_s is tied to the layer width of the network, so many parameters are required per speaker. A subspace scaling approach is therefore proposed, in which z_s is controlled by a low-dimensional vector v_s of fixed dimension; since the dimension of v_s is much smaller than that of z_s, the number of speaker-dependent parameters is significantly reduced.
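The following is a minimal sketch of subspace speaker-dependent scaling, assuming PyTorch; the sigmoid used to keep the scale bounded and all names are assumptions of this sketch, not necessarily the paper's exact parameterization.

```python
# Sketch of subspace speaker-dependent scaling of LSTM cell activations,
# assuming PyTorch. A low-dimensional speaker vector v_s generates the
# full scaling vector z_s through a learned projection.
import torch
import torch.nn as nn

class SubspaceScaling(nn.Module):
    def __init__(self, cell_dim, subspace_dim):
        super().__init__()
        # Projection from the low-dimensional speaker vector to the cell dimension.
        self.project = nn.Linear(subspace_dim, cell_dim)

    def forward(self, cell_activation, v_s):
        # cell_activation: (T, cell_dim), v_s: (subspace_dim,)
        z_s = torch.sigmoid(self.project(v_s))   # speaker-dependent scaling z_s (bounded)
        return cell_activation * z_s
```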

[Image 13]

The experiments in this paper are based on a 78-hour dataset. The diagram above shows the final WER obtained with the proposed algorithm. In the table, "none" means no adaptation is used, and "SD bias" means only the SD bias of the FHL is used, without the SD weight matrix. CMLLR is a classical adaptation algorithm. The proposed algorithm achieves the best performance compared with SD bias and CMLLR. The improvement on LSTM-RNN is smaller than that on DNN, indicating that adaptation of LSTM-RNNs is more difficult.

3. Conclusion

These papers from Interspeech 2017 present many interesting ideas on self-adaptive technology from various researchers. I have personally benefited a lot from them, and I hope this article gives you a better understanding of self-adaptive speech recognition technology.

4. References

[1] Kim T, Song I, Bengio Y. Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition[C]// INTERSPEECH. 2017.

[2] Li J, Seltzer M L, Wang X, et al. Large-Scale Domain Adaptation via Teacher-Student Learning[C]// INTERSPEECH. 2017.

[3] Samarakoon L, Mak B, Sim K C. Learning Factorized Transforms for Unsupervised Adaptation of LSTM-RNN Acoustic Models[C]// INTERSPEECH. 2017:744-748.
