Voiceprint Recognition System – Not Just a Powerful Authentication Tool

简介: Learn details about voiceprint recognition system and its underlying principles as a powerful authentication tool

What_s_mysterious_about_voiceprint_recognition_the_powerful_authentication_tool

Introduction

In this advanced age, when mobile Internet is the norm, people leverage social networking, online shopping and online financial transactions without the need of being physically present at places. As a result, identity authentication has become the most critical security activity in the online world. The traditional solution uses a password or a private key that you need to remember. In fact, many people prefer keeping simple passwords such as "123456" to shuttle through the Internet world. Unfortunately, this makes their online data an easy target for hackers. Traditional solutions are a risky affair as the passwords are forgotten or lost and are also prone to hacker attacks.

Are you still using the default password "admin" for your home router?
Do you know that easy-to-crack passwords are the most vulnerable link in the security realm of the Internet of Things (IoT)?

Solutions

Fortunately, we all have unique "living passwords" on our bodies, such as the fingerprints, face, voice, and eyes. They are the unique and distinctive characteristics of individuals popularly called "biometric signatures." Voice is just one way of reflecting a person's identity. In reference to the nomenclature for "fingerprint," we also call it "voiceprint."

As per the United States National Biosignature Test Center at San Jose University, "Fundamentals of Biometric Technology," below is a quick comparison of types of biometrics signatures based on various factors:

1

Comparison Between Various Biometric Signatures

Let's read about the voiceprint recognition system and its underlying principle.

About Voiceprint Recognition System

Voiceprint refers to the acoustic frequency spectrum that carries the speech information in a human voice. Like fingerprints, it has unique biometric signatures, is individual-specific, and can function as an identification method. The acoustical signal is a unidimensional continuous signal. On discretization, you will get the acoustical signal that can be processed by conventional computers.

2

Discretized Acoustical Signals Processed by Computers

Similar to the widely-used fingerprint technology on mobile phones, voiceprint recognition (also known as speaker recognition) technology is also a bio-identification technology that extracts phonetic features from the speaker's voice signals to validate the speaker's identity. Everybody has a unique voiceprint gradually formed throughout the development process of our vocal organs. No matter how remarkably similar the imitated voice can be to the original voice, their voiceprints will remain different.

The Chinese saying of "someone may not yet be here bodily, but you can already hear him/her speaking" in real life vividly describes a scene where you identify another person by the voice. This explains why your mother knows it's you before you even finish saying "hello" over the phone. This is an extraordinary ability humans have acquired through long-term evolution. With the latest technological innovations, recognition systems can quickly identify a person after listening to 8 to 10 words; it is still not feasible to identify voice with a single word. It can also distinguish if you are one of the specified 1,000 people after speaking for more than a minute. It relies on an important concept applicable to most of the biometric identification systems: 1:1 and 1: N. It also encompasses a unique concept unique for the voiceprint recognition technology: text-dependence and text-independence.

Let's learn about its principles in detail in the proceeding section.

Working Principle

1:1 Recognition System

The working model of this biometric identification system requires you to provide your identity (account) and biometric features beforehand and saves it as a template. During processing, the system compares the entered features with the stored biometric characteristics, to determine whether the two sets match. Such systems are popularly known as 1:1 recognition system (also called speaker verification).

1: N Recognition System

The working model of this biometric identification system doesn't ask for biometric features before processing. It only requires the biometric features during runtime and then compares it with all the multiple records of biometric features stored in the background to determine the right match. Such systems are popularly known as 1: N recognition system (also called speaker identification).

Figure 1 below shows a quick comparison between both the recognition systems.
3

Figure 1: Speaker Verification and Speaker Identification

Figure 2 below shows the working process of a simple voiceprint recognition system:
4

Figure 2: Working Process of a Voiceprint Recognition System

From the perspective of users' speech content, there are two types of voiceprint recognition systems, namely text-dependence and text-independence.

As their names imply, "text-dependence" refers to a system that requires the user to only say system-prompted content or content within a small allowed range, while "text-independence" does not restrict the content spoken by the user. This way, text dependence content only requires the recognition system to process the small-range acoustic differences between users. Since the content is similar, the system only needs to care about the voice differences, with relatively less difficulty. Text independence systems require a recognition system not only to consider the distinct differences between the user voices but also to process the speech differences caused by different content, with relatively higher difficulty.

At present, there is a new technology that falls between the two, popularly known as "limited text-dependence." These systems collocate some numbers or symbols at random and require users to read the corresponding content to get the voiceprint recognized. Due to this randomness, the collected voiceprints vary in content sequence every time. This feature aligns with the widely-used short random numbers (such as digital verification codes). It is useful for identity validation or, in combination with other biometric signatures such as the face, to form multiple-factor authentication systems.

Voiceprint Recognition Algorithm: the Technical Details

Let's delve a little deeper into the technical details of the voiceprint recognition algorithm. In the feature layer, the classic Mel-Frequency Cepstral Coefficients (MFCC), the Perceptual Linear Prediction (PLP), the Deep Feature, and the Power-Normalized Cepstral Coefficients (PNCC) are all outstanding acoustic features used as inputs for model learning. However, MFCC remains the most frequently used feature.

It also allows you to use multiple features by combining any of the feature or model layers. In the machine learning model layer, the iVector framework that N.Dehak proposed in 2009 still takes a dominant role. Although the deep machine learning has been in the limelight today, and the voiceprint sector cannot escape its impact, the DNN-iVector derived from the legacy UBM-iVector framework only replaces the MFCC with the DNN (or BN) used for extracting features. Besides, it uses the DNN (or BN) as a supplement of MFCC, and the back-end learning framework remains iVector.

Figure 3 demonstrates a complete training and testing process of the voiceprint recognition system.
5

Figure 3: Complete Training and Recognition Framework of Voiceprint Recognition Algorithms

We can see that the iVector model training and the channel compensation model training that follows are the most relevant links. In the feature phase, you can use the BottleNeck feature to replace or supplement the MFCC feature and input it to the iVector framework for model training, as shown in Figure 4.
6

Figure 4: Training iVector Model with the BottleNeck Feature

In the system layer, different features and models can depict the speaker's voice features from different dimensions. Coupled with effective score normalization, various subsystems can be integrated to elevate the overall system performance substantially.

Conclusion

In this blog, we dissected and learned the basics of the voice recognition system, the details about its underlying principles, and how it plays a significant role in biometric identification industry.

目录
相关文章
|
机器学习/深度学习 编解码 人工智能
Reading Notes: Human-Computer Interaction System: A Survey of Talking-Head Generation
由于人工智能的快速发展,虚拟人被广泛应用于各种行业,包括个人辅助、智能客户服务和在线教育。拟人化的数字人可以快速与人接触,并在人机交互中增强用户体验。因此,我们设计了人机交互系统框架,包括语音识别、文本到语音、对话系统和虚拟人生成。接下来,我们通过虚拟人深度生成框架对Talking-Head Generation视频生成模型进行了分类。同时,我们系统地回顾了过去五年来在有声头部视频生成方面的技术进步和趋势,强调了关键工作并总结了数据集。 对于有关于Talking-Head Generation的方法,这是一篇比较好的综述,我想着整理一下里面比较重要的部分,大概了解近几年对虚拟人工作的一些发展和
PAT (Advanced Level) Practice:1~3题
​ ✨欢迎您的订阅✨ PAT乙级专栏(更新完毕): 👉🏻【Basic Level】👈🏻 PAT甲级专栏(更新ing): 👉🏻【Advanced Level】👈🏻 ​
PAT (Advanced Level) Practice:1~3题
PAT (Advanced Level) Practice - 1129 Recommendation System(25 分)
PAT (Advanced Level) Practice - 1129 Recommendation System(25 分)
106 0
|
JavaScript 安全 前端开发
25 Things You Should Know About Developers in China
This report aims to shed light on the software developer community and research trends in China.
3980 0
笔记:The Seven Steps to Building a Successful Software Development Company
笔记:The Seven Steps to Building a Successful Software Development Company 建立成功软件公司的七个步骤,感觉说的大都是常识,不过毕竟他整理出来了,看看也挺有意思的。
1530 0
Basic Mathematics You Should Mastered
Basic Mathematics You Should Mastered 2017-08-17  21:22:40    1. Statistical distance  In statistics, probability theory, and information theory, ...