Voiceprint Recognition System – Not Just a Powerful Authentication Tool

Summary: Learn how a voiceprint recognition system works and the principles that make it a powerful authentication tool.


Introduction

In the age of the mobile Internet, people use social networks, shop online, and carry out financial transactions without being physically present. As a result, identity authentication has become one of the most critical security activities in the online world. The traditional solution relies on a password or a private key that you have to remember. In practice, many people choose simple passwords such as "123456", which makes their online data an easy target for attackers. Traditional solutions are risky because passwords can be forgotten, lost, or cracked.

Are you still using the default password "admin" for your home router?
Do you know that easy-to-crack passwords are the most vulnerable link in the security realm of the Internet of Things (IoT)?

Solutions

Fortunately, we all carry unique "living passwords" on our bodies, such as our fingerprints, face, voice, and eyes. These distinctive characteristics of an individual are known as "biometric signatures." The voice is one of them, and by analogy with "fingerprint," we call its signature a "voiceprint."

According to "Fundamentals of Biometric Technology" from the U.S. National Biometric Test Center at San Jose State University, the types of biometric signatures compare as follows across various factors:


Comparison Between Various Biometric Signatures

Let's read about the voiceprint recognition system and its underlying principle.

About Voiceprint Recognition System

A voiceprint is the acoustic frequency spectrum that carries the speech information in a human voice. Like a fingerprint, it is specific to the individual and can therefore serve as a means of identification. The acoustic signal itself is a one-dimensional continuous signal. Once it is discretized (sampled in time and quantized in amplitude), it becomes a digital signal that conventional computers can process.


Discretized Acoustical Signals Processed by Computers
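
As a rough illustration of the discretization step, the sketch below samples a continuous-time signal at fixed intervals and quantizes each sample to a signed integer. The function name `sample_and_quantize`, the 16 kHz sample rate, the 16-bit depth, and the 440 Hz test tone are all assumptions chosen for the example; the article does not specify them.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, bit_depth=16):
    """Sample a continuous-time signal and quantize it to signed integers."""
    max_level = 2 ** (bit_depth - 1) - 1             # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                        # sampling instant in seconds
        value = max(-1.0, min(1.0, signal(t)))        # clip amplitude to [-1, 1]
        samples.append(round(value * max_level))      # quantize the amplitude
    return samples

def tone(t):
    """Hypothetical stand-in for a voice signal: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440.0 * t)

pcm = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=16000)
print(len(pcm), pcm[:5])
```

The resulting list of integers is the kind of discrete sequence that later feature extraction stages operate on.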

Like the fingerprint technology widely used on mobile phones, voiceprint recognition (also known as speaker recognition) is a biometric identification technology: it extracts phonetic features from a speaker's voice signal to validate the speaker's identity. Everyone has a unique voiceprint, formed gradually as the vocal organs develop. However closely an imitation resembles the original voice, the two voiceprints remain different.

The Chinese saying "someone may not yet be here bodily, but you can already hear him or her speaking" vividly describes identifying a person by voice alone. It explains why your mother knows it's you before you even finish saying "hello" over the phone. This is an extraordinary ability that humans have acquired through long-term evolution. With the latest technological innovations, recognition systems can identify a person after hearing 8 to 10 words, although identifying a speaker from a single word is still not feasible. Given more than a minute of speech, such a system can also determine whether you are one of 1,000 specified people. Two concepts are central here. The first applies to most biometric identification systems: 1:1 versus 1:N matching. The second is specific to voiceprint recognition: text dependence versus text independence.

Let's look at these principles in detail in the following section.

Working Principle

1:1 Recognition System

In this working model, you provide your identity (account) and your biometric features beforehand, and the system saves them as a template. At authentication time, you claim an identity, and the system compares the newly captured features with the template stored for that identity to determine whether the two match. Such a system is known as a 1:1 recognition system (also called speaker verification).
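
A minimal sketch of 1:1 verification follows, assuming each voiceprint has already been reduced to a fixed-length feature vector and that cosine similarity with a fixed threshold is the comparison rule. The vectors, the `0.8` threshold, and the names `verify` and `templates` are illustrative assumptions, not details from the article.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(claimed_id, probe, templates, threshold=0.8):
    """1:1 check: compare the probe only with the claimed identity's template."""
    template = templates.get(claimed_id)
    if template is None:
        return False                      # no enrollment for this account
    return cosine_similarity(probe, template) >= threshold

# Hypothetical enrolled templates (toy 3-dimensional voiceprints).
templates = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}
probe = [0.85, 0.15, 0.35]
print(verify("alice", probe, templates), verify("bob", probe, templates))
```

Only one comparison is made per attempt, which is why verification scales independently of the number of enrolled users.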

1:N Recognition System

In this working model, the system does not ask you to claim an identity. It only captures your biometric features at runtime and compares them with all the biometric records stored in the back end to find the matching one. Such a system is known as a 1:N recognition system (also called speaker identification).
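
A minimal sketch of 1:N identification, under the same illustrative assumptions (fixed-length feature vectors, cosine similarity, a hypothetical `0.8` rejection threshold): the probe is scored against every enrolled record and the best match is returned, or `None` if no record is close enough.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(probe, templates, threshold=0.8):
    """1:N search: score the probe against all records and keep the best one."""
    best_id, best_score = None, float("-inf")
    for speaker_id, template in templates.items():
        score = cosine_similarity(probe, template)
        if score > best_score:
            best_id, best_score = speaker_id, score
    # Reject the best candidate if it is still too dissimilar.
    return best_id if best_score >= threshold else None

# Hypothetical enrolled records (toy 3-dimensional voiceprints).
templates = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}
print(identify([0.12, 0.75, 0.55], templates))
```

Unlike verification, the cost of this search grows with the number of enrolled speakers.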

Figure 1 below shows a quick comparison between both the recognition systems.

Figure 1: Speaker Verification and Speaker Identification

Figure 2 below shows the working process of a simple voiceprint recognition system:

Figure 2: Working Process of a Voiceprint Recognition System

From the perspective of what the user says, voiceprint recognition systems fall into two types: text-dependent and text-independent.

As the names imply, a text-dependent system requires the user to say only system-prompted content, or content within a small allowed range, while a text-independent system does not restrict what the user says. Because the spoken content is nearly the same for every user, a text-dependent system only has to model the acoustic differences between speakers, which makes the task relatively easy. A text-independent system must account not only for the differences between speakers' voices but also for the variation caused by different spoken content, which makes the task considerably harder.

A newer approach falls between the two and is known as "limited text-dependence." Such a system combines digits or symbols at random and asks the user to read the resulting prompt aloud for voiceprint recognition. Because of this randomness, the content sequence of the collected voiceprint differs every time. The approach fits well with the short random numbers already in wide use (such as digital verification codes). It is useful for identity validation on its own or, in combination with other biometric signatures such as the face, as part of a multi-factor authentication system.
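
The random-prompt idea can be sketched as follows. This only covers generating the digit prompt and checking that the spoken content matches it; the voiceprint comparison itself is out of scope here. The function names and the 8-digit length are assumptions, and `recognized_text` stands in for the output of some speech recognizer that the article does not name.

```python
import secrets

DIGITS = "0123456789"

def make_prompt(n_digits=8):
    """Build an unpredictable digit string for the user to read aloud."""
    return "".join(secrets.choice(DIGITS) for _ in range(n_digits))

def content_matches(prompt, recognized_text):
    """Check that the digits heard are exactly the digits prompted.

    recognized_text is assumed to come from a separate speech recognizer.
    """
    spoken = "".join(ch for ch in recognized_text if ch in DIGITS)
    return spoken == prompt

prompt = make_prompt()
print(prompt, content_matches(prompt, " ".join(prompt)))
```

Because the prompt changes on every attempt, a replayed recording of an earlier session would fail the content check even before the voiceprint is compared.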

Voiceprint Recognition Algorithm: the Technical Details

Let's delve a little deeper into the technical details of the voiceprint recognition algorithm. At the feature layer, the classic Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), deep features, and Power-Normalized Cepstral Coefficients (PNCC) are all strong acoustic features used as inputs for model learning. Of these, MFCC remains the most frequently used.
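
To make the MFCC feature more concrete, here is a simplified, standard-library-only sketch of the usual pipeline for a single frame: Hamming window, power spectrum, triangular mel filterbank, logarithm, and DCT-II. It is a teaching sketch, not the article's implementation: it uses a naive DFT instead of an FFT, omits pre-emphasis and liftering, and the choices of 20 filters, 13 coefficients, a 256-sample frame, and a 16 kHz sample rate are assumptions.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Apply a Hamming window, then compute the power spectrum (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):           # rising edge of the triangle
            row[k] = (k - left) / (center - left)
        for k in range(center, right):          # falling edge of the triangle
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    """Return the first n_coeffs cepstral coefficients for one frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                    for row in bank]
    # DCT-II decorrelates the log filterbank energies.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for c in range(n_coeffs)]

# One 256-sample frame of a 440 Hz tone sampled at 16 kHz (toy input).
frame = [math.sin(2 * math.pi * 440.0 * i / 16000) for i in range(256)]
coeffs = mfcc(frame, sample_rate=16000)
print(len(coeffs))
```

In a real system, this computation runs over many overlapping frames, and the resulting sequence of coefficient vectors is what the model layer consumes.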

Multiple features can also be combined, either at the feature layer or at the model layer. At the machine learning model layer, the iVector framework that N. Dehak proposed in 2009 still plays the dominant role. Although deep learning is in the limelight today and the voiceprint field has not escaped its influence, the DNN-iVector approach derived from the legacy UBM-iVector framework changes only the feature extraction: features extracted by a DNN (or BottleNeck, BN, features) either replace MFCC or supplement it, while the back-end learning framework remains iVector.
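
For reference, the iVector framework is commonly summarized by the factor model below. The notation is the conventional one from the speaker recognition literature rather than something given in this article.

```latex
M = m + T w
```

Here, M is the speaker- and channel-dependent GMM mean supervector of an utterance, m is the mean supervector of the universal background model (UBM), T is a low-rank total variability matrix, and w is the low-dimensional iVector that represents the utterance.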

Figure 3 demonstrates a complete training and testing process of the voiceprint recognition system.

Figure 3: Complete Training and Recognition Framework of Voiceprint Recognition Algorithms

We can see that training the iVector model and then training the channel compensation model are the most important stages. In the feature phase, you can use the BottleNeck feature to replace or supplement the MFCC feature and feed it into the iVector framework for model training, as shown in Figure 4.

Figure 4: Training iVector Model with the BottleNeck Feature
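
One plausible reading of "replace or supplement" at the feature level is sketched below: per frame, the BottleNeck vector is either used on its own or concatenated with the MFCC vector before being passed to model training. The function name `combine_features`, the `mode` parameter, and the toy vectors are assumptions; the article does not describe how the features are actually joined.

```python
def combine_features(mfcc_frames, bn_frames, mode="supplement"):
    """Build per-frame input vectors from MFCC and BottleNeck features."""
    if len(mfcc_frames) != len(bn_frames):
        raise ValueError("both feature streams must cover the same frames")
    if mode == "replace":
        # Use the BottleNeck features alone.
        return [list(bn) for bn in bn_frames]
    if mode == "supplement":
        # Append the BottleNeck vector to the MFCC vector, frame by frame.
        return [list(m) + list(bn) for m, bn in zip(mfcc_frames, bn_frames)]
    raise ValueError("mode must be 'replace' or 'supplement'")

# Two frames with toy 3-dimensional MFCC and 2-dimensional BottleNeck vectors.
mfcc_frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
bn_frames = [[0.1, 0.2], [0.3, 0.4]]
print(combine_features(mfcc_frames, bn_frames))
```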

At the system layer, different features and models describe the speaker's voice from different dimensions. With effective score normalization, the subsystems can be fused to improve overall system performance substantially.
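
As a small illustration of score normalization followed by fusion, the sketch below applies zero normalization (Z-norm), which rescales a raw score by the mean and standard deviation of impostor scores, and then takes a weighted average across subsystems. Z-norm and weighted averaging are swapped in here as common examples; the article does not say which normalization or fusion method it has in mind, and all scores and weights are made up.

```python
import statistics

def z_norm(raw_score, impostor_scores):
    """Rescale a raw score using the impostor score distribution."""
    mean = statistics.mean(impostor_scores)
    std = statistics.pstdev(impostor_scores)
    return (raw_score - mean) / std

def fuse(scores, weights):
    """Weighted average of normalized scores from several subsystems."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Hypothetical raw scores from two subsystems on different scales.
score_a = z_norm(3.0, impostor_scores=[0.0, 2.0])
score_b = z_norm(70.0, impostor_scores=[20.0, 40.0])
print(fuse([score_a, score_b], weights=[1.0, 1.0]))
```

Normalizing first puts the subsystems on a comparable scale, so that a single decision threshold can be applied to the fused score.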

Conclusion

In this blog, we covered the basics of the voiceprint recognition system, the principles underlying it, and the significant role it plays in the biometric identification industry.
