I. Low-Complexity Home Environment Sound Recognition Based on PaddleSpeech
Competition page: challenge.xfyun.cn/topic/info?…
Project: aistudio.baidu.com/aistudio/pr…
1. Competition Background
Sound is an important carrier of information: it is easy to collect and is not constrained by viewing angle or lighting, so it is widely used to support environment perception and decision making, which is why voice control is common in smart-home systems. Smart devices receive and process sound signals from the environment; sound event recognition lets them detect objects and events, such as a baby crying, gunshots, or a knock on the door, and quickly perceive changes in the environment, such as footsteps approaching, so the system can trigger the relevant devices. Sound event recognition is therefore already used in intelligent-perception applications such as security surveillance and audio content retrieval, supporting new forms of human-computer interaction and machine hearing.
On the application side, however, there are two main challenges: 1) data: home environments are complex, so recordings contain considerable noise; 2) hardware: smart-home devices have limited compute and storage.
2. Competition Task
Sound event recognition needs strong data behind it. This competition provides audio from the 品冠科技 cloud platform as training samples, covering six classes: watching TV, gas alarm, stir-frying, running water, drawing curtains, and a child crying, labeled 1 through 6 respectively. The audio file names contain the sound type, which participants can use to categorize the files. For data-security reasons, all data has been desensitized. Participants must build a low-complexity quantized model from the provided samples that takes audio as input and predicts the corresponding sound event (i.e. the sound type).
The competition imposes a complexity limit, measured by parameter count: submitted models must have fewer than 1M parameters, with parameters quantized to INT8. Parameter counting follows the unified method at:
github.com/AlbertoAnci… ; any quantization method may be used.
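For a rough self-check before submission, counting parameters in a Paddle model can be sketched as follows (a minimal illustration only; count_params is a hypothetical helper, not the official counting script linked above):

import numpy as np

# Hypothetical helper: count the trainable parameters of a Paddle model
def count_params(model):
    return sum(int(np.prod(p.shape)) for p in model.parameters())

# With INT8 quantization each parameter occupies 1 byte,
# so a model with fewer than 1M parameters fits in roughly 1 MB.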
II. Dataset Processing
1. Dataset Format Processing
!wget https://ai-contest-static.xfyun.cn/2022/%E6%95%B0%E6%8D%AE%E9%9B%86/%E4%BD%8E%E5%A4%8D%E6%9D%82%E5%BA%A6%E5%AE%B6%E5%BA%AD%E7%8E%AF%E5%A2%83%E9%9F%B3%E6%8C%91%E6%88%98%E8%B5%9B%E5%85%AC%E5%BC%80%E6%95%B0%E6%8D%AE.zip -O dataset.zip
--2022-08-29 09:31:26--  https://ai-contest-static.xfyun.cn/2022/%E6%95%B0%E6%8D%AE%E9%9B%86/%E4%BD%8E%E5%A4%8D%E6%9D%82%E5%BA%A6%E5%AE%B6%E5%BA%AD%E7%8E%AF%E5%A2%83%E9%9F%B3%E6%8C%91%E6%88%98%E8%B5%9B%E5%85%AC%E5%BC%80%E6%95%B0%E6%8D%AE.zip
Resolving ai-contest-static.xfyun.cn (ai-contest-static.xfyun.cn)... 220.181.53.219
Connecting to ai-contest-static.xfyun.cn (ai-contest-static.xfyun.cn)|220.181.53.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2361442488 (2.2G) [application/zip]
Saving to: 'dataset.zip'

dataset.zip         100%[===================>]   2.20G  7.34MB/s    in 4m 54s

2022-08-29 09:36:21 (7.65 MB/s) - 'dataset.zip' saved [2361442488/2361442488]
!unzip -qoa -O GBK dataset.zip
!mv 低复杂度家庭环境音挑战赛公开数据 dataset
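A quick look at the extracted layout (illustrative; the train/<label>_<class> subfolder names are inferred from the cells below):

!ls dataset
!ls dataset/train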
2. Installing PaddleSpeech

!python -m pip install -U -q pip --user
!pip install -q pytest-runner
!pip install -q paddlespeech
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
parl 1.4.1 requires pyzmq==18.1.1, but you have pyzmq 23.2.0 which is incompatible.
3. Inspecting an Audio File

import warnings
warnings.filterwarnings("ignore")
import IPython
import numpy as np
import matplotlib.pyplot as plt
import paddle
%matplotlib inline

from paddlespeech.audio import load

# Load as a single channel of float32 samples
data, sr = load(file='dataset/train/1_看电视/001.wav', mono=True, dtype='float32')
print('wav shape: {}'.format(data.shape))
print('sample rate: {}'.format(sr))

# Plot the waveform
plt.figure()
plt.plot(data)
plt.show()
wav shape: (1920000,)
sample rate: 16000
4. Audio Length Processing

# Check audio durations
import contextlib
import wave

def get_sound_len(file_path):
    with contextlib.closing(wave.open(file_path, 'r')) as f:
        frames = f.getnframes()
        rate = f.getframerate()
        wav_length = frames / float(rate)
    return wav_length

# Collect all wav files
import glob
sound_files = glob.glob('dataset/train/*/*.wav')
print(sound_files[0])
print(len(sound_files))

# Find the longest and shortest clips
sounds_len = []
for sound in sound_files:
    sounds_len.append(get_sound_len(sound))
print("Longest audio:", max(sounds_len), "seconds")
print("Shortest audio:", min(sounds_len), "seconds")

dataset/train/3_炒菜/091.wav
616
Longest audio: 120.0 seconds
Shortest audio: 120.0 seconds

The longest clip is 120 seconds; now normalize every clip to that uniform length.
!pip install pydub -q
# Inspect audio metadata
import math
import soundfile as sf
import numpy as np
import librosa

data, samplerate = sf.read('dataset/train/1_看电视/001.wav')
channels = len(data.shape)
length_s = len(data) / float(samplerate)
format_rate = 16000
print(f"channels: {channels}")
print(f"length_s: {length_s}")
print(f"samplerate: {samplerate}")

channels: 2
length_s: 120.0
samplerate: 16000
label_list = ['1_看电视', '2_燃气报警', '3_炒菜', '4_流水', '5_拉窗帘', '6_小孩哭泣']
# If a clip is shorter than the maximum length, tile it by repetition,
# then cut the result down to exactly 120 s
from pydub import AudioSegment

def convert_sound_len(filename):
    audio = AudioSegment.from_wav(filename)
    i = 1
    padded = audio * i
    while padded.duration_seconds * 1000 < 120000:
        i = i + 1
        padded = audio * i
    padded[0:120000].set_frame_rate(16000).export(filename, format='wav')

# Normalize every audio file to the fixed length
for sound in sound_files:
    convert_sound_len(sound)
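As a sanity check, re-running the duration statistics from above should now report exactly 120 seconds for every file:

# Re-check durations after normalization; all should be exactly 120 s
sounds_len = [get_sound_len(s) for s in sound_files]
print("Shortest:", min(sounds_len), "Longest:", max(sounds_len))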
5. Generating File Lists
Generate the train and val file lists with a 9:1 split.
import os
import random

def get_data_list(target_path, train_list_path, eval_list_path):
    '''
    Generate the data lists.
    '''
    # Folder names, one per class
    data_list_path = target_path
    class_dirs = os.listdir(data_list_path)
    if '__MACOSX' in class_dirs:
        class_dirs.remove('__MACOSX')
    # Lines to be written to eval.txt and train.txt
    trainer_list = []
    eval_list = []
    # Visit the classes in random order
    random.shuffle(class_dirs)
    for class_dir in class_dirs:
        class_label = label_list.index(class_dir)
        i = 0
        if class_dir != ".DS_Store":
            path = os.path.join(data_list_path, class_dir)
            # All audio files of this class
            sound_paths = os.listdir(path)
            for sound_path in sound_paths:
                if sound_path == '.DS_Store':
                    continue
                i += 1
                name_path = os.path.join(path, sound_path)  # full path of one audio file
                # Every 10th file goes to the eval list (9:1 split)
                if i % 10 == 0:
                    eval_list.append(name_path + ",%d" % class_label + "\n")
                else:
                    trainer_list.append(name_path + ",%d" % class_label + "\n")
    with open(eval_list_path, 'a') as f:
        for eval_sound in eval_list:
            f.write(eval_sound)
    # Shuffle the training list
    random.shuffle(trainer_list)
    with open(train_list_path, 'a') as f2:
        for train_sound in trainer_list:
            f2.write(train_sound)
    print('Data lists generated!')

target_path = "dataset/train"
train_list_path = 'train_list.csv'
eval_list_path = 'eval_list.csv'

# Empty train_list.csv and eval_list.csv before generating new lists
with open(train_list_path, 'w') as f:
    f.seek(0)
    f.truncate()
with open(eval_list_path, 'w') as f:
    f.seek(0)
    f.truncate()

# Generate the data lists
get_data_list(target_path, train_list_path, eval_list_path)

Data lists generated!
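It is worth peeking at the generated lists to confirm the path,label format (illustrative):

# Inspect the first entry of each generated list
with open('train_list.csv') as f:
    print(f.readline().strip())
with open('eval_list.csv') as f:
    print(f.readline().strip())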
6. Custom Dataset

import os
from paddlespeech.audio.datasets.dataset import AudioClassificationDataset

class CustomDataset(AudioClassificationDataset):
    # Initialization
    def __init__(self, mode, **kwargs):
        files, labels = self._get_data(mode)
        super(CustomDataset, self).__init__(
            files=files, labels=labels, feat_type='raw', **kwargs)

    # Return the audio file paths and label values
    def _get_data(self, mode):
        files = []
        labels = []
        file_list = f"{mode}_list.csv"
        with open(file_list, 'r') as f:
            lines = f.readlines()
        for line in lines:
            files.append(line.split(',')[0])
            labels.append(int(line.split(',')[-1]))  # int() also strips the trailing newline
        return files, labels
# Define the dataloaders
import paddle
from paddlespeech.audio.features import LogMelSpectrogram

# Feature config should be aligned with the pretrained model
sample_rate = 16000
feat_conf = {
    'sr': sample_rate,
    'n_fft': 1024,
    'hop_length': 320,
    'window': 'hann',
    'win_length': 1024,
    'f_min': 50.0,
    'f_max': 14000.0,
    'n_mels': 64,
}
feature_extractor = LogMelSpectrogram(**feat_conf)

batch_size = 16
train_ds = CustomDataset(mode="train", sample_rate=sample_rate)
train_loader = paddle.io.DataLoader(train_ds, batch_size=batch_size, shuffle=True)
eval_ds = CustomDataset(mode="eval", sample_rate=sample_rate)
dev_loader = paddle.io.DataLoader(eval_ds, batch_size=batch_size)
W0830 11:09:22.568195  6840 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0830 11:09:22.571982  6840 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
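Since feat_type='raw', the loaders yield raw waveforms; feature extraction happens later, inside the training loop. A quick shape check (illustrative; the expected shapes assume 120 s clips at 16 kHz and batch_size=16):

# Fetch one batch and inspect shapes
for waveforms, labels in train_loader:
    print(waveforms.shape)  # expected: [16, 1920000] (120 s x 16000 Hz)
    print(labels.shape)     # expected: [16]
    break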
III. Model Training
1. Selecting a Pretrained Model
Use cnn14 as the backbone to extract audio features:

from paddlespeech.cls.models import cnn14

backbone = cnn14(pretrained=True, extract_embedding=True)
[2022-08-30 11:09:23,739] [ INFO] - PaddleAudio | unique_endpoints {''}
[2022-08-30 11:09:23,742] [ INFO] - PaddleAudio | Found /home/aistudio/.paddlespeech/models/panns/panns_cnn14.pdparams
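The backbone maps each clip to a fixed-size embedding, which the classifier head below consumes; its dimension can be checked directly (for PANNs cnn14 this is expected to be 2048):

# Embedding size consumed by the downstream classifier
print(backbone.emb_size)  # expected: 2048 for cnn14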
2. Building the Classification Model
SoundClassifier takes cnn14 as its backbone and adds a downstream classification network:

import paddle.nn as nn

class SoundClassifier(nn.Layer):
    def __init__(self, backbone, num_class, dropout=0.1):
        super().__init__()
        self.backbone = backbone
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(self.backbone.emb_size, num_class)

    def forward(self, x):
        x = x.unsqueeze(1)
        x = self.backbone(x)
        x = self.dropout(x)
        logits = self.fc(x)
        return logits

model = SoundClassifier(backbone, num_class=6)
3. Finetuning

# Define the optimizer and the loss
optimizer = paddle.optimizer.Adam(learning_rate=1e-4, parameters=model.parameters())
criterion = paddle.nn.loss.CrossEntropyLoss()
from paddlespeech.audio.utils import logger

epochs = 20
steps_per_epoch = len(train_loader)
log_freq = 10
eval_freq = 10

for epoch in range(1, epochs + 1):
    model.train()

    avg_loss = 0
    num_corrects = 0
    num_samples = 0
    for batch_idx, batch in enumerate(train_loader):
        waveforms, labels = batch
        feats = feature_extractor(waveforms)
        feats = paddle.transpose(feats, [0, 2, 1])  # [B, N, T] -> [B, T, N]
        logits = model(feats)

        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        if isinstance(optimizer._learning_rate, paddle.optimizer.lr.LRScheduler):
            optimizer._learning_rate.step()
        optimizer.clear_grad()

        # Calculate loss
        avg_loss += loss.numpy()[0]

        # Calculate metrics
        preds = paddle.argmax(logits, axis=1)
        num_corrects += (preds == labels).numpy().sum()
        num_samples += feats.shape[0]

        if (batch_idx + 1) % log_freq == 0:
            lr = optimizer.get_lr()
            avg_loss /= log_freq
            avg_acc = num_corrects / num_samples

            print_msg = 'Epoch={}/{}, Step={}/{}'.format(
                epoch, epochs, batch_idx + 1, steps_per_epoch)
            print_msg += ' loss={:.4f}'.format(avg_loss)
            print_msg += ' acc={:.4f}'.format(avg_acc)
            print_msg += ' lr={:.6f}'.format(lr)
            logger.train(print_msg)

            avg_loss = 0
            num_corrects = 0
            num_samples = 0

        if epoch % eval_freq == 0 and batch_idx + 1 == steps_per_epoch:
            model.eval()
            num_corrects = 0
            num_samples = 0
            with logger.processing('Evaluation on validation dataset'):
                for batch_idx, batch in enumerate(dev_loader):
                    waveforms, labels = batch
                    feats = feature_extractor(waveforms)
                    feats = paddle.transpose(feats, [0, 2, 1])
                    logits = model(feats)

                    preds = paddle.argmax(logits, axis=1)
                    num_corrects += (preds == labels).numpy().sum()
                    num_samples += feats.shape[0]

            print_msg = '[Evaluation result]'
            print_msg += ' dev_acc={:.4f}'.format(num_corrects / num_samples)
            logger.eval(print_msg)
[2022-08-30 11:22:36,427] [ TRAIN] - PaddleAudio | Epoch=17/20, Step=30/35 loss=0.0292 acc=0.9938 lr=0.000100
[2022-08-30 11:22:56,232] [ TRAIN] - PaddleAudio | Epoch=18/20, Step=10/35 loss=0.1053 acc=0.9625 lr=0.000100
[2022-08-30 11:23:09,429] [ TRAIN] - PaddleAudio | Epoch=18/20, Step=20/35 loss=0.0349 acc=1.0000 lr=0.000100
[2022-08-30 11:23:22,678] [ TRAIN] - PaddleAudio | Epoch=18/20, Step=30/35 loss=0.0217 acc=1.0000 lr=0.000100
[2022-08-30 11:23:42,471] [ TRAIN] - PaddleAudio | Epoch=19/20, Step=10/35 loss=0.0464 acc=0.9875 lr=0.000100
[2022-08-30 11:23:55,696] [ TRAIN] - PaddleAudio | Epoch=19/20, Step=20/35 loss=0.0748 acc=0.9750 lr=0.000100
[2022-08-30 11:24:08,908] [ TRAIN] - PaddleAudio | Epoch=19/20, Step=30/35 loss=0.0855 acc=0.9750 lr=0.000100
[2022-08-30 11:24:28,751] [ TRAIN] - PaddleAudio | Epoch=20/20, Step=10/35 loss=0.0456 acc=0.9875 lr=0.000100
[2022-08-30 11:24:41,975] [ TRAIN] - PaddleAudio | Epoch=20/20, Step=20/35 loss=0.0383 acc=0.9875 lr=0.000100
[2022-08-30 11:24:55,153] [ TRAIN] - PaddleAudio | Epoch=20/20, Step=30/35 loss=0.0494 acc=1.0000 lr=0.000100
[2022-08-30 11:25:03,517] Evaluation on validation dataset...
PaddleAudio | [Evaluation result] dev_acc=0.8983
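The notebook keeps the finetuned model in memory for the prediction step below. To persist it across sessions, a standard Paddle save/load would look like this (an addition, not part of the original notebook):

# Optional: save the finetuned weights (not in the original notebook)
paddle.save(model.state_dict(), 'sound_classifier.pdparams')
# Reload later with:
# model.set_state_dict(paddle.load('sound_classifier.pdparams'))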
IV. Model Prediction

import glob

test_files = glob.glob("dataset/test/*.wav")
print(len(test_files))
199
top_k = 3
n_fft = 1024
win_length = 1024
hop_length = 320
f_min = 50.0
f_max = 16000.0  # note: training used f_max=14000.0; keeping this consistent would be safer
wav_file = 'dataset/test/001.wav'

waveform, sr = load(wav_file, sr=sr)
feature_extractor = LogMelSpectrogram(
    sr=sr,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length,
    window='hann',
    f_min=f_min,
    f_max=f_max,
    n_mels=64)
feats = feature_extractor(paddle.to_tensor(waveform).unsqueeze(0))
feats = paddle.transpose(feats, [0, 2, 1])  # [B, N, T] -> [B, T, N]
logits = model(feats)
probs = nn.functional.softmax(logits, axis=1).numpy()
sorted_indices = probs[0].argsort()
print(sorted_indices)
[4 3 1 2 0 5]
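The cell that printed the top-k predictions appears to have been dropped from the original notebook; a minimal reconstruction consistent with the output below, using the top_k and label_list defined earlier:

# Print the top-k classes with their probabilities, highest first
for idx in sorted_indices[-top_k:][::-1]:
    print(f'{label_list[idx]}: {probs[0][idx]:.5f}')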
6_小孩哭泣: 0.92871
1_看电视: 0.04996
3_炒菜: 0.02015
from paddlespeech.audio import load

f = open("result.csv", 'w')
f.write('id,label\n')
for wav_file in test_files:
    waveform, sr = load(wav_file)
    feature_extractor = LogMelSpectrogram(
        sr=sr,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        window='hann',
        f_min=f_min,
        f_max=f_max,
        n_mels=64)
    feats = feature_extractor(paddle.to_tensor(waveform).unsqueeze(0))
    feats = paddle.transpose(feats, [0, 2, 1])  # [B, N, T] -> [B, T, N]
    logits = model(feats)
    probs = nn.functional.softmax(logits, axis=1).numpy()
    sorted_indices = probs[0].argsort()
    filename = os.path.basename(wav_file)
    label = sorted_indices[-1] + 1  # class indices are 0-based; submission labels are 1-6
    # print(f'{filename}, {label}')
    f.write(f'{filename},{label}\n')
f.close()
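Before submitting, a quick format check on result.csv doesn't hurt (illustrative):

# Verify the submission header and a few rows
with open('result.csv') as f:
    for line in f.read().splitlines()[:4]:
        print(line)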
Download result.csv and submit it to receive a score.