Papers on Neural Codec Models

Introduction: This document collects important research on neural audio codecs and speech language models, covering advances from 2020 to 2024. Topics include end-to-end audio codecs, efficient audio generation, high-fidelity audio compression, and multimodal representation learning. Where available, each entry links to the paper, code, and demo page for further reading and experimentation. For example, SoundStream (2021) proposed an end-to-end neural audio codec, while AudioLM (2022) generates audio via a language-modeling approach. Projects such as InstructTTS, AudioDec, and HiFi-Codec have achieved notable results in expressive TTS, open-source high-fidelity audio codecs, and high-fidelity audio compression, respectively.
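Many of the codecs below (SoundStream, EnCodec, DAC, HiFi-Codec) are built on residual vector quantization (RVQ). As background for the list, here is a minimal pure-Python sketch of the RVQ encoding step; the random codebooks and dimensions are made up for illustration, since real codecs learn the codebooks jointly with the encoder and decoder.

```python
import random

def rvq_encode(x, codebooks):
    """Residual vector quantization (RVQ): each stage picks the codeword
    nearest to the residual left by the previous stage, so later
    codebooks progressively refine the reconstruction."""
    residual = list(x)
    indices, quantized = [], [0.0] * len(x)
    for cb in codebooks:  # cb: list of codewords, each a list of floats
        # nearest codeword to the current residual (squared Euclidean distance)
        idx = min(range(len(cb)),
                  key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, cb[i])))
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, cb[idx])]
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return indices, quantized

# Toy setup: 8-dim latent, 4 quantizer stages, 16 codewords per stage.
random.seed(0)
dim, stages, cb_size = 8, 4, 16
codebooks = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(cb_size)]
             for _ in range(stages)]
x = [random.gauss(0, 1) for _ in range(dim)]

indices, x_hat = rvq_encode(x, codebooks)
print(indices)  # one integer code per stage -> stages * log2(cb_size) bits per frame
```

Dropping trailing stages at decode time is what gives RVQ codecs their variable bitrate: the first codebook carries a coarse reconstruction, and each extra stage only refines it.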
  • [2021/07] SoundStream: An End-to-End Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2022/09] AudioLM: a Language Modeling Approach to Audio Generation [paper][demo]
  • [2023/01] InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt [paper][code][demo] :heavy_check_mark:
  • [2023/05] AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2023/05] HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec [paper][code] AcademiCodec & Group-RVQ :heavy_check_mark:
  • [2023/09] SpatialCodec: Neural Spatial Speech Coding [paper][code][demo] :heavy_check_mark:
  • [2023/09] High-Fidelity Audio Compression with Improved RVQGAN [paper][code][demo] DAC :heavy_check_mark:
  • [2023/09] SoundStorm: Efficient Parallel Audio Generation [paper][demo]
  • [2023/09] High Fidelity Neural Audio Compression [paper][code][code-Unofficial] [demo] Encodec :heavy_check_mark:
  • [2023/09] FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [paper][code][demo] :heavy_check_mark:
  • [2023/09] Fewer-token Neural Speech Codec with Time-invariant Codes [paper][code][demo] Ti-Codec :heavy_check_mark:
  • [2023/09] BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech [paper][code][demo] :heavy_check_mark:
  • [2023/10] Acoustic BPE for Speech Generation with Discrete Tokens [paper][code] :heavy_check_mark:
  • [2024/01] SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models [paper][code][demo] :heavy_check_mark:
  • [2024/01] Residual Quantization with Implicit Neural Codebooks [paper][code] Qinco :heavy_check_mark:
  • [2024/04] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [paper][code][demo] :heavy_check_mark:
  • [2024/05] HILCodec: High Fidelity and Lightweight Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2024/06] Coding Speech through Vocal Tract Kinematics [paper][code] :heavy_check_mark:
  • [2024/06] Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder [paper]
  • [2023/06] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding [paper][code][demo] acoustic model CTX-txt2vec and vocoder CTX-vec2wav | speech continuation and editing | similar to Encoder-Decoder :heavy_check_mark:
  • [2024/04] The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge [paper]
  • [2024/06] BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation [paper][demo]
  • [2023/09] Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [paper]
  • [2024/06] Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/01] Finite Scalar Quantization: VQ-VAE Made Simple [paper][code] FSQ, no codebook collapse :heavy_check_mark:
  • [2024/06] UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner [paper][code] LLM-Codec :heavy_check_mark:
  • [2024/04] SNAC: Multi-Scale Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2023/06] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/07] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [paper][code][demo] :heavy_check_mark:
  • [2024/06] Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [paper][demo]
  • [2024/02] APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding [paper][code][demo] :heavy_check_mark:
  • [2024/07] dMel: Speech Tokenization made Simple [paper] Code Coming Soon
  • [2024/07] SuperCodec: A Neural Speech Codec with Selective Back-Projection Network [paper][code][demo] :heavy_check_mark:
  • [2024/04] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers [paper][code] :heavy_check_mark:
  • [2024/02] Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models [paper][code][demo] :heavy_check_mark:
  • [2024/06] SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models [paper][code][demo] SQ-Codec | Code Coming Soon
  • [2024/08] SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [paper][demo]
  • [2024/08] Music2Latent: Consistency Autoencoders for Latent Audio Compression [paper][code][demo] continuous latent space :heavy_check_mark:
  • [2024/08] WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [paper][code][demo] :heavy_check_mark:
  • [2024/08] Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [paper][code][demo] X-Codec :heavy_check_mark:
  • [2024/09] SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/09] Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation [paper][demo] CoFi-Speech
  • [2024/09] NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization [paper][code] Code Coming Soon
  • [2024/09] Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis [paper][code][demo] Watermarking :heavy_check_mark:
  • [2024/09] MuCodec: Ultra Low-Bitrate Music Codec [paper][code][demo] Music Codec :heavy_check_mark:
  • [2024/09] ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech [paper][code] Comprehensive Platform :heavy_check_mark:
  • [2024/09] FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates [paper] Flow Matching
  • [2024/09] Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice [code] S3Tokenizer :heavy_check_mark:
  • [2024/10] Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models [paper][demo] Inconsistency
  • [2024/09] BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec [paper][code][demo] low-bitrate neural speech codec :heavy_check_mark:
  • [2024/10] Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer [paper][code][demo] fine-tuned version of DAC :heavy_check_mark:
  • [2020/06] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [paper][code] :heavy_check_mark:
  • [2021/06] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [paper][code] semantic information & content generation :heavy_check_mark:
  • [2021/08] W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [paper]
  • [2021/10] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [paper][code] semantic information & content generation :heavy_check_mark:
  • [2024/10] Code Drift: Towards Idempotent Neural Audio Codecs [paper][demo] Idempotence – the stability of a codec’s decoded output under multiple rounds of encoding and decoding
  • [2024/10] ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs [paper][demo] address codebook collapse based on intra- and inter-codebook optimization
  • [2024/10] DM-Codec: Distilling Multimodal Representations for Speech Tokenization [paper][code] acoustic properties, semantic meaning, and contextual clues :heavy_check_mark:
  • [2024/10] LSCodec: Low-Bandwidth and Speaker-Decoupled Discrete Speech Codec [paper][demo] speaker timbre decouple
  • [2024/10] Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding [paper][demo] MsCodec, Multi-Scale Encoding
  • [2024/10] APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm [paper][demo] two-stage joint-individual training paradigm
  • [2024/10] A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [paper][demo] Is predicting the remaining RVQ codes necessary?
  • [2024/11] DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [paper] Double-Codebook Speaker-invariant Clustering
  • [2024/10] Pushing the frontiers of audio generation [blog] google deepmind
  • [2024/11] MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios [paper][demo] discrete cosine transform (MDCT) as input
  • [2024/11] SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer [paper][code] codebook collapse :heavy_check_mark:
  • [2024/11] hertz-dev [code] WaveCodec :heavy_check_mark:
  • [2024/11] Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [paper] UniCodec | several information-disentangled discrete tokens, similar to ns3_codec
  • [2024/11] Towards Codec-LM Co-design for Neural Codec Language Models [paper] Code Coming Soon | proposes several codec-LM co-design strategies
  • [2024/11] VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication [paper][demo] integrates the Voice Changer model directly into the speech Codec
  • [2024/11] Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation [paper][code][demo] aliasing-free :heavy_check_mark:
  • [2024/11] PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain [paper][demo] Code Coming Soon | Music Tokenizer, Similar to MsCodec
  • [2024/11] Scaling Transformer for Low-bitrate High-Quality Speech Coding [paper][code][demo] Code Coming Soon | transformer-based, scaled to the 1B-parameter range
  • [2024/11] TS3-Codec: Transformer-Based Simple Streaming Single Codec [paper] convolution-free
  • [2024/12] FreeCodec: A disentangled neural speech codec with fewer tokens [paper][code][demo] Code Coming Soon | speaker encoder, content encoder and prosody encoder
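Several entries above (FSQ, SQ-Codec, SimVQ, ERVQ, NDVQ) target codebook collapse in vector quantizers. As a concrete illustration, here is a minimal pure-Python sketch in the spirit of finite scalar quantization (FSQ): each latent dimension is bounded and rounded onto a fixed grid, so the "codebook" is an implicit product grid with nothing learned that could collapse. The input values and level counts below are made up for the example.

```python
import math

def fsq(z, levels):
    """Finite scalar quantization (FSQ) sketch: bound each latent
    dimension with tanh, then round it onto a fixed grid of `levels`
    points in [-1, 1]. The implicit codebook is the product grid,
    so there is no learned codebook that could collapse."""
    out = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2   # grid spans [-1, 1] with n_levels points
        bounded = math.tanh(value)  # squash into (-1, 1)
        out.append(round(bounded * half) / half)
    return out

z = [0.3, -2.0, 1.5]   # hypothetical continuous encoder outputs
levels = [5, 5, 3]     # per-dimension level counts
q = fsq(z, levels)
print(q)  # each entry snapped to its per-dimension grid
# Implicit codebook size = 5 * 5 * 3 = 75 entries.
```

At training time the FSQ paper uses a straight-through estimator for the rounding step; this sketch only shows the forward quantization.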

Note: The papers above are drawn from part of the GitHub repository Neural-Codec-and-Speech-Language-Models; stars are welcome.
