Papers on Neural Codec Models

Summary: This post collects important recent research on neural audio codecs and speech language models, covering progress from roughly 2020 to 2024, including end-to-end audio codecs, efficient audio generation, high-fidelity audio compression, and multimodal representation learning. Each entry links to the paper and, where available, code and a demo page, so readers can dig deeper and experiment. For example, SoundStream (2021) proposed an end-to-end neural audio codec, while AudioLM (2022) generates audio through a language-modeling approach. Other projects such as InstructTTS, AudioDec, and HiFi-Codec have delivered notable results in expressive TTS, open-source high-fidelity audio coding, and high-fidelity audio compression, respectively.
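Many of the codecs listed below (e.g. SoundStream, EnCodec, HiFi-Codec's Group-RVQ, and DAC's improved RVQGAN) build on residual vector quantization (RVQ), where each quantizer encodes the residual left over by the previous one. The following is a minimal, illustrative PyTorch sketch of plain RVQ with a straight-through estimator; the codebook size, latent dimension, and nearest-neighbor lookup are generic assumptions for illustration, not the implementation of any listed paper, and the usual commitment/codebook losses are omitted.

```python
# Minimal residual vector quantization (RVQ) sketch; illustrative only,
# not the implementation from any paper listed below.
import torch
import torch.nn as nn


class ResidualVQ(nn.Module):
    def __init__(self, num_quantizers=8, codebook_size=1024, dim=128):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    def forward(self, z):  # z: (batch, frames, dim) encoder output
        residual, quantized, codes = z, 0.0, []
        for codebook in self.codebooks:
            # Nearest codeword for the current residual.
            dist = torch.cdist(residual.reshape(-1, residual.size(-1)), codebook.weight)
            idx = dist.argmin(dim=-1).reshape(residual.shape[:-1])  # (batch, frames)
            q = codebook(idx)
            quantized = quantized + q   # running sum of selected codewords
            residual = residual - q     # the next quantizer sees what is left
            codes.append(idx)
        # Straight-through estimator: gradients flow back to the encoder output z.
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(codes, dim=-1)  # codes: (batch, frames, num_quantizers)


if __name__ == "__main__":
    rvq = ResidualVQ()
    z = torch.randn(2, 50, 128)      # e.g. 2 utterances, 50 encoder frames
    zq, codes = rvq(z)
    print(zq.shape, codes.shape)     # [2, 50, 128], [2, 50, 8]
```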
  • [2021/07] SoundStream: An End-to-End Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2022/09] AudioLM: a Language Modeling Approach to Audio Generation [paper][demo]
  • [2023/01] InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt [paper][code][demo] :heavy_check_mark:
  • [2023/05] AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2023/05] HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec [paper][code] AcademiCodec & Group-RVQ :heavy_check_mark:
  • [2023/09] SpatialCodec: Neural Spatial Speech Coding [paper][code][demo] :heavy_check_mark:
  • [2023/09] High-Fidelity Audio Compression with Improved RVQGAN [paper][code][demo] DAC :heavy_check_mark:
  • [2023/09] SoundStorm: Efficient Parallel Audio Generation [paper][demo]
  • [2023/09] High Fidelity Neural Audio Compression [paper][code][code-Unofficial] [demo] Encodec :heavy_check_mark:
  • [2023/09] FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [paper][code][demo] :heavy_check_mark:
  • [2023/09] Fewer-token Neural Speech Codec with Time-invariant Codes [paper][code][demo] Ti-Codec :heavy_check_mark:
  • [2023/09] BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech [paper][code][demo] :heavy_check_mark:
  • [2023/10] Acoustic BPE for Speech Generation with Discrete Tokens [paper][code] :heavy_check_mark:
  • [2024/01] SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models [paper][code][demo] :heavy_check_mark:
  • [2024/01] Residual Quantization with Implicit Neural Codebooks [paper][code] Qinco :heavy_check_mark:
  • [2024/04] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [paper][code][demo] :heavy_check_mark:
  • [2024/05] HILCodec: High Fidelity and Lightweight Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2024/06] Coding Speech through Vocal Tract Kinematics [paper][code] :heavy_check_mark:
  • [2024/06] Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder [paper]
  • [2023/06] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding [paper][code][demo] acoustic model CTX-txt2vec and vocoder CTX-vec2wav | speech continuation and editing | similar to Encoder-Decoder :heavy_check_mark:
  • [2024/04] The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge [paper]
  • [2024/06] BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation [paper][demo]
  • [2023/09] Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [paper]
  • [2024/06] Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/01] Finite Scalar Quantization: VQ-VAE Made Simple [paper][code] FSQ, no codebook collapse (see the FSQ sketch after this list) :heavy_check_mark:
  • [2024/06] UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner [paper][code] LLM-Codec :heavy_check_mark:
  • [2024/04] SNAC: Multi-Scale Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2023/06] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/07] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [paper][code][demo] :heavy_check_mark:
  • [2024/06] Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [paper][demo]
  • [2024/02] APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding [paper][code][demo] :heavy_check_mark:
  • [2024/07] dMel: Speech Tokenization made Simple [paper] Code Coming Soon
  • [2024/07] SuperCodec: A Neural Speech Codec with Selective Back-Projection Network [paper][code][demo] :heavy_check_mark:
  • [2024/04] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers [paper][code] :heavy_check_mark:
  • [2024/02] Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models [paper][code][demo] :heavy_check_mark:
  • [2024/06] SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models [paper][code][demo] SQ-Codec | Code Coming Soon
  • [2024/08] SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [paper][demo]
  • [2024/08] Music2Latent: Consistency Autoencoders for Latent Audio Compression [paper][code][demo] continuous latent space :heavy_check_mark:
  • [2024/08] WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [paper][code][demo] :heavy_check_mark:
  • [2024/08] Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [paper][code][demo] X-Codec :heavy_check_mark:
  • [2024/09] SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/09] Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation [paper][demo] CoFi-Speech
  • [2024/09] NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization [paper][code] Code Coming Soon
  • [2024/09] Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis [paper][code][demo] Watermarking :heavy_check_mark:
  • [2024/09] MuCodec: Ultra Low-Bitrate Music Codec [paper][code][demo] Music Codec :heavy_check_mark:
  • [2024/09] ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech [paper][code] Comprehensive Platform :heavy_check_mark:
  • [2024/09] FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates [paper] Flow Matching
  • [2024/09] Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice [code] S3Tokenizer :heavy_check_mark:
  • [2024/10] Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models [paper][demo] Inconsistency
  • [2024/09] BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec [paper][code][demo] low-bitrate neural speech codec :heavy_check_mark:
  • [2024/10] Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer [paper][code][demo] fine-tuned version of DAC :heavy_check_mark:
  • [2020/06] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [paper][code] :heavy_check_mark:
  • [2021/06] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [paper][code] semantic information & content generation :heavy_check_mark:
  • [2021/08] W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [paper]
  • [2021/10] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [paper][code] semantic information & content generation :heavy_check_mark:
  • [2024/10] Code Drift: Towards Idempotent Neural Audio Codecs [paper][demo] Idempotence – the stability of a codec’s decoded output under multiple rounds of encoding and decoding
  • [2024/10] ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs [paper][demo] address codebook collapse based on intra- and inter-codebook optimization
  • [2024/10] DM-Codec: Distilling Multimodal Representations for Speech Tokenization [paper][code] acoustic properties, semantic meaning, and contextual clues :heavy_check_mark:
  • [2024/10] LSCodec: Low-Bandwidth and Speaker-Decoupled Discrete Speech Codec [paper][demo] speaker timbre decouple
  • [2024/10] Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding [paper][demo] MsCodec, Multi-Scale Encoding
  • [2024/10] APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm [paper][demo] two-stage joint-individual training paradigm
  • [2024/10] A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [paper][demo] Is predicting the remaining RVQ codes necessary?
  • [2024/11] DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [paper] Double-Codebook Speaker-invariant Clustering
  • [2024/10] Pushing the frontiers of audio generation [blog] Google DeepMind
  • [2024/11] MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios [paper][demo] discrete cosine transform (MDCT) as input
  • [2024/11] SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer [paper][code] codebook collapse :heavy_check_mark:
  • [2024/11] hertz-dev [code] WaveCodec :heavy_check_mark:
  • [2024/11] Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [paper] UniCodec | several information-disentangled discrete tokens, similar to ns3_codec
  • [2024/11] Towards Codec-LM Co-design for Neural Codec Language Models [paper] Code Coming Soon | proposes several codec-LM co-design strategies
  • [2024/11] VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication [paper][demo] integrates the Voice Changer model directly into the speech Codec
  • [2024/11] Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation [paper][code][demo] aliasing-free :heavy_check_mark:
  • [2024/11] PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain [paper][demo] Code Coming Soon | Music Tokenizer, Similar to MsCodec
  • [2024/11] Scaling Transformer for Low-bitrate High-Quality Speech Coding [paper][code][demo] Code Coming Soon | transformer-based, scaled into the 1B-parameter range
  • [2024/11] TS3-Codec: Transformer-Based Simple Streaming Single Codec [paper] convolution-free
  • [2024/12] FreeCodec: A disentangled neural speech codec with fewer tokens [paper][code][demo] Code Coming Soon | speaker encoder, content encoder and prosody encoder
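
Several entries above target codebook collapse (FSQ, SimVQ, ERVQ, the dual-decoding PQ-VAE). As referenced from the FSQ entry, here is a minimal, illustrative sketch of finite scalar quantization, which drops the learned codebook entirely and rounds each bounded latent channel to a few fixed levels; the level counts and the tanh bounding used here are assumptions for illustration, not the exact recipe from the paper.

```python
# Minimal finite scalar quantization (FSQ) sketch; illustrative only.
import torch


def fsq(z, levels=(8, 5, 5, 5)):
    """Round each latent channel to a small fixed set of levels.

    z: (..., len(levels)) encoder output. Returns the quantized tensor,
    with gradients passed through via the straight-through estimator.
    """
    levels = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half        # squash each channel into [-half, half]
    quantized = torch.round(bounded)      # snap to the nearest integer level
    # Straight-through: rounded values forward, tanh gradients backward.
    return bounded + (quantized - bounded).detach()


if __name__ == "__main__":
    z = torch.randn(2, 50, 4, requires_grad=True)  # 4 channels -> 8*5*5*5 = 1000 implicit codes
    zq = fsq(z)
    zq.sum().backward()                            # gradients still reach z despite rounding
    print(zq.shape, z.grad.shape)
```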

Note: The papers listed above are a subset of the GitHub repository Neural-Codec-and-Speech-Language-Models; stars are welcome.
