Papers on Neural Codec Models

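Overview: This post collects notable research on neural audio codecs and speech language models published between 2020 and 2024. The work spans end-to-end audio codecs, efficient audio generation, high-fidelity audio compression, and multimodal representation learning. Entries link to the paper and, where available, to code and a demo page, so readers can dig deeper and experiment. For example, SoundStream (2021) proposes an end-to-end neural audio codec, while AudioLM (2022) generates audio with a language-modeling approach. Other projects include InstructTTS (expressive TTS), AudioDec (an open-source streaming high-fidelity audio codec), and HiFi-Codec (a high-fidelity audio codec built on group-residual vector quantization).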
  • [2021/07] SoundStream: An End-to-End Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2022/09] AudioLM: a Language Modeling Approach to Audio Generation [paper][demo]
  • [2023/01] InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt [paper][code][demo] :heavy_check_mark:
  • [2023/05] AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2023/05] HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec [paper][code] AcademiCodec & Group-RVQ :heavy_check_mark:
  • [2023/09] SpatialCodec: Neural Spatial Speech Coding [paper][code][demo] :heavy_check_mark:
  • [2023/09] High-Fidelity Audio Compression with Improved RVQGAN [paper][code][demo] DAC :heavy_check_mark:
  • [2023/09] SoundStorm: Efficient Parallel Audio Generation [paper][demo]
  • [2023/09] High Fidelity Neural Audio Compression [paper][code][code-Unofficial] [demo] Encodec :heavy_check_mark:
  • [2023/09] FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [paper][code][demo] :heavy_check_mark:
  • [2023/09] Fewer-token Neural Speech Codec with Time-invariant Codes [paper][code][demo] Ti-Codec :heavy_check_mark:
  • [2023/09] BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech [paper][code][demo] :heavy_check_mark:
  • [2023/10] Acoustic BPE for Speech Generation with Discrete Tokens [paper][code] :heavy_check_mark:
  • [2024/01] SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models [paper][code][demo] :heavy_check_mark:
  • [2024/01] Residual Quantization with Implicit Neural Codebooks [paper][code] Qinco :heavy_check_mark:
  • [2024/04] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [paper][code][demo] :heavy_check_mark:
  • [2024/05] HILCodec: High Fidelity and Lightweight Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2024/06] Coding Speech through Vocal Tract Kinematics [paper][code] :heavy_check_mark:
  • [2024/06] Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder [paper]
  • [2023/06] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding [paper][code][demo] acoustic model CTX-txt2vec and vocoder CTX-vec2wav | speech continuation and editing | similar to Encoder-Decoder :heavy_check_mark:
  • [2024/04] The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge [paper]
  • [2024/06] BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation [paper][demo]
  • [2023/09] Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [paper]
  • [2024/06] Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/01] Finite Scalar Quantization: VQ-VAE Made Simple [paper][code] FSQ, no codebook collapse :heavy_check_mark:
  • [2024/06] UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner [paper][code] LLM-Codec :heavy_check_mark:
  • [2024/04] SNAC: Multi-Scale Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2023/06] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/07] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [paper][code][demo] :heavy_check_mark:
  • [2024/06] Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [paper][demo]
  • [2024/02] APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding [paper][code][demo] :heavy_check_mark:
  • [2024/07] dMel: Speech Tokenization made Simple [paper] Code Coming Soon
  • [2024/07] SuperCodec: A Neural Speech Codec with Selective Back-Projection Network [paper][code][demo] :heavy_check_mark:
  • [2024/04] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers [paper][code] :heavy_check_mark:
  • [2024/02] Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models [paper][code][demo] :heavy_check_mark:
  • [2024/06] SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models [paper][code][demo] SQ-Codec | Code Coming Soon
  • [2024/08] SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [paper][demo]
  • [2024/08] Music2Latent: Consistency Autoencoders for Latent Audio Compression [paper][code][demo] continuous latent space :heavy_check_mark:
  • [2024/08] WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [paper][code][demo] :heavy_check_mark:
  • [2024/08] Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [paper][code][demo] X-Codec :heavy_check_mark:
  • [2024/09] SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/09] Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation [paper][demo] CoFi-Speech
  • [2024/09] NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization [paper][code] Code Coming Soon
  • [2024/09] Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis [paper][code][demo] Watermarking :heavy_check_mark:
  • [2024/09] MuCodec: Ultra Low-Bitrate Music Codec [paper][code][demo] Music Codec :heavy_check_mark:
  • [2024/09] ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech [paper][code] Comprehensive Platform :heavy_check_mark:
  • [2024/09] FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates [paper] Flow Matching
  • [2024/09] Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice [code] S3Tokenizer :heavy_check_mark:
  • [2024/10] Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models [paper][demo] Inconsistency
  • [2024/09] BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec [paper][code][demo] low-bitrate neural speech codec :heavy_check_mark:
  • [2024/10] Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer [paper][code][demo] fine-tuned version of DAC :heavy_check_mark:
  • [2020/06] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [paper][code] :heavy_check_mark:
  • [2021/06] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [paper][code] semantic information & content generation :heavy_check_mark:
  • [2021/08] W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [paper]
  • [2021/10] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [paper][code] semantic information & content generation :heavy_check_mark:
  • [2024/10] Code Drift: Towards Idempotent Neural Audio Codecs [paper][demo] Idempotence – the stability of a codec’s decoded output under multiple rounds of encoding and decoding
  • [2024/10] ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs [paper][demo] addresses codebook collapse through intra- and inter-codebook optimization
  • [2024/10] DM-Codec: Distilling Multimodal Representations for Speech Tokenization [paper][code] acoustic properties, semantic meaning, and contextual clues :heavy_check_mark:
  • [2024/10] LSCodec: Low-Bandwidth and Speaker-Decoupled Discrete Speech Codec [paper][demo] speaker timbre decoupling
  • [2024/10] Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding [paper][demo] MsCodec, Multi-Scale Encoding
  • [2024/10] APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm [paper][demo] two-stage joint-individual training paradigm
  • [2024/10] A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [paper][demo] Is predicting the remaining RVQ codes necessary?
  • [2024/11] DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [paper] Double-Codebook Speaker-invariant Clustering
  • [2024/10] Pushing the frontiers of audio generation [blog] Google DeepMind
  • [2024/11] MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios [paper][demo] modified discrete cosine transform (MDCT) as input
  • [2024/11] SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer [paper][code] codebook collapse :heavy_check_mark:
  • [2024/11] hertz-dev [code] WaveCodec :heavy_check_mark:
  • [2024/11] Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [paper] UniCodec | several information-disentangled discrete tokens, similar to ns3_codec
  • [2024/11] Towards Codec-LM Co-design for Neural Codec Language Models [paper] Code Coming Soon | proposes several codec-LM co-design strategies
  • [2024/11] VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication [paper][demo] integrates the Voice Changer model directly into the speech Codec
  • [2024/11] Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation [paper][code][demo] aliasing-free :heavy_check_mark:
  • [2024/11] PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain [paper][demo] Code Coming Soon | Music Tokenizer, similar to MsCodec
  • [2024/11] Scaling Transformer for Low-bitrate High-Quality Speech Coding [paper][code][demo] Code Coming Soon | Transformer-based codec scaled into the 1B-parameter range
  • [2024/11] TS3-Codec: Transformer-Based Simple Streaming Single Codec [paper] convolution-free
  • [2024/12] FreeCodec: A disentangled neural speech codec with fewer tokens [paper][code][demo] Code Coming Soon | speaker encoder, content encoder and prosody encoder
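
Several entries above (for example HiFi-Codec's Group-RVQ, the RVQGAN-based DAC, and ERVQ) build on residual vector quantization (RVQ), where each stage quantizes whatever error the previous stage left behind. The following is only a toy sketch of that general idea in plain Python, not the implementation from any listed paper: the function names (`nearest`, `rvq_encode`, `rvq_decode`) and the two hand-picked codebooks are assumptions made for illustration, whereas real codecs learn their codebooks jointly with an encoder and decoder.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    def dist(entry: Vector) -> float:
        return sum((e - v) ** 2 for e, v in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks: Sequence[Codebook], x: Vector) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks: Sequence[Codebook], codes: Sequence[int]) -> Vector:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical, hand-picked codebooks: a coarse stage and a finer stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.25], [0.25, -0.25], [-0.25, 0.25], [-0.25, -0.25]],
]

codes = rvq_encode(CODEBOOKS, [0.9, 0.2])   # one index per stage
approx = rvq_decode(CODEBOOKS, codes)       # coarse entry + fine correction
```

In this sketch the first index selects a coarse approximation and the second refines it, so dropping later stages trades quality for a lower bitrate.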

Note: the papers listed above are part of the GitHub repository Neural-Codec-and-Speech-Language-Models. Stars are welcome.
