- [2021/07] SoundStream: An End-to-End Neural Audio Codec [paper][code][demo] :heavy_check_mark:
- [2022/09] AudioLM: a Language Modeling Approach to Audio Generation [paper][demo]
- [2023/01] InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt [paper][code][demo] :heavy_check_mark:
- [2023/05] AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec [paper][code][demo] :heavy_check_mark:
- [2023/05] HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec [paper][code] AcademiCodec & Group-RVQ :heavy_check_mark:
- [2023/09] SpatialCodec: Neural Spatial Speech Coding [paper][code][demo] :heavy_check_mark:
- [2023/09] High-Fidelity Audio Compression with Improved RVQGAN [paper][code][demo] DAC :heavy_check_mark:
- [2023/09] SoundStorm: Efficient Parallel Audio Generation [paper][demo]
- [2023/09] High Fidelity Neural Audio Compression [paper][code][code-Unofficial] [demo] Encodec :heavy_check_mark:
- [2023/09] FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [paper][code][demo] :heavy_check_mark:
- [2023/09] Fewer-token Neural Speech Codec with Time-invariant Codes [paper][code][demo] Ti-Codec :heavy_check_mark:
- [2023/09] BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech [paper][code][demo] :heavy_check_mark:
- [2023/10] Acoustic BPE for Speech Generation with Discrete Tokens [paper][code] :heavy_check_mark:
- [2024/01] SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models [paper][code][demo] :heavy_check_mark:
- [2024/01] Residual Quantization with Implicit Neural Codebooks [paper][code] Qinco :heavy_check_mark:
- [2024/04] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [paper][code][demo] :heavy_check_mark:
- [2024/05] HILCodec: High Fidelity and Lightweight Neural Audio Codec [paper][code][demo] :heavy_check_mark:
- [2024/06] Coding Speech through Vocal Tract Kinematics [paper][code] :heavy_check_mark:
- [2024/06] Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder [paper]
- [2023/06] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding [paper][code][demo] acoustic model CTX-txt2vec and vocoder CTX-vec2wav | speech continuation and editing | similar to Encoder-Decoder :heavy_check_mark:
- [2024/04] The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge [paper]
- [2024/06] BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation [paper][demo]
- [2023/09] Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [paper]
- [2024/06] Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis [paper][code][demo] :heavy_check_mark:
- [2024/01] Finite Scalar Quantization: VQ-VAE Made Simple [paper][code] FSQ, no codebook collapse :heavy_check_mark:
- [2024/06] UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner [paper][code] LLM-Codec :heavy_check_mark:
- [2024/04] SNAC: Multi-Scale Neural Audio Codec [paper][code][demo] :heavy_check_mark:
- [2023/06] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [paper][code][demo] :heavy_check_mark:
- [2024/07] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [paper][code][demo] :heavy_check_mark:
- [2024/06] Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [paper][demo]
- [2024/02] APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding [paper][code][demo] :heavy_check_mark:
- [2024/07] dMel: Speech Tokenization made Simple [paper] Code coming soon
- [2024/07] SuperCodec: A Neural Speech Codec with Selective Back-Projection Network [paper][code][demo] :heavy_check_mark:
- [2024/04] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers [paper][code] :heavy_check_mark:
- [2024/02] Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models [paper][code][demo] :heavy_check_mark:
- [2024/06] SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models [paper][code][demo] SQ-Codec | Code coming soon
- [2024/08] SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [paper][demo]
- [2024/08] Music2Latent: Consistency Autoencoders for Latent Audio Compression [paper][code][demo] continuous latent space :heavy_check_mark:
- [2024/08] WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [paper][code][demo] :heavy_check_mark:
- [2024/08] Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [paper][code][demo] X-Codec :heavy_check_mark:
- [2024/09] SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis [paper][code][demo] :heavy_check_mark:
- [2024/09] Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation [paper][demo] CoFi-Speech
- [2024/09] NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization [paper][code] Code coming soon
- [2024/09] Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis [paper][code][demo] Watermarking :heavy_check_mark:
- [2024/09] MuCodec: Ultra Low-Bitrate Music Codec [paper][code][demo] Music Codec :heavy_check_mark:
- [2024/09] ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech [paper][code] Comprehensive Platform :heavy_check_mark:
- [2024/09] FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates [paper] Flow Matching
- [2024/09] Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice [code] S3Tokenizer :heavy_check_mark:
- [2024/10] Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models [paper][demo] Inconsistency
- [2024/09] BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec [paper][code][demo] low-bitrate neural speech codec :heavy_check_mark:
- [2024/10] Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer [paper][code][demo] fine-tuned version of DAC :heavy_check_mark:
- [2020/06] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [paper][code] :heavy_check_mark:
- [2021/06] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [paper][code] semantic information & content generation :heavy_check_mark:
- [2021/08] W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [paper]
- [2021/10] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [paper][code] semantic information & content generation :heavy_check_mark:
- [2024/10] Code Drift: Towards Idempotent Neural Audio Codecs [paper][demo] Idempotence – the stability of a codec’s decoded output under multiple rounds of encoding and decoding
- [2024/10] ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs [paper][demo] addresses codebook collapse through intra- and inter-codebook optimization
- [2024/10] DM-Codec: Distilling Multimodal Representations for Speech Tokenization [paper][code] acoustic properties, semantic meaning, and contextual clues :heavy_check_mark:
- [2024/10] LSCodec: Low-Bandwidth and Speaker-Decoupled Discrete Speech Codec [paper][demo] speaker timbre decoupling
- [2024/10] Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding [paper][demo] MsCodec, Multi-Scale Encoding
- [2024/10] APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm [paper][demo] two-stage joint-individual training paradigm
- [2024/10] A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [paper][demo] Is predicting the remaining RVQ codes necessary?
- [2024/11] DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [paper] Double-Codebook Speaker-invariant Clustering
- [2024/10] Pushing the frontiers of audio generation [blog] Google DeepMind
- [2024/11] MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios [paper][demo] modified discrete cosine transform (MDCT) as input
- [2024/11] SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer [paper][code] codebook collapse :heavy_check_mark:
- [2024/11] hertz-dev [code] WaveCodec :heavy_check_mark:
- [2024/11] Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [paper] UniCodec | several information-disentangled discrete tokens, similar to ns3_codec
- [2024/11] Towards Codec-LM Co-design for Neural Codec Language Models [paper] Code coming soon | proposes several codec-LM co-design strategies
- [2024/11] VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication [paper][demo] integrates the voice changer model directly into the speech codec
- [2024/11] Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation [paper][code][demo] aliasing-free :heavy_check_mark:
- [2024/11] PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain [paper][demo] Code coming soon | music tokenizer, similar to MsCodec
- [2024/11] Scaling Transformer for Low-bitrate High-Quality Speech Coding [paper][code][demo] Code coming soon | transformer-based, scaled to the 1B-parameter range
- [2024/11] TS3-Codec: Transformer-Based Simple Streaming Single Codec [paper] convolution-free
- [2024/12] FreeCodec: A disentangled neural speech codec with fewer tokens [paper][code][demo] Code coming soon | speaker encoder, content encoder, and prosody encoder
Note: the papers listed above are part of the GitHub repository Neural-Codec-and-Speech-Language-Models. Stars are welcome.