Setting Up a CosyVoice Environment in the Cloud: A Step-by-Step Tutorial

2024-07-12

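Summary: Alibaba recently open-sourced a fun model, CosyVoice, that makes it easy to clone a voice. This article walks through the complete installation process. Repository: https://github.com/FunAudioLLM/CosyVoice.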

Installation steps:

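The steps below use 64-bit Ubuntu 20.04 as the example (a matching CUDA driver is available for this system, which makes setup easier):
1. Create a GPU instance (this example uses ecs.gn6i-c8g1.2xlarge, with 1 * NVIDIA T4 and 16 GB of GPU memory). Set the username and login password. The complete installation takes more than 40 GB, so allocate at least 50 GB of storage. Also select the option to install CUDA.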

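2. Configure the security group: open port 22 in the inbound direction (SSH connections arrive at the instance, so the rule belongs on the inbound side) and add your local machine's IP address to the allowed source IPs.
3. SSH into the ECS instance and run: sudo apt-get update. If you log in as root, the system reports that CUDA is being installed. Once the installation finishes, check the GPU details (if the command fails, CUDA was not installed correctly): nvidia-smi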
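The nvidia-smi check in step 3 can also be scripted, which is handy if you want a setup script to stop early when the driver is missing. This is only a minimal sketch: the helper name gpu_driver_ready is made up for illustration, and it checks nothing beyond whether the command exists and exits successfully.

```python
import shutil
import subprocess


def gpu_driver_ready(command="nvidia-smi"):
    """Return True if `command` is on PATH and exits with status 0."""
    if shutil.which(command) is None:
        # The driver tools are not installed, or not on PATH.
        return False
    try:
        result = subprocess.run(
            [command], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("GPU driver ready:", gpu_driver_ready())
```

If this prints False right after the instance is created, the CUDA installation may simply still be running in the background.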

4. Clone the repo:
A. Make sure git lfs is installed (sudo apt install git; sudo apt install git-lfs), then clone:
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
B. Install miniconda:
(1) Download the installer. Note that the miniconda package must match the Python version you plan to use:
wget https://repo.anaconda.com/miniconda/Miniconda3-py38_23.11.0-2-Linux-x86_64.sh
(2) Run the installer script and initialize: bash Miniconda3-py38_23.11.0-2-Linux-x86_64.sh
(3) (Can be done during step 2) Initialize the terminal shell so that conda can run: ~/miniconda3/bin/conda init
(4) After initialization, run bash to start a new shell with conda enabled: bash
(5) Create a new environment: conda create -n cosyvoice python=3.8 -y
(6) Activate the cosyvoice environment: conda activate cosyvoice
(7) Install the dependencies:
cd CosyVoice/
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
Install peft: pip3 install peft
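Before moving on, it can help to confirm that the cosyvoice environment really is active and that the key packages are importable. The sketch below is an illustration, not part of the project: the helper missing_packages is hypothetical, and the package list (torch, torchaudio, gradio, peft) is my assumption based on what this article installs and what webui.py calls.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The conda environment in this tutorial was created with python=3.8.
    print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
    # Assumed package list; adjust it to match requirements.txt.
    missing = missing_packages(["torch", "torchaudio", "gradio", "peft"])
    print("Missing packages:", missing if missing else "none")
```

Run it inside the activated environment. If the reported Python version is not 3.8, the wrong interpreter is probably on PATH.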

5. Model download:
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd
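The same four repositories can also be fetched from Python through the modelscope SDK instead of git. The sketch below assumes the modelscope package is installed and that its snapshot_download function accepts a local_dir argument, so check your installed version first. The helpers download_plan and download_all are hypothetical names used only for this example.

```python
MODEL_NAMES = [
    "CosyVoice-300M",
    "CosyVoice-300M-SFT",
    "CosyVoice-300M-Instruct",
    "CosyVoice-ttsfrd",
]


def download_plan(root="pretrained_models"):
    """Return (model_id, local_dir) pairs matching the git clone commands above."""
    return [("iic/" + name, root + "/" + name) for name in MODEL_NAMES]


def download_all(root="pretrained_models"):
    """Download every model in the plan (requires network access)."""
    # Imported lazily so the plan can be inspected without modelscope installed.
    from modelscope import snapshot_download

    for model_id, local_dir in download_plan(root):
        # Assumption: this modelscope version supports the local_dir argument.
        snapshot_download(model_id, local_dir=local_dir)


if __name__ == "__main__":
    for model_id, local_dir in download_plan():
        print(model_id, "->", local_dir)
```

Calling download_all() from the CosyVoice/ directory should leave the models in the same pretrained_models/ layout that the later steps expect.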

Optional: install ttsfrd
sudo apt install unzip
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip3 install ttsfrd-0.3.6-cp38-cp38-linux_x86_64.whl

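6. Run the web demo and forward the model's local service port 50000 to your own laptop so that you can use it remotely:
A. On your laptop, run the following command to forward port 50000 of the ECS instance to your local machine (substitute the actual IP and username): ssh -L50000:localhost:50000 ecs-user@<ECS-public-IP>
B. On the ECS instance, run: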
conda activate cosyvoice
cd CosyVoice/
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
C. In your local browser, open the web interface at: http://127.0.0.1:50000
You can now use the demo from your local machine.
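If the browser cannot load the page, a quick first check is whether anything is listening on the forwarded port at all. The sketch below uses only the Python standard library. The helper name port_is_open is made up for this example, and a True result only means that a TCP connection succeeded, not that the web UI has finished loading the model.

```python
import socket


def port_is_open(host="127.0.0.1", port=50000, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("Port 50000 reachable:", port_is_open())
```

Run it on your laptop while the ssh -L session from step A is still open. If it prints False, check the SSH tunnel first.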

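Key code walkthrough:

Analyzing the interfaces called in webui.py (together with the Basic Usage example in the README) is enough to understand the main features CosyVoice provides and how to use them. The main function is generate_audio, which offers four features:
1. Natural-language control (自然语言控制): uses the inference_instruct interface with the CosyVoice-300M-Instruct model. This model needs neither prompt audio nor prompt text. Inputs: a pretrained voice (sft_dropdown), such as 中文女 (Chinese female) or 中文男 (Chinese male), and the instruct text, which the user types into the interface.
output = cosyvoice.inference_instruct(tts_text, sft_dropdown, instruct_text)
2. Cross-lingual cloning (跨语种复刻): uses the inference_cross_lingual interface with the CosyVoice-300M model. The text to synthesize and the prompt text must be in different languages. Input: prompt audio.
output = cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k)
3. 3-second fast cloning (3s极速复刻): uses the inference_zero_shot interface with the CosyVoice-300M model. Inputs: prompt text and prompt audio.
output = cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_speech_16k)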

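4. Pretrained voice (预训练音色): uses the inference_sft interface with the CosyVoice-300M-SFT model. Input: a pretrained voice (sft_dropdown), such as 中文女 (Chinese female) or 中文男 (Chinese male).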
output = cosyvoice.inference_sft(tts_text, sft_dropdown)
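The four features above come down to a small lookup: each mode maps to one interface, one model, and a set of required inputs. The table below only restates the list above as data. The names MODES and missing_inputs are hypothetical and do not appear in webui.py.

```python
# Mode label (as shown in the web UI) -> interface, model, and required inputs.
MODES = {
    "预训练音色": {
        "api": "inference_sft",
        "model": "CosyVoice-300M-SFT",
        "needs": ["sft_dropdown"],
    },
    "3s极速复刻": {
        "api": "inference_zero_shot",
        "model": "CosyVoice-300M",
        "needs": ["prompt_text", "prompt_wav"],
    },
    "跨语种复刻": {
        "api": "inference_cross_lingual",
        "model": "CosyVoice-300M",
        "needs": ["prompt_wav"],
    },
    "自然语言控制": {
        "api": "inference_instruct",
        "model": "CosyVoice-300M-Instruct",
        "needs": ["sft_dropdown", "instruct_text"],
    },
}


def missing_inputs(mode, **inputs):
    """Return the required inputs for `mode` that are absent or empty."""
    return [name for name in MODES[mode]["needs"] if not inputs.get(name)]


if __name__ == "__main__":
    # Zero-shot cloning with prompt text but no prompt audio.
    print(missing_inputs("3s极速复刻", prompt_text="hello"))
```

Since every mode also takes tts_text, it is left out of the table.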

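Other parameters:
tts_text: the input text to synthesize into audio.
prompt_wav: audio that the user uploads or records directly in the interface, used for voice cloning. The audio's sample rate is checked against the required minimum, and the raw audio is preprocessed by the postprocess function before use.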

def generate_audio(tts_text, mode_checkbox_group, sft_dropdown, prompt_text, prompt_wav_upload, prompt_wav_record, instruct_text, seed):
    if prompt_wav_upload is not None:
        prompt_wav = prompt_wav_upload   # audio file uploaded by the user
    elif prompt_wav_record is not None:
        prompt_wav = prompt_wav_record   # audio recorded by the user in the web interface
    else:
        prompt_wav = None
    # if instruct mode, please make sure that model is iic/CosyVoice-300M-Instruct and not cross_lingual mode
    if mode_checkbox_group in ['自然语言控制']:
        if cosyvoice.frontend.instruct is False:
            gr.Warning('您正在使用自然语言控制模式, {}模型不支持此模式, 请使用iic/CosyVoice-300M-Instruct模型'.format(args.model_dir))
            return (target_sr, default_data)
        if instruct_text == '':
            gr.Warning('您正在使用自然语言控制模式, 请输入instruct文本')
            return (target_sr, default_data)
        if prompt_wav is not None or prompt_text != '':
            gr.Info('您正在使用自然语言控制模式, prompt音频/prompt文本会被忽略')
    # if cross_lingual mode, please make sure that model is iic/CosyVoice-300M and tts_text prompt_text are different language
    if mode_checkbox_group in ['跨语种复刻']:
        if cosyvoice.frontend.instruct is True:
            gr.Warning('您正在使用跨语种复刻模式, {}模型不支持此模式, 请使用iic/CosyVoice-300M模型'.format(args.model_dir))
            return (target_sr, default_data)
        if instruct_text != '':
            gr.Info('您正在使用跨语种复刻模式, instruct文本会被忽略')
        if prompt_wav is None:
            gr.Warning('您正在使用跨语种复刻模式, 请提供prompt音频')
            return (target_sr, default_data)
        gr.Info('您正在使用跨语种复刻模式, 请确保合成文本和prompt文本为不同语言')
    # if in zero_shot cross_lingual, please make sure that prompt_text and prompt_wav meets requirements
    if mode_checkbox_group in ['3s极速复刻', '跨语种复刻']:
        if prompt_wav is None:
            gr.Warning('prompt音频为空，您是否忘记输入prompt音频？')
            return (target_sr, default_data)
        if torchaudio.info(prompt_wav).sample_rate < prompt_sr:
            gr.Warning('prompt音频采样率{}低于{}'.format(torchaudio.info(prompt_wav).sample_rate, prompt_sr))
            return (target_sr, default_data)
    # sft mode only use sft_dropdown
    if mode_checkbox_group in ['预训练音色']:
        if instruct_text != '' or prompt_wav is not None or prompt_text != '':
            gr.Info('您正在使用预训练音色模式，prompt文本/prompt音频/instruct文本会被忽略！')
    # zero_shot mode only use prompt_wav prompt text
    if mode_checkbox_group in ['3s极速复刻']:
        if prompt_text == '':
            gr.Warning('prompt文本为空，您是否忘记输入prompt文本？')
            return (target_sr, default_data)
        if instruct_text != '':
            gr.Info('您正在使用3s极速复刻模式，预训练音色/instruct文本会被忽略！')

    if mode_checkbox_group == '预训练音色':
        logging.info('get sft inference request')
        set_all_random_seed(seed)
        output = cosyvoice.inference_sft(tts_text, sft_dropdown)
    elif mode_checkbox_group == '3s极速复刻':
        logging.info('get zero_shot inference request')
        prompt_speech_16k = postprocess(load_wav(prompt_wav, prompt_sr))  # preprocess the audio; the sample rate must be at least 16 kHz
        set_all_random_seed(seed)
        output = cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_speech_16k)
    elif mode_checkbox_group == '跨语种复刻':
        logging.info('get cross_lingual inference request')
        prompt_speech_16k = postprocess(load_wav(prompt_wav, prompt_sr))
        set_all_random_seed(seed)
        output = cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k)
    else:
        logging.info('get instruct inference request')
        set_all_random_seed(seed)
        output = cosyvoice.inference_instruct(tts_text, sft_dropdown, instruct_text)
    audio_data = output['tts_speech'].numpy().flatten()
    return (target_sr, audio_data)
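generate_audio rejects prompt audio whose sample rate is below prompt_sr, using torchaudio.info to read the rate. If you want to pre-check a WAV file before uploading it, the standard-library sketch below does a similar comparison. It assumes a minimum of 16000 Hz (based on the 16K comment in the listing) and only handles plain PCM WAV files. The names wav_sample_rate and prompt_rate_ok are hypothetical.

```python
import wave

# Assumed minimum, taken from the "at least 16K" comment in the listing above.
PROMPT_SR = 16000


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def prompt_rate_ok(path, min_rate=PROMPT_SR):
    """Return True if the file's sample rate is at least `min_rate`."""
    return wav_sample_rate(path) >= min_rate


if __name__ == "__main__":
    import sys

    for wav_path in sys.argv[1:]:
        print(wav_path, wav_sample_rate(wav_path), prompt_rate_ok(wav_path))
```

A file recorded at 8 kHz, for example, would fail this check and would need to be re-recorded at a higher rate.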
