FunAudioLLM
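FunAudioLLM is a family of foundation audio models made up of two models, SenseVoice and CosyVoice. The open-source code is at https://github.com/FunAudioLLM. SenseVoice handles speech recognition, while CosyVoice generates expressive, emotional read-aloud speech.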
Demo environments used:
https://www.modelscope.cn/studios/iic/SenseVoice
https://www.modelscope.cn/studios/iic/CosyVoice-300M
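The headline feature of the speech recognition is dialect recognition. I tried a clip in the Gansu dialect, and it was not recognized.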
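Next I uploaded a piece of music to see whether the model could pick up the musical pattern I had in mind. The clip was recognized, and the result also carried the corresponding tone and emotion labels, which was genuinely impressive.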
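Recognizing the tone and emotion of speech is the most important thing that separates a person from a machine. The model can also switch between languages depending on context, which is very impressive.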
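To make the language and emotion labels concrete, here is a minimal sketch of how a tagged transcript could be split into its parts. It assumes the raw output is a string that starts with `<|...|>` tags (language, emotion, audio event, and so on) followed by the text. That layout, the tag vocabularies, and the helper name `parse_tagged` are assumptions for illustration, not something confirmed by the demo pages above.

```python
import re

# Matches tags of the form <|something|>.
TAG = re.compile(r"<\|([^|]*)\|>")

# Assumed tag vocabularies; adjust them to whatever the model really emits.
LANGUAGES = {"zh", "en", "yue", "ja", "ko", "nospeech"}
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL",
            "DISGUSTED", "SURPRISED", "EMO_UNKNOWN"}


def parse_tagged(raw):
    """Split a tagged transcript into language, emotion, other tags, and text."""
    result = {
        "language": None,
        "emotion": None,
        "other_tags": [],
        # Everything that is not a tag is treated as the transcript text.
        "text": TAG.sub("", raw).strip(),
    }
    for tag in TAG.findall(raw):
        if tag in LANGUAGES and result["language"] is None:
            result["language"] = tag
        elif tag in EMOTIONS and result["emotion"] is None:
            result["emotion"] = tag
        else:
            result["other_tags"].append(tag)
    return result


if __name__ == "__main__":
    # Hypothetical sample string, written by hand for this sketch.
    sample = "<|en|><|HAPPY|><|Speech|><|withitn|>Thank you very much."
    print(parse_tagged(sample))
```

Keeping the tags as structured fields rather than stripping them makes it easy to route a clip by language or to flag emotional segments later.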
Speech generation:
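Several voices are available, and they can speak in different languages. The overall quality is very good and sounds much like a person reading aloud, which should help a great deal wherever machines need to speak. It suits different contexts and scenarios: automated phone calls, for example, can imitate a human speaker quite closely, while station announcements call for a more mechanical style of delivery.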
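Code analysis: both projects are implemented mainly in Python. Installing them is somewhat challenging, and the AI computation needs hardware support, which is a major pain point.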
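Since hardware support is the main hurdle, a quick check of what the machine offers can save a failed install. The sketch below assumes PyTorch is the framework in use (the CosyVoice example imports `torchaudio`) and falls back to the CPU when PyTorch or a CUDA device is missing. The helper name `pick_device` is hypothetical, and whether either project accepts such a device string is not covered here.

```python
def pick_device():
    """Return "cuda:0" when PyTorch sees a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no GPU backend to query.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Selected device:", pick_device())
```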
Usage is simple: import the corresponding module and call it.
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# SFT usage: list the built-in speakers, then synthesize with one of them
# (the method name list_avaliable_spks is spelled as in the source)
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
print(cosyvoice.list_avaliable_spks())
output = cosyvoice.inference_sft('你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?', '中文女')
torchaudio.save('sft.wav', output['tts_speech'], 22050)  # output is saved at 22050 Hz

# zero_shot usage: clone the voice in a 16 kHz prompt recording
# <|zh|><|en|><|jp|><|yue|><|ko|> for Chinese/English/Japanese/Cantonese/Korean
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
output = cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k)
torchaudio.save('zero_shot.wav', output['tts_speech'], 22050)

# cross_lingual usage: the text starts with a language tag such as <|en|>
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
output = cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k)
torchaudio.save('cross_lingual.wav', output['tts_speech'], 22050)

# instruct usage, support <laughter></laughter><strong></strong>[laughter][breath]
# the third argument is a natural-language description of the speaker
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
output = cosyvoice.inference_instruct('在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.')
torchaudio.save('instruct.wav', output['tts_speech'], 22050)
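The cross-lingual call above relies on a language tag at the start of the text. As a small sketch, the helper below prepends one of the five tags listed in the example's comment and rejects anything else. The function name `with_language_tag` and the validation behaviour are illustrative assumptions, not part of the CosyVoice package.

```python
# Language tags as listed in the zero_shot comment of the example above.
LANGUAGE_TAGS = {
    "zh": "<|zh|>",    # Chinese
    "en": "<|en|>",    # English
    "jp": "<|jp|>",    # Japanese
    "yue": "<|yue|>",  # Cantonese
    "ko": "<|ko|>",    # Korean
}


def with_language_tag(text, lang):
    """Prefix text with the tag for lang, leaving already-tagged text alone."""
    if lang not in LANGUAGE_TAGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    tag = LANGUAGE_TAGS[lang]
    if text.startswith(tag):
        return text
    return tag + text


if __name__ == "__main__":
    print(with_language_tag("And then later on, fully acquiring that company.", "en"))
```

The returned string can then be passed as the first argument of `inference_cross_lingual`, as in the example above.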