How to Replicate a High-Fidelity, Lifelike "You" in the Digital World

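Summary: Use Alibaba Cloud Intelligent Media Services (IMS) to train a digital human avatar and create a custom cloned voice, then use a Timeline to compose and produce videos, building a digital twin that both looks and sounds like you.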

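As virtual digital humans become more lifelike and interact more smoothly, they are used more and more in video content production. Can creating a "me" with my own voice and appearance, free from the limits of time and place, really be this simple?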


Let's start with a finished video produced with a customized digital human avatar and a cloned voice:


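The presenter's appearance and voice in the sample video were not filmed and edited. Instead, digital human avatar training and voice cloning produced the corresponding appearance and voice models, and the final video was generated by calling the IMS video editing and production API.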


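In this article, we walk through digital human avatar training and voice cloning step by step, with a detailed Timeline example for editing and production: three steps to replicate "me" in the digital world.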


1. Digital Human Avatar Training


First, we need to train a digital twin that reproduces your body movements, facial expressions, and lip movements with high fidelity.


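Digital human avatar customization consists of three steps: preparing training materials, submitting the training job, and obtaining an AvatarId.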


1.1 Prepare Training Materials


One training video (Video) and one portrait image (Portrait).
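Before uploading, it can help to confirm that the material folder holds exactly one training video and one portrait image. The sketch below is a minimal pre-flight check under stated assumptions: the function name check_avatar_materials is hypothetical, and the accepted file extensions are illustrative guesses, not the formats IMS documents.

```python
from pathlib import Path

# Assumed extensions for illustration only; check the IMS documentation for accepted formats.
VIDEO_EXTS = {".mp4", ".mov"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def check_avatar_materials(folder: str) -> list:
    """Return a list of problems; an empty list means the folder looks complete."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    videos = [p for p in files if p.suffix.lower() in VIDEO_EXTS]
    images = [p for p in files if p.suffix.lower() in IMAGE_EXTS]
    problems = []
    if len(videos) != 1:
        problems.append(f"expected 1 training video, found {len(videos)}")
    if len(images) != 1:
        problems.append(f"expected 1 portrait image, found {len(images)}")
    return problems
```

For example, check_avatar_materials("avatar_materials") (a hypothetical folder) returns an empty list when the folder contains one video file and one image file.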


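For detailed shooting guidelines on equipment, venue requirements, the subject's appearance, and the recording process, refer to the product documentation.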



1.2 Submit Training


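Submit the training job in the console: following the prompts, enter the digital human's name and description, upload the portrait image and the training video, and click "Start Customization" to begin training.

Console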

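The avatar accepts two types of material: with an alpha channel (no background, already matted) or without one (with background, not matted). Provide whichever suits your needs.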
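If you plan to submit matted material, you may want to confirm that exported stills really carry transparency. This sketch (hypothetical helper png_has_alpha) reads only the chunk headers defined by the PNG specification: color types 4 and 6 include an alpha channel, and a tRNS chunk adds simple transparency. It covers PNG images only; it does not inspect video containers, and it is not how the console validates uploads.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_has_alpha(data: bytes) -> bool:
    """True if the PNG declares an alpha channel (color type 4 or 6) or has a tRNS chunk."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    color_type = None
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"IHDR":
            color_type = body[9]  # width(4) height(4) bit_depth(1) color_type(1)
        elif ctype == b"tRNS":
            return True
        elif ctype in (b"IDAT", b"IEND"):
            break  # tRNS must appear before the first IDAT, so stop scanning here
        pos += 8 + length + 4
    return color_type in (4, 6)
```

Usage: png_has_alpha(Path("frame.png").read_bytes()) with pathlib.Path, where frame.png is a still exported from your footage.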



1.3 Obtain an AvatarId


When training completes, you receive the model's unique identifier, the AvatarId, which you can use to synthesize digital human videos.


You can also click the "Video Editing" button on the "Digital Humans" list page to generate a video with the avatar and edit it further.



The digital human video can be driven either by text (text-driven) or by audio (audio-driven).


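Taking text-driven synthesis as an example: enter the text, choose an official voice (or a voice model created through voice cloning; see step 2), and the digital human video is generated.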
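The same text-driven setup can also be expressed as a Timeline clip. The sketch below assumes the field names used in the sample Timeline in step 3 (Type, AvatarId, CustomizedVoice, Content, ClipId); the helpers avatar_clip and minimal_timeline are hypothetical, and choosing an official voice instead of a cloned one would need a field this article does not show.

```python
import json


def avatar_clip(avatar_id: str, voice_id: str, text: str, clip_id: str) -> dict:
    """Text-driven AI_Avatar clip, using the field names from the sample Timeline."""
    if not text.strip():
        raise ValueError("a text-driven clip needs non-empty Content")
    return {
        "Type": "AI_Avatar",
        "AvatarId": avatar_id,        # AvatarId from step 1.3
        "CustomizedVoice": voice_id,  # VoiceId from step 2.3
        "Content": text,              # script the avatar speaks
        "ClipId": clip_id,            # other clips can refer to this ID
    }


def minimal_timeline(*clips: dict) -> dict:
    """Place the given clips on a single video track."""
    return {"VideoTracks": [{"VideoTrackClips": list(clips)}]}


timeline = minimal_timeline(
    avatar_clip("Avatar-******", "Voice-******", "数字人的应用范围极其广泛", "avatar1"))
print(json.dumps(timeline, ensure_ascii=False, indent=2))
```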



Synthesized digital human video


2. Voice Cloning


With a lifelike avatar in hand, we also need an equally lifelike voice to make the digital twin more vivid and complete.


Voice cloning consists of three steps: preparing training materials, submitting the training job, and obtaining a VoiceId.


2.1 Prepare Training Materials


Several training audio clips (Audio) and one security authentication audio clip (Authentication).
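To review the recordings before uploading, the sketch below reports each clip's duration, sample rate, and channel count with Python's standard wave module. It assumes WAV files; the helper names inspect_wav and summarize are hypothetical, and it applies no thresholds because this article does not state the required durations or formats.

```python
import wave
from dataclasses import dataclass


@dataclass
class AudioInfo:
    path: str
    seconds: float
    sample_rate: int
    channels: int


def inspect_wav(path: str) -> AudioInfo:
    """Read the WAV header and compute the clip duration."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return AudioInfo(path, w.getnframes() / rate, rate, w.getnchannels())


def summarize(training: list, authentication: str) -> dict:
    """Aggregate the training clips and the authentication clip for a quick review."""
    infos = [inspect_wav(p) for p in training]
    auth = inspect_wav(authentication)
    return {
        "training_clips": len(infos),
        "training_seconds": round(sum(i.seconds for i in infos), 2),
        "auth_seconds": round(auth.seconds, 2),
        "sample_rates": sorted({i.sample_rate for i in infos + [auth]}),
    }
```

Mixed sample rates in the summary are a hint to re-export the recordings consistently before submitting them.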


2.2 Submit Training


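Submit the training job in the console: enter the voice name, upload the authentication audio and the training audio, and click "Start Customization" to begin training.

Console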



2.3 Obtain a VoiceId


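When training completes, you receive the model's unique identifier, the VoiceId. To synthesize speech with this voice model, click the "Speech Synthesis" button on the "Voice Cloning - General Edition" list page.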


Voice Cloning - General Edition list page


Enter the text and click "Start Preview" to submit an intelligent speech synthesis job.



Here is the result of speech synthesis with the cloned voice:

Speech synthesis sample audio 1


For comparison, here is the original audio submitted for training:

Original training voice (excerpt)


3. One-Click "Replication"


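Once you have the AvatarId of your avatar model and the VoiceId of your custom voice, you are not limited to submitting synthesis jobs in the console as described above: a single Timeline submission can generate the complete video, a one-click "replication".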


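A sample Timeline follows. The on-screen titles and the narration (Content) are in Chinese, matching the sample video, and the AvatarId and VoiceId values are masked: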


```json
{
  "VideoTracks": [{
    "VideoTrackClips": [{
      "Type": "GlobalImage",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image1.jpg",
      "Height": 1920,
      "Width": 1080
    }]
  }, {
    "VideoTrackClips": [{
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image2.png",
      "Height": 1920,
      "Width": 1080,
      "Duration": 2,
      "Effects": [{
        "Type": "Text",
        "Content": "什么是数字人和人声克隆?",
        "Alignment": "CenterCenter",
        "FontSize": 80,
        "EffectColorStyle": "CS0001-000001"
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "数字人和人声克隆是两个前沿的技术概念\n它们代表着数字技术在模拟人类外观和声音方面的最新成就",
      "ClipId": "avatar1",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image2.png",
      "Height": 1920,
      "Width": 1080,
      "Duration": 2,
      "Effects": [{
        "Type": "Text",
        "Content": "数字人",
        "Alignment": "CenterCenter",
        "FontSize": 150,
        "EffectColorStyle": "CS0001-000001"
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "数字人\n通常指由计算机生成的虚拟人类形象\n这些形象可以是二维(2D)或三维(3D)的\n具有与现实人类相似的外观、动作和行为\n随着图形渲染技术的进步\n数字人越来越能够以高度逼真的方式呈现\n包括复杂的面部表情、肢体动作\n并能在虚拟环境中以自然的方式行动",
      "ClipId": "avatar2",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image2.png",
      "Height": 1920,
      "Width": 1080,
      "Duration": 2,
      "Effects": [{
        "Type": "Text",
        "Content": "数字人的应用",
        "Alignment": "CenterCenter",
        "FontSize": 120,
        "EffectColorStyle": "CS0001-000001"
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "数字人的应用范围极其广泛",
      "ClipId": "avatar3",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "从电子游戏、电影和电视制作",
      "ClipId": "avatar4",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "到增强现实(AR)和虚拟现实(VR)体验等",
      "ClipId": "avatar5",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "到在线教育、模拟训练、客户服务和健康护理",
      "ClipId": "avatar6",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "在这些应用中\n数字人可以作为用户的虚拟代表\n或者作为虚拟助手和顾问\n提供帮助和咨询\n在娱乐产业中",
      "ClipId": "avatar7",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "数字人可以被用来创造虚构角色\n甚至在某些情况下代替真实的演员进行表演",
      "ClipId": "avatar8",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "而在商业领域\n数字人可以用作品牌大使或虚拟员工\n增加用户互动的吸引力",
      "ClipId": "avatar9",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "在教育和训练场景中\n数字人能够模拟不同的情境\n提供更加丰富和互动的学习体验",
      "ClipId": "avatar10",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image2.png",
      "Height": 1920,
      "Width": 1080,
      "Duration": 2,
      "Effects": [{
        "Type": "Text",
        "Content": "人声克隆",
        "Alignment": "CenterCenter",
        "FontSize": 150,
        "EffectColorStyle": "CS0001-000001"
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "人声克隆是指使用计算机算法模仿特定人的声音\n这项技术通过分析一个人的语音记录\n捕捉其独特的声音特征\n如音高、音色、语速和口音\n并创建一个可以产生类似声音的模型\n人声克隆技术往往建立在深度学习和神经网络的基础上\n通过大量的声音训练数据\n使得合成的声音越来越难以与原声区分",
      "ClipId": "avatar11",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image2.png",
      "Height": 1920,
      "Width": 1080,
      "Duration": 2,
      "Effects": [{
        "Type": "Text",
        "Content": "人声克隆的应用",
        "Alignment": "CenterCenter",
        "FontSize": 120,
        "EffectColorStyle": "CS0001-000001"
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "人声克隆可以用在多个方面",
      "ClipId": "avatar12",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "例如为无法亲自录音的艺术家复原声音",
      "ClipId": "avatar13",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "或为语音辅助设备提供更自然的语音输出",
      "ClipId": "avatar14",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "它也可以帮助有语音障碍的人恢复他们的声音",
      "ClipId": "avatar15",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "或用于个性化的语音合成服务\n如定制语音导航或个人助理",
      "ClipId": "avatar16",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "人声克隆技术可以为数字人提供声线\n使得虚拟角色不仅在视觉上\n还在听觉上都显得栩栩如生\n具有高度自然语音的数字人能够提供更加动态和亲切的交互体验\n从而在各种虚拟场景中担任重要角色",
      "ClipId": "avatar17",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image2.png",
      "Height": 1920,
      "Width": 1080,
      "Duration": 2,
      "Effects": [{
        "Type": "Text",
        "Content": "总结",
        "Alignment": "CenterCenter",
        "FontSize": 150,
        "EffectColorStyle": "CS0001-000001"
      }]
    }, {
      "Type": "AI_Avatar",
      "AvatarId": "Avatar-******",
      "CustomizedVoice": "Voice-******",
      "Content": "结合数字人和人声克隆技术\n我们可以创造出能够在屏幕上以二维形象出现并以逼真的人声进行沟通的虚拟代表\n这种结合提供了丰富的用户体验\n并在教育、娱乐、客户服务等各种场景中拥有潜在的应用价值",
      "ClipId": "avatar18",
      "Effects": [{
        "Type": "AI_ASR",
        "FontSize": 60,
        "Alignment": "TopCenter",
        "Y": 1670,
        "EffectColorStyle": "CS0001-000007",
        "AdaptMode": "AutoWrap",
        "TextWidth": 0.8
      }]
    }]
  }, {
    "VideoTrackClips": [{
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image3.png",
      "ReferenceClipId": "avatar4"
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image4.png",
      "ReferenceClipId": "avatar5"
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image5.png",
      "ReferenceClipId": "avatar6"
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image6.png",
      "ReferenceClipId": "avatar8"
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image7.png",
      "ReferenceClipId": "avatar9"
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image8.png",
      "ReferenceClipId": "avatar10"
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image9.png",
      "ReferenceClipId": "avatar13"
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image10.png",
      "ReferenceClipId": "avatar14"
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image11.png",
      "ReferenceClipId": "avatar15"
    }, {
      "Type": "Image",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image12.png",
      "ReferenceClipId": "avatar16"
    }]
  }, {
    "VideoTrackClips": [{
      "Type": "GlobalImage",
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/image13.png",
      "Width": 312,
      "Height": 72,
      "X": 0,
      "Y": 1848
    }]
  }],
  "AudioTracks": [{
    "AudioTrackClips": [{
      "MediaURL": "https://your-bucket.oss-cn-shanghai.aliyuncs.com/audio1.wav",
      "LoopMode": true,
      "Effects": [{
        "Type": "Volume",
        "Gain": "0.2"
      }]
    }]
  }]
}
```
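Much of the Timeline above is repetition: every avatar clip carries the same AI_ASR subtitle effect, and each section opens with a 2-second title card. A minimal sketch that generates the same structure from a list of sections follows. The helper names (title_card, avatar, build_timeline) are hypothetical, the URLs and IDs are the placeholders from the sample, and only fields that appear in the sample are used.

```python
import json

# Placeholders copied from the sample Timeline; replace with your own values.
AVATAR_ID = "Avatar-******"
VOICE_ID = "Voice-******"
BUCKET = "https://your-bucket.oss-cn-shanghai.aliyuncs.com"

# Subtitle effect shared by every avatar clip in the sample.
SUBTITLE = {
    "Type": "AI_ASR", "FontSize": 60, "Alignment": "TopCenter", "Y": 1670,
    "EffectColorStyle": "CS0001-000007", "AdaptMode": "AutoWrap", "TextWidth": 0.8,
}


def title_card(text: str, font_size: int) -> dict:
    """A 2-second full-screen image with centered title text."""
    return {
        "Type": "Image", "MediaURL": f"{BUCKET}/image2.png",
        "Height": 1920, "Width": 1080, "Duration": 2,
        "Effects": [{"Type": "Text", "Content": text, "Alignment": "CenterCenter",
                     "FontSize": font_size, "EffectColorStyle": "CS0001-000001"}],
    }


def avatar(text: str, clip_id: str) -> dict:
    return {
        "Type": "AI_Avatar", "AvatarId": AVATAR_ID, "CustomizedVoice": VOICE_ID,
        "Content": text, "ClipId": clip_id,
        "Effects": [dict(SUBTITLE)],  # copy so clips do not share one dict
    }


def build_timeline(sections, overlays) -> dict:
    """sections: list of (title or None, font_size, [narration, ...]).
    overlays: {clip_id: image URL}, each image tied to that avatar clip via ReferenceClipId."""
    main, n = [], 0
    for title, font_size, segments in sections:
        if title:
            main.append(title_card(title, font_size))
        for text in segments:
            n += 1
            main.append(avatar(text, f"avatar{n}"))
    return {
        "VideoTracks": [
            {"VideoTrackClips": [{"Type": "GlobalImage", "MediaURL": f"{BUCKET}/image1.jpg",
                                  "Height": 1920, "Width": 1080}]},
            {"VideoTrackClips": main},
            {"VideoTrackClips": [{"Type": "Image", "MediaURL": url, "ReferenceClipId": cid}
                                 for cid, url in overlays.items()]},
            {"VideoTrackClips": [{"Type": "GlobalImage", "MediaURL": f"{BUCKET}/image13.png",
                                  "Width": 312, "Height": 72, "X": 0, "Y": 1848}]},
        ],
        "AudioTracks": [{"AudioTrackClips": [{
            "MediaURL": f"{BUCKET}/audio1.wav", "LoopMode": True,
            "Effects": [{"Type": "Volume", "Gain": "0.2"}],
        }]}],
    }


sections = [
    ("什么是数字人和人声克隆?", 80, ["数字人和人声克隆是两个前沿的技术概念"]),
    ("数字人的应用", 120, ["数字人的应用范围极其广泛", "从电子游戏、电影和电视制作"]),
]
print(json.dumps(build_timeline(sections, {"avatar3": f"{BUCKET}/image3.png"}),
                 ensure_ascii=False, indent=2))
```

Keeping the subtitle style in one constant means a style change touches one line instead of eighteen; json.dumps with ensure_ascii=False keeps the Chinese narration readable in the serialized Timeline.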


API documentation:

1. SubmitAvatarVideoJob - submit a digital human video synthesis job

2. SubmitAudioProduceJob - submit an intelligent speech job

3. SubmitMediaProducingJob - submit an editing and production job
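When submitting through SubmitMediaProducingJob rather than the console, a cheap local check is to confirm that every ReferenceClipId on the overlay track points at an existing ClipId. The sketch below does that and then assembles request parameters. It assumes the parameter names Timeline and OutputMediaConfig, each passed as a JSON string; verify both against the SubmitMediaProducingJob reference listed above. The helper names are hypothetical, and the actual SDK call is deliberately left out.

```python
import json


def clip_ids(timeline: dict) -> set:
    """All ClipId values defined on the video tracks."""
    return {clip["ClipId"]
            for track in timeline.get("VideoTracks", [])
            for clip in track.get("VideoTrackClips", [])
            if "ClipId" in clip}


def dangling_references(timeline: dict) -> list:
    """ReferenceClipId values that do not match any ClipId in the Timeline."""
    known = clip_ids(timeline)
    return [clip["ReferenceClipId"]
            for track in timeline.get("VideoTracks", [])
            for clip in track.get("VideoTrackClips", [])
            if "ReferenceClipId" in clip and clip["ReferenceClipId"] not in known]


def submit_params(timeline: dict, output_url: str, width: int = 1080, height: int = 1920) -> dict:
    """Assemble request parameters (names assumed; check the API reference)."""
    missing = dangling_references(timeline)
    if missing:
        raise ValueError(f"unknown ReferenceClipId(s): {missing}")
    return {
        # Assumption: both parameters are JSON strings. ensure_ascii=False keeps Chinese text readable.
        "Timeline": json.dumps(timeline, ensure_ascii=False),
        "OutputMediaConfig": json.dumps({"MediaURL": output_url, "Width": width, "Height": height}),
    }
```

The default 1080 x 1920 output matches the portrait canvas used by the clips in the sample Timeline.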


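Notably, the narration script, images, and music in this example were all produced by generative models. In other words, the only input to the video was a single sentence: "Please introduce digital humans and voice cloning technology."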


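Stay tuned for the upcoming "One-Click Video Creation" feature of Cloud Intelligent Editing: with just a word or a sentence and minimal configuration, it will generate a high-quality video.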


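IMS Cloud Intelligent Editing is a video editing and production service built on cloud computing and AI. It provides core features such as live stream editing, video editing, a template factory, and digital human production, and supports AI-assisted editing. It can be applied across industries including internet, culture and media, advertising and marketing, education, and finance, meeting enterprises' needs for large-scale, efficient, convenient, and intelligent video content production.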


Questions are welcome in the official DingTalk Q&A group: 48335001108

Author: 橙鲤

See the "Digital Human and Voice Cloning" product documentation
