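Technical Background
With the rise of intelligent digital humans and AI digital humans, more and more companies are building digitally synthesized virtual avatars such as holographic and photorealistic digital characters. Combining "virtual avatar + voice interaction (TTS, ASR) + natural language understanding (NLU) + deep learning", they target scenarios such as digital customer service, virtual exhibition hall guides, smart cities, smart healthcare, and smart education. Visual, voice-driven human-machine interaction frees staff from basic manual work, lowers operating costs, and improves the interactive experience.
A digital human with "warmth" is made up of several dimensions, such as image recognition, speech recognition, and semantic understanding. This article focuses on how to encode and transmit such a digital human so that it reaches the user with lower latency and a good experience.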
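Technical Implementation
Taking the Windows platform as an example, this article looks at real-time encoding and transmission for digital humans from a technical angle. First, the overall picture:
On the left, Unity captures the video Texture and AudioClip data, encodes and packages it, and pushes it to the server over RTMP. At the lower right, a player pulls the RTMP stream in real time. End-to-end latency is at the millisecond level.
Video capture covers three kinds of source: Texture data obtained from Unity, the camera, and the screen: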
    public void SelVideoPushType(int type)
    {
        switch (type)
        {
            case 0:
                // Capture the Unity window
                video_push_type_ = (uint)NTSmartPublisherDefine.NT_PB_E_VIDEO_OPTION.NT_PB_E_VIDEO_OPTION_LAYER;
                break;
            case 1:
                // Capture the camera
                video_push_type_ = (uint)NTSmartPublisherDefine.NT_PB_E_VIDEO_OPTION.NT_PB_E_VIDEO_OPTION_CAMERA;
                break;
            case 2:
                // Capture the screen
                video_push_type_ = (uint)NTSmartPublisherDefine.NT_PB_E_VIDEO_OPTION.NT_PB_E_VIDEO_OPTION_SCREEN;
                break;
            case 3:
                // Do not capture video
                video_push_type_ = (uint)NTSmartPublisherDefine.NT_PB_E_VIDEO_OPTION.NT_PB_E_VIDEO_OPTION_NO_VIDEO;
                break;
        }

        Debug.Log("SelVideoPushType type: " + type + " video_push_type: " + video_push_type_);
    }
Audio capture covers AudioClip sound, the microphone, the speaker, and a mix of two AudioClip sources:
    public void SelAudioPushType(int type)
    {
        switch (type)
        {
            case 0:
                // Capture Unity sound
                audio_push_type_ = (uint)NTSmartPublisherDefine.NT_PB_E_AUDIO_OPTION.NT_PB_E_AUDIO_OPTION_EXTERNAL_PCM_DATA;
                break;
            case 1:
                // Capture the microphone
                audio_push_type_ = (uint)NTSmartPublisherDefine.NT_PB_E_AUDIO_OPTION.NT_PB_E_AUDIO_OPTION_CAPTURE_MIC;
                break;
            case 2:
                // Capture the speaker
                audio_push_type_ = (uint)NTSmartPublisherDefine.NT_PB_E_AUDIO_OPTION.NT_PB_E_AUDIO_OPTION_CAPTURE_SPEAKER;
                break;
            case 3:
                // Mix two Unity AudioClip sources
                audio_push_type_ = (uint)NTSmartPublisherDefine.NT_PB_E_AUDIO_OPTION.NT_PB_E_AUDIO_OPTION_TWO_EXTERNAL_PCM_MIXER;
                break;
            case 4:
                // Do not capture audio
                audio_push_type_ = (uint)NTSmartPublisherDefine.NT_PB_E_AUDIO_OPTION.NT_PB_E_AUDIO_OPTION_NO_AUDIO;
                break;
        }

        Debug.Log("SelAudioPushType type: " + type + " audio_push_type: " + audio_push_type_);
    }
To make latency easy to test, a simple date-and-time readout is refreshed on the page:
    // Show the current time
    DateTime now = DateTime.Now; // read once so every field comes from the same instant
    GameObject.Find("Canvas/Panel/LableText").GetComponent<Text>().text =
        string.Format("{0:D2}:{1:D2}:{2:D2}:{3:D3} {4:D4}/{5:D2}/{6:D2}",
            now.Hour, now.Minute, now.Second, now.Millisecond, // milliseconds padded to three digits
            now.Year, now.Month, now.Day);
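One way to use this readout is to photograph the publishing window and the player window together, then subtract the time shown in the player from the time shown in the publisher. The helper below is only a sketch of that arithmetic: `LatencyProbe`, `ParseStamp`, and `LatencyMs` are hypothetical names, and the stamp layout is assumed to be the "hour:minute:second:millisecond year/month/day" string produced above.

```csharp
using System;

// Hypothetical helper, not part of the SDK: turns two on-screen stamps into a latency figure.
public static class LatencyProbe
{
    // Parses a stamp such as "14:03:07:250 2024/01/31".
    public static DateTime ParseStamp(string stamp)
    {
        string[] halves = stamp.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (halves.Length != 2)
            throw new FormatException("Expected '<time> <date>' but got: " + stamp);

        string[] t = halves[0].Split(':'); // hour, minute, second, millisecond
        string[] d = halves[1].Split('/'); // year, month, day
        if (t.Length != 4 || d.Length != 3)
            throw new FormatException("Unexpected stamp layout: " + stamp);

        return new DateTime(
            int.Parse(d[0]), int.Parse(d[1]), int.Parse(d[2]),
            int.Parse(t[0]), int.Parse(t[1]), int.Parse(t[2]), int.Parse(t[3]));
    }

    // Publisher time minus player time, in milliseconds.
    public static double LatencyMs(string publisherStamp, string playerStamp)
    {
        return (ParseStamp(publisherStamp) - ParseStamp(playerStamp)).TotalMilliseconds;
    }
}
```

The resolution of this measurement is limited by how often the readout is refreshed and by the video frame interval, so it is best treated as an estimate.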
For Unity window or Camera capture, the data comes from a Texture. The resulting RGB data is delivered to the wrapper layer, which handles encoding and transmission.
    if (texture_ == null || video_width_ != Screen.width || video_height_ != Screen.height)
    {
        Debug.Log("OnPostRender screen changed++ scr_width: " + Screen.width + " scr_height: " + Screen.height);

        if (screen_image_ != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(screen_image_);
            screen_image_ = IntPtr.Zero;
        }

        if (texture_ != null)
        {
            UnityEngine.Object.Destroy(texture_);
            texture_ = null;
        }

        video_width_ = Screen.width;
        video_height_ = Screen.height;

        texture_ = new Texture2D(video_width_, video_height_, TextureFormat.BGRA32, false);
        screen_image_ = Marshal.AllocHGlobal(video_width_ * 4 * video_height_);

        Debug.Log("OnPostRender screen changed--");
        return;
    }

    texture_.ReadPixels(new Rect(0, 0, video_width_, video_height_), 0, 0, false);
    texture_.Apply();
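Camera and screen capture can be implemented directly in the wrapper layer. If a preview is needed, pass the data back to Unity and refresh a Texture shown in a RawImage in real time.
Preview through the wrapper layer: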
    public bool StartPreview()
    {
        if (CheckPublisherHandleAvailable() == false)
            return false;

        video_preview_image_callback_ = new NT_PB_SDKVideoPreviewImageCallBack(SDKVideoPreviewImageCallBack);

        NTSmartPublisherSDK.NT_PB_SetVideoPreviewImageCallBack(publisher_handle_,
            (int)NTSmartPublisherDefine.NT_PB_E_IMAGE_FORMAT.NT_PB_E_IMAGE_FORMAT_RGB32,
            IntPtr.Zero, video_preview_image_callback_);

        if (NTBaseCodeDefine.NT_ERC_OK != NTSmartPublisherSDK.NT_PB_StartPreview(publisher_handle_, 0, IntPtr.Zero))
        {
            if (0 == publisher_handle_count_)
            {
                NTSmartPublisherSDK.NT_PB_Close(publisher_handle_);
                publisher_handle_ = IntPtr.Zero;
            }

            return false;
        }

        publisher_handle_count_++;
        is_previewing_ = true;

        return true;
    }

    public void StopPreview()
    {
        if (is_previewing_ == false)
            return;

        is_previewing_ = false;
        publisher_handle_count_--;

        NTSmartPublisherSDK.NT_PB_StopPreview(publisher_handle_);

        if (0 == publisher_handle_count_)
        {
            NTSmartPublisherSDK.NT_PB_Close(publisher_handle_);
            publisher_handle_ = IntPtr.Zero;
        }
    }
Preview data callback:
    // Preview data callback
    public void SDKVideoPreviewImageCallBack(IntPtr handle, IntPtr user_data, IntPtr image)
    {
        NT_PB_Image pb_image = (NT_PB_Image)Marshal.PtrToStructure(image, typeof(NT_PB_Image));

        NT_VideoFrame pVideoFrame = new NT_VideoFrame();
        pVideoFrame.width_ = pb_image.width_;
        pVideoFrame.height_ = pb_image.height_;
        pVideoFrame.stride_ = pb_image.stride_[0];

        // Copy the first plane out of native memory into a managed buffer
        Int32 argb_size = pb_image.stride_[0] * pb_image.height_;
        pVideoFrame.plane_data_ = new byte[argb_size];

        if (argb_size > 0)
        {
            Marshal.Copy(pb_image.plane_[0], pVideoFrame.plane_data_, 0, argb_size);
        }

        {
            cur_image_ = pVideoFrame;
        }
    }
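For audio, the main job in a Unity environment is capturing AudioClip data. Note the PCM send interval: data is sent once every 10 milliseconds. Because an AudioClip may be only ten-odd seconds or a few minutes long, you also need to decide what happens once the whole clip has been captured and played: loop it, switch to silent frames, or send video only with no audio.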
    var pcm_data = new PCMData();
    pcm_data.sample_rate_ = audio_clip_info_.audio_clip_.frequency;
    pcm_data.channels_ = audio_clip_info_.audio_clip_.channels;
    pcm_data.per_channel_sample_number_ = pcm_data.sample_rate_ / 100; // samples per channel in 10 ms

    var pcm_sample = new float[pcm_data.sample_rate_ * pcm_data.channels_ / 100];
    audio_clip_info_.audio_clip_.GetData(pcm_sample, audio_clip_info_.audio_clip_offset_);

    // Copy the managed samples into native memory for the wrapper
    var sample_length = sizeof(float) * pcm_sample.Length;
    pcm_data.data_ = Marshal.AllocHGlobal(sample_length);
    Marshal.Copy(pcm_sample, 0, pcm_data.data_, pcm_sample.Length);
    pcm_data.size_ = (uint)sample_length;

    publisher_wrapper_.OnPostAudioPCMFloatData(pcm_data.data_, pcm_data.size_, pcm_time_stamp_,
        pcm_data.sample_rate_, pcm_data.channels_, pcm_data.per_channel_sample_number_);

    Marshal.FreeHGlobal(pcm_data.data_);
    pcm_data.data_ = IntPtr.Zero;
    pcm_data = null;

    pcm_time_stamp_ += 10; // advance the timestamp by 10 ms
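The loop-or-silence decision described above can be sketched independently of Unity and the SDK. In the following minimal example, `PcmChunker` and `EndOfClipPolicy` are hypothetical names, the input is assumed to be interleaved float PCM, and the sample rate is assumed to be a multiple of 100 so that a 10 ms chunk holds a whole number of samples.

```csharp
using System;

// Hypothetical policy for what to send once the clip runs out.
public enum EndOfClipPolicy { Loop, Silence }

// Hypothetical helper: slices interleaved float PCM into 10 ms chunks.
public sealed class PcmChunker
{
    private readonly float[] samples_;        // interleaved samples of the whole clip
    private readonly int chunkLength_;        // floats per 10 ms chunk (all channels)
    private readonly EndOfClipPolicy policy_;
    private int offset_;                      // read position in samples_, counted in floats

    public PcmChunker(float[] interleaved, int sampleRate, int channels, EndOfClipPolicy policy)
    {
        if (interleaved == null) throw new ArgumentNullException("interleaved");
        if (channels <= 0) throw new ArgumentOutOfRangeException("channels");
        if (sampleRate <= 0 || sampleRate % 100 != 0)
            throw new ArgumentException("sampleRate must be a positive multiple of 100", "sampleRate");

        samples_ = interleaved;
        chunkLength_ = sampleRate / 100 * channels;
        policy_ = policy;
    }

    // Returns the next 10 ms of audio. Past the end of the clip the chunk is
    // either refilled from the start (Loop) or left at zero (Silence).
    public float[] NextChunk()
    {
        var chunk = new float[chunkLength_]; // zero-initialised, i.e. silence
        for (int i = 0; i < chunkLength_; i++)
        {
            if (offset_ >= samples_.Length)
            {
                if (policy_ == EndOfClipPolicy.Silence || samples_.Length == 0)
                    break;
                offset_ = 0; // wrap around and keep filling
            }
            chunk[i] = samples_[offset_++];
        }
        return chunk;
    }
}
```

Each returned chunk corresponds to one 10 ms delivery, so the caller would still advance its own timestamp by 10 after every call, as the snippet above does.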
For two-way mixing, just load a second AudioClip from Resources and deliver it as well:
    audio_clip_info_mix_ = new AudioClipInfo();
    audio_clip_info_mix_.audio_clip_ = Resources.Load("AudioData/music") as AudioClip;
Deliver the data with the following interface:
    publisher_wrapper_.OnPostAudioExternalPCMFloatMixerData(pcm_data_mix.data_, pcm_data_mix.size_,
        pcm_time_stamp_mix_, pcm_data_mix.sample_rate_, pcm_data_mix.channels_,
        pcm_data_mix.per_channel_sample_number_);
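With the two-source mixer option, the mixing itself is left to the lower layer. For readers unfamiliar with what mixing float PCM means, the sketch below shows the simplest form: add the two streams sample by sample and clamp the result to the [-1, 1] range. This is an illustration only and is not claimed to be how the SDK mixes; `PcmMix` is a hypothetical name, and both inputs are assumed to share the same sample rate, channel count, and length.

```csharp
using System;

// Hypothetical illustration of additive mixing, not the SDK's implementation.
public static class PcmMix
{
    // Sums two equally long float PCM buffers and clamps each sample to [-1, 1].
    public static float[] Mix(float[] a, float[] b)
    {
        if (a == null) throw new ArgumentNullException("a");
        if (b == null) throw new ArgumentNullException("b");
        if (a.Length != b.Length)
            throw new ArgumentException("Both buffers must have the same length");

        var mixed = new float[a.Length];
        for (int i = 0; i < a.Length; i++)
        {
            float sum = a[i] + b[i];
            mixed[i] = Math.Max(-1f, Math.Min(1f, sum)); // hard clamp to avoid overflow
        }
        return mixed;
    }
}
```

Hard clamping is the crudest choice. Scaling each source down before summing is a common alternative when both sources are loud.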
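Once captured, the data is delivered as layers. After the audio and video encoding parameters are set, the underlying layer performs the encoding: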
    /*
     * nt_publisher_wrapper.cs
     * nt_publisher_wrapper
     *
     * Github: https://github.com/daniulive/SmarterStreaming
     *
     * Created by DaniuLive on 2017/11/14.
     */
    private void SetCommonOptionToPublisherSDK()
    {
        if (!IsPublisherHandleAvailable())
        {
            Debug.Log("SetCommonOptionToPublisherSDK, publisher handle with null..");
            return;
        }

        NTSmartPublisherSDK.NT_PB_ClearLayersConfig(publisher_handle_, 0, 0, IntPtr.Zero);

        if (video_option_ == (uint)NTSmartPublisherDefine.NT_PB_E_VIDEO_OPTION.NT_PB_E_VIDEO_OPTION_LAYER)
        {
            // Layer 0 is a filled RGBA rectangle whose purpose is to keep the frame rate steady; fill it with solid black
            int red = 0;
            int green = 0;
            int blue = 0;
            int alpha = 255;

            NT_PB_RGBARectangleLayerConfig rgba_layer_c0 = new NT_PB_RGBARectangleLayerConfig();

            rgba_layer_c0.base_.type_ = (Int32)NTSmartPublisherDefine.NT_PB_E_LAYER_TYPE.NT_PB_E_LAYER_TYPE_RGBA_RECTANGLE;
            rgba_layer_c0.base_.index_ = 0;
            rgba_layer_c0.base_.enable_ = 1;
            rgba_layer_c0.base_.region_.x_ = 0;
            rgba_layer_c0.base_.region_.y_ = 0;
            rgba_layer_c0.base_.region_.width_ = video_width_;
            rgba_layer_c0.base_.region_.height_ = video_height_;

            rgba_layer_c0.base_.offset_ = Marshal.OffsetOf(rgba_layer_c0.GetType(), "base_").ToInt32();
            rgba_layer_c0.base_.cb_size_ = (uint)Marshal.SizeOf(rgba_layer_c0);

            rgba_layer_c0.red_ = System.BitConverter.GetBytes(red)[0];
            rgba_layer_c0.green_ = System.BitConverter.GetBytes(green)[0];
            rgba_layer_c0.blue_ = System.BitConverter.GetBytes(blue)[0];
            rgba_layer_c0.alpha_ = System.BitConverter.GetBytes(alpha)[0];

            IntPtr rgba_conf = Marshal.AllocHGlobal(Marshal.SizeOf(rgba_layer_c0));
            Marshal.StructureToPtr(rgba_layer_c0, rgba_conf, true);

            UInt32 rgba_r = NTSmartPublisherSDK.NT_PB_AddLayerConfig(publisher_handle_, 0, rgba_conf,
                (int)NTSmartPublisherDefine.NT_PB_E_LAYER_TYPE.NT_PB_E_LAYER_TYPE_RGBA_RECTANGLE,
                0, IntPtr.Zero);

            Marshal.FreeHGlobal(rgba_conf);

            // Layer 1 carries the external video frames delivered from Unity
            NT_PB_ExternalVideoFrameLayerConfig external_layer_c1 = new NT_PB_ExternalVideoFrameLayerConfig();

            external_layer_c1.base_.type_ = (Int32)NTSmartPublisherDefine.NT_PB_E_LAYER_TYPE.NT_PB_E_LAYER_TYPE_EXTERNAL_VIDEO_FRAME;
            external_layer_c1.base_.index_ = 1;
            external_layer_c1.base_.enable_ = 1;
            external_layer_c1.base_.region_.x_ = 0;
            external_layer_c1.base_.region_.y_ = 0;
            external_layer_c1.base_.region_.width_ = video_width_;
            external_layer_c1.base_.region_.height_ = video_height_;

            external_layer_c1.base_.offset_ = Marshal.OffsetOf(external_layer_c1.GetType(), "base_").ToInt32();
            external_layer_c1.base_.cb_size_ = (uint)Marshal.SizeOf(external_layer_c1);

            IntPtr external_layer_conf = Marshal.AllocHGlobal(Marshal.SizeOf(external_layer_c1));
            Marshal.StructureToPtr(external_layer_c1, external_layer_conf, true);

            UInt32 external_r = NTSmartPublisherSDK.NT_PB_AddLayerConfig(publisher_handle_, 0, external_layer_conf,
                (int)NTSmartPublisherDefine.NT_PB_E_LAYER_TYPE.NT_PB_E_LAYER_TYPE_EXTERNAL_VIDEO_FRAME,
                0, IntPtr.Zero);

            Marshal.FreeHGlobal(external_layer_conf);
        }
        else if (video_option_ == (uint)NTSmartPublisherDefine.NT_PB_E_VIDEO_OPTION.NT_PB_E_VIDEO_OPTION_CAMERA)
        {
            CameraInfo camera = cameras_[cur_sel_camera_index_];
            NT_PB_VideoCaptureCapability cap = camera.capabilities_[cur_sel_camera_resolutions_index_];

            SetVideoCaptureDeviceBaseParameter(camera.id_.ToString(), (UInt32)cap.width_, (UInt32)cap.height_);
        }

        SetFrameRate((uint)video_fps_);

        Int32 type = 0; // software encoding
        Int32 encoder_id = 1;
        UInt32 codec_id = (UInt32)NTCommonMediaDefine.NT_MEDIA_CODEC_ID.NT_MEDIA_CODEC_ID_H264;
        Int32 param1 = 0;

        SetVideoEncoder(type, encoder_id, codec_id, param1);

        SetVideoQualityV2(CalVideoQuality(video_width_, video_height_, is_h264_encoder_));
        SetVideoBitRate(CalBitRate(video_fps_, video_width_, video_height_));
        SetVideoMaxBitRate(CalMaxKBitRate(video_fps_, video_width_, video_height_, false));
        SetVideoKeyFrameInterval(key_frame_interval_);

        if (is_h264_encoder_)
        {
            SetVideoEncoderProfile(1);
        }

        SetVideoEncoderSpeed(CalVideoEncoderSpeed(video_width_, video_height_, is_h264_encoder_));

        // Audio settings
        SetAuidoInputDeviceId(0);
        SetPublisherAudioCodecType(1);
        SetPublisherMute(is_mute_);
        SetEchoCancellation(0, 0);
        SetNoiseSuppression(0);
        SetAGC(0);
        SetVAD(0);
        SetInputAudioVolume(Convert.ToSingle(audio_input_volume_));
    }
After encoding and packaging, call the push interface to send the packaged data to the RTMP server in real time:
    public bool StartPublisher(String url)
    {
        if (CheckPublisherHandleAvailable() == false)
            return false;

        if (publisher_handle_ == IntPtr.Zero)
        {
            return false;
        }

        if (!String.IsNullOrEmpty(url))
        {
            NTSmartPublisherSDK.NT_PB_SetURL(publisher_handle_, url, IntPtr.Zero);
        }

        if (NTBaseCodeDefine.NT_ERC_OK != NTSmartPublisherSDK.NT_PB_StartPublisher(publisher_handle_, IntPtr.Zero))
        {
            if (0 == publisher_handle_count_)
            {
                NTSmartPublisherSDK.NT_PB_Close(publisher_handle_);
                publisher_handle_ = IntPtr.Zero;
            }

            is_publishing_ = false;
            return false;
        }

        publisher_handle_count_++;
        is_publishing_ = true;

        return true;
    }

    public void StopPublisher()
    {
        if (is_publishing_ == false)
            return;

        publisher_handle_count_--;

        NTSmartPublisherSDK.NT_PB_StopPublisher(publisher_handle_);

        if (0 == publisher_handle_count_)
        {
            NTSmartPublisherSDK.NT_PB_Close(publisher_handle_);
            publisher_handle_ = IntPtr.Zero;
        }

        is_publishing_ = false;
    }
For RTMP transmission, Event status must be called back to Unity so that Unity can handle network errors in real time:
Unity layer:
    // Declared on the wrapper
    public event Action<uint, string> OnLogEventMsg;

    // Subscribed from the Unity script
    publisher_wrapper_.OnLogEventMsg += OnLogHandle;

    private void OnLogHandle(uint arg1, string arg2)
    {
        Debug.Log(arg2);
    }
Wrapper layer:
    private void PbEventCallBack(IntPtr handle, IntPtr user_data,
        UInt32 event_id,
        Int64 param1,
        Int64 param2,
        UInt64 param3,
        UInt64 param4,
        [MarshalAs(UnmanagedType.LPStr)] String param5,
        [MarshalAs(UnmanagedType.LPStr)] String param6,
        IntPtr param7)
    {
        String event_log = "";

        switch (event_id)
        {
            case (uint)NTSmartPublisherDefine.NT_PB_E_EVENT_ID.NT_PB_E_EVENT_ID_CONNECTING:
                event_log = "Connecting";
                if (!String.IsNullOrEmpty(param5))
                {
                    event_log = event_log + " url:" + param5;
                }
                break;

            case (uint)NTSmartPublisherDefine.NT_PB_E_EVENT_ID.NT_PB_E_EVENT_ID_CONNECTION_FAILED:
                event_log = "Connection failed";
                if (!String.IsNullOrEmpty(param5))
                {
                    event_log = event_log + " url:" + param5;
                }
                break;

            case (uint)NTSmartPublisherDefine.NT_PB_E_EVENT_ID.NT_PB_E_EVENT_ID_CONNECTED:
                event_log = "Connected";
                if (!String.IsNullOrEmpty(param5))
                {
                    event_log = event_log + " url:" + param5;
                }
                break;

            case (uint)NTSmartPublisherDefine.NT_PB_E_EVENT_ID.NT_PB_E_EVENT_ID_DISCONNECTED:
                event_log = "Disconnected";
                if (!String.IsNullOrEmpty(param5))
                {
                    event_log = event_log + " url:" + param5;
                }
                break;

            default:
                break;
        }

        // Forward the event to whoever subscribed on the Unity side
        if (OnLogEventMsg != null)
            OnLogEventMsg.Invoke(event_id, event_log);
    }
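Most Unity APIs may only be called from the main thread, so if the native layer raises this callback on a thread of its own (an assumption here, since the source does not state which thread is used), a handler that touches UI objects needs the event handed over to the main thread first. A minimal sketch of such a hand-off follows; `EventPump`, `Post`, and `Drain` are hypothetical names and not part of the SDK.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper: queues work from any thread and runs it on the thread that calls Drain().
public sealed class EventPump
{
    private readonly ConcurrentQueue<Action> pending_ = new ConcurrentQueue<Action>();

    // Safe to call from any thread, for example from inside a native callback.
    public void Post(Action action)
    {
        if (action == null) throw new ArgumentNullException("action");
        pending_.Enqueue(action);
    }

    // Runs everything queued so far on the calling thread and returns how many actions ran.
    public int Drain()
    {
        int handled = 0;
        Action action;
        while (pending_.TryDequeue(out action))
        {
            action();
            handled++;
        }
        return handled;
    }
}
```

In this arrangement the callback would call `Post` with a lambda that raises `OnLogEventMsg`, and a MonoBehaviour would call `Drain` from its `Update` method, so that reconnect logic or UI updates run on Unity's main thread.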
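Summary
That is the general flow: capture Unity's audio and video data, encode and package it, send it to the RTMP server, and let the client pull the RTMP stream directly. Latency is at the millisecond level, which gives a good user experience in interactive scenarios such as digital humans.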