概述

本发明旨在提供一种将多模态大模型集成到机器人中的方法和系统，从而赋予机器人更高水平的具身智能，并在展厅场景落地应用。具体包括以下几个方面：

1. 人机交互：通过语音唤醒、触屏开启对话或执行动作，使用语音活性检测判断录音结束，使用语音识别模型将用户语音转为文字输入大模型。最终将大模型输出转为语音播放，并触发对应动作，使机器人与用户实现流畅交互。

2. 大语言模型集成：将云端的百炼大语言模型（通义千问）集成至具身智能系统中，与控制算法控制的机械臂、导航算法控制的轮式底盘一起形成多智能体系统。能够理解自然语言输入，并生成适应上下文的语言输出，与用户进行多轮对话并判断出复杂的任务决策。

3. 多模态感知：通过结合机器人传感器数据（RGB-D摄像头、麦克风、定位），使大语言模型和视觉语言模型能够基于图片、语音、文字等模态信息感知和理解环境信息。

4. 记忆与决策模块：基于包括对话、动作、定位记录的短期记忆，和Dashvector向量数据库构成的长期记忆，使机器人能够根据用户指令和记忆做出行动决策。多智能体系统通过工具调用能力，调用身体部件实现运动、移动、语言输出等。

详述

1. 具身智能系统

(1) 语音唤醒

该节点负责响应麦克风内嵌的关键词监测系统的输出消息，通过ROS通信框架发送至决策引擎节点。决策引擎收到消息后杀死上一段对话进程，开启新的对话进程；或发送消息给机械臂控制节点触发动作。实现的功能包括启动对话录音、打断对话并立即开启新对话录音、启动动作、启动动作。

import subprocess
import rospy
from std_msgs.msg import String, Int32, Int8

# publish mesage to the LLM
def llm_pub(raw_prompt):
    global cloud_pub
    message = String()
    message.data = raw_prompt
    cloud_pub.publish(message)
    rate.sleep()


# stop the dialog process
def terminate():
    # kill all aplay sound
    subprocess.run(['pkill', '-9', 'aplay'])
    # kill TTS
    llm_pub('STOP')
    rate.sleep()


########################################################################
# ROS subscriber node
########################################################################
def wake_callback(ros_msg) -> None:
    """
    Callback function triggered upon a wake-up signal. Initiates or restarts the dialog process.

    Args:
    - msg (Any): The message containing wake-up signal information.
    """
    # start a new dialog
    global arm_pub, rate

    # wave hands and start conversation
    if ros_msg.data == 3:
        # stop the dialog
        terminate()
        # wave
        send_msg = Int8()
        send_msg.data = 4
        arm_pub.publish(send_msg)
        rate.sleep()
        rospy.loginfo(f"/arm_control 4 %s" % rospy.get_time())
        # play the greeting sound
        subprocess.run(["aplay", f"cozy_voice/hello.wav"])

    # perform taichi
    if ros_msg.data == 4:
        # stop all sound
        terminate()
        # do taichi
        send_msg = Int8()
        send_msg.data = 3
        arm_pub.publish(send_msg)
        rate.sleep()
        # play music
        subprocess.Popen(['aplay', 'cozy_voice/taiji.wav'])
        llm_pub('arm_control:3')


def main():
    # Initialize a ROS node with the name 'wake_subscriber'
    rospy.init_node('wake_subscriber')
    # Subscribe to the 'wake' topic, expecting messages of type Int32, and specify the callback function
    rospy.Subscriber('wake', Int32, wake_callback)
    # set up publishers for LLM and arms
    global cloud_pub, arm_pub, rate
    cloud_pub = rospy.Publisher('/cloud_llm', String, queue_size=10)
    arm_pub = rospy.Publisher('/arm_control', Int8, queue_size=10)
    rate = rospy.Rate(10)
    # The spin() function keeps Python from exiting until this ROS node is stopped
    rospy.spin()


if __name__ == '__main__':
    main()

(2) 触屏输入

该节点展示由图片初始消息引导用户唤醒对话，或引导用户点击屏幕按钮输入常见问题，并将文字输入决策引擎由大模型回答。触屏界面同时负责展示用户与大模型的文字输出，便于用户在嘈杂环境中辨认大模型输出内容。

(3) 语音识别

该节点在唤醒开启对话录音后，通过语音活性检测（VAD）判断存在人类声音时启动录音。如果人声消失，则停止录音，并由语音识别模型Paraformer将音频转换为文字。

import numpy as np
import pyaudio
from vad import VoiceActivityDetection
import subprocess
import collections
from array import array
from random import randint
import sys
import signal
import time
import torch
import urllib
from bs4 import BeautifulSoup
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
import rospy
from std_msgs.msg import String, Int8
from record import listen
import os

########################################################################
# initialization
########################################################################
vad = VoiceActivityDetection()
# Global variables to manage recording state
recording_thread = None

########################################################################
# recording audio
########################################################################


# Main function to record
def record():
    audio = listen()
    if audio:
        subprocess.run(["aplay", "cozy_voice/think.wav"])
        start = time.time()
        raw_prompt = paraformer(audio, hotword=hotword)[0]['text']
        end = time.time()
        print('Time: ' + str(end - start))
        print(raw_prompt)
        if raw_prompt != '':
            llm_pub('llm:' + raw_prompt)


# stop the dialog process
def terminate():
    # kill all aplay sound
    subprocess.run(['killall', 'aplay'])
    # kill TTS
    llm_pub('STOP')
    rate.sleep()

# publish prompt
def pub_prompt(msg, raw_prompt):
    global cloud_pub
    cloud_pub.publish(msg)
    rate.sleep()


# publish ros message to llm
def llm_pub(msg):
    global cloud_pub, head_pub
    # publish message to llm
    message = String()
    message.data = msg
    cloud_pub.publish(message)
    # publish message to head to nod
    nod = Int8()
    nod.data = 1
    head_pub.publish(nod)
    rate.sleep()


def main():
    # Initialize a ROS node with the name 'wake_subscriber'
    rospy.init_node('mic')
    # Initialize publishers for LLM
    global cloud_pub
    rate = rospy.Rate(10)
    print('START')
    # keep listening
    while True:
        try:
            record()
        except Exception as e:
            print(f"Error encountered: {e}")
            pass


if __name__ == '__main__':
    main()

(4) 决策引擎

该节点由大语言模型(Qwen)和视觉语言模型(Qwen-VL)驱动，结合短期记忆与长期记忆对于用户的指令做出决策、执行任务。主要模块包括：

短期记忆

申昊科技人形机器人小昊-动作规划演示视频

- 动作记忆：记录动作类型以及触发时间，从而规划一系列复杂动作的顺序。

- 对话记忆：记录多轮对话的上下文，从而保持对话的连贯性和理解深度。

- 定位记忆：用于记录机器人当前的点位以获取对应位置的上下文信息，从而根据环境信息与用户交互。

长期记忆

申昊科技人形机器人小昊-识人聊天演示视频

- 在向量数据库中储存企业知识、行业知识等信息。通过用户指令的文本向量检索相关信息，并将检索结果作为上下文输入大模型生成对应回答。实现对专业知识的问答能力。

- 在向量数据库中检索人脸图片向量，返回人员信息的检索结果，实现大模型与对话人的定制话题问答，模拟人类认识人的能力。

from dashscope import TextEmbedding
from dashscope.api_entities.dashscope_response import Role
from dashvector import Client

########################################################################
# RAG
########################################################################
# generate text embedding
def generate_embeddings(doc):
    rsp = TextEmbedding.call(model=TextEmbedding.Models.text_embedding_v2, input=doc)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(doc, list) else embeddings[0]


# search for relevant text in the vector database
def search_relevant_doc(question, collection_name, topk):
    client = Client(api_key=os.getenv('DASHVECTOR_API_KEY'), endpoint=os.getenv('DASHVECTOR_ENDPOINT'))
    collection = client.get(collection_name)
    rsp = collection.query(question, output_fields=['raw'], topk=topk)
    # for face recognition, select the closest result
    if topk == 1:
        result = rsp.output[0]
        raw = rsp.output[0].fields['raw']
        score = result.score
        print(score)
        if score > 1.1:
            print(raw)
            return False
        else:
            return raw
    # for RAG, select the closest k results
    else:
        result = [raw.fields['raw'] for raw in rsp.output]
        return ';'.join(result)


# RAG
def rag(question, collection_name, topk):
    # search relevant document in dashvector collection
    embedding = generate_embeddings(question)
    context = search_relevant_doc(embedding, collection_name, topk)
    print(context)
    prompt = f'请基于```内的内容回答问题。```{context}```我的问题是：{question}。'
    return prompt

- 大语言模型: 调用云端百炼的大模型通义千问生成热启动的实例，输入用户指令，并传回流式回答文字，并输入云端的语音合成（TTS）模型CosyVoice生成语音。如果需要调用工具，如视觉、机械臂与导航，大语言模型会生成与任务相匹配的特殊符号。决策引擎识别到特殊符号后会通过ROS通信框架发送至对应节点执行任务。

from random import randint
from dashscope import Generation
from dashscope.api_entities.dashscope_response import Role

# chat with LLM
def chat(prompt):
    msg = memory['messages']
    msg.append({'role': Role.USER, 'content': prompt})
    # if the length of message exceeds 3k, pop the oldest round
    msg = check_len(msg)
    # Use language generation model to generate a response
    responses = Generation.call(
        model='qwen-plus',
        messages=msg,
        seed=randint(1, 100),
        enable_search=True,
        result_format='message',
        stream=True,
        incremental_output=True
    )
    # Process the generated response
    full_content = stream_tts(responses, prompt)
    # load the reply to the message
    msg.append({'role': Role.ASSISTANT, 'content': full_content})
    memory['messages'] = msg

- 视觉语言模型：当大语言模型判断需要调用视觉执行任务时，决策引擎会调用相机拍摄彩图感知物理世界环境，并连同照片和用户指令一起输入视觉语言模型，与大语言模型类似地传回流式回答。

from dashscope import MultiModalConversation
from dashscope.api_entities.dashscope_response import Role
import numpy as np
import mmap

# initialize camera
fd = os.open('/dev/shm/camera_7', os.O_RDONLY)
mmap_data = mmap.mmap(fd, length=640 * 480 * 3, access=mmap.ACCESS_READ)

def snap(photo_name):
    frame = np.frombuffer(mmap_data, dtype=np.uint8).reshape((480, 640, 3))
    cv2.imwrite(photo_name, frame)
    subprocess.run(["aplay", "cozy_voice/snap.wav"])

# chat with vision language model
def vl_chat(prompt):
    # take a photo
    photo_name = '/home/robot/chatbot/image/color_image.png'
    snap(photo_name)
    # load the prompt to the message
    msg = memory['vl']
    text_msg = memory['messages']
    text_msg.append({'role': Role.USER, 'content': prompt})
    msg.append({'role': Role.USER, 'content': [{'image': 'file://' + photo_name},
                                               {'text': '这张照片显示你刚拍摄的眼前的真实环境。请你观察照片回答问题:' + prompt}]})
    # get the reply from tongyi vl
    responses = MultiModalConversation.call(model='qwen-vl-max', messages=msg, stream=True, incremental_output=True)
    full_content = stream_tts(responses, prompt, vl=True)
    # load the reply to the message
    msg.append({'role': Role.ASSISTANT, 'content': [{'text': full_content}]})
    # load the reply to text message
    text_msg.append({'role': Role.ASSISTANT, 'content': full_content})
    # check if the length of text messages exceeds 5k
    text_msg = check_len(text_msg)
    msg = check_len(msg)
    memory['vl'] = msg
    memory['messages'] = text_msg
    # end the process
    terminate(llm=False, vl=True)

- 大模型+工具调用实现动作:

当大语言模型判断需要移动寻找物品时，会使用深度相机对目标物体测距，计算旋转角度。并将前行距离与角度通过ROS通信框架发送至主控算法，控制轮式底盘移动至目标处。
当大语言判断需要执行机械臂动作时，发送相应指令至机械臂控制算法以触发特定动作。
当大语言模型判断需要移动到地图上的特定点位时，会发送点位消息给导航系统，导航系统将测算自由路径并移动至点位。到达点位后，决策引擎接收导航系统的确认消息，并将该点位的环境信息以文字形式发送给大语言模型，使其能够感知周围环境。

2. 实现功能

迎宾表演

用户说出问候唤醒词，触发机器人挥手动作，并回复问候词，同时开启对话。用户说出表演唤醒词，触发机器人太极动作及音乐播放，开启娱乐表演。

申昊科技人形机器人小昊-太极表演

对话问答

点击触屏常见问题按钮或直接问询，触发机器人对话。由大模型驱动的具身智能系统使用语音、动作回复用户。

申昊科技人形机器人小昊-展会问答

目标追踪

用户下发指令要求机器人寻找物品时，机器人使用大语言模型理解任务，确认需要寻找的物品。决策引擎使用相机拍照，并使用视觉语言模型对物品定位，返回物品在照片中的像素坐标。决策引擎调用深度相机计算移动至目标物品的角度和距离，驱动轮式底盘到目标处。

申昊科技人形机器人小昊-目标追踪

展厅讲解

用户下发指令要求机器人移动至特定展区时，机器人使用大语言模型理解需要前往的点位，将点位发消息给导航系统。导航系统规划自由路径，移动至目标点位。决策引擎在收到抵达点位的消息后，系统发送介绍词自动给大模型作为上下文，实现与用户的对答。

申昊科技人形机器人小昊-展厅讲解

由通义千问驱动的人形机器人具身智能Multi-Agent系统

概述

详述

1. 具身智能系统

(1) 语音唤醒

(2) 触屏输入

(3) 语音识别

(4) 决策引擎

短期记忆

长期记忆

2. 实现功能

迎宾表演

对话问答

目标追踪

展厅讲解

通义大模型

热门文章

最新文章

相关课程

相关电子书

相关实验场景

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

由通义千问驱动的人形机器人具身智能Multi-Agent系统

概述

详述

1. 具身智能系统

(1) 语音唤醒

(2) 触屏输入

(3) 语音识别

(4) 决策引擎

短期记忆

长期记忆

2. 实现功能

迎宾表演

对话问答

目标追踪

展厅讲解

通义大模型

热门文章

最新文章

相关课程

相关电子书

相关实验场景