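Plenty of products on the market already offer this kind of feature, Kimi Chat among them. This article shows one way to implement it and hooks it up to WeChat.
1. Receiving and processing user messages
Detailed tutorial (how to receive user messages, how to reply to users, and so on): 【Python+微信】【企业微信开发入坑指北】1. 数据链路打通:接收用户消息处理并回复
1.1 Main code
Building on the code from that tutorial, we add the check below: if an incoming message contains "http://" or "https://", we treat it as a URL and run the processing described in this article, which is to crawl the page and then summarize its content.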
    if decrypt_data.get('Content', '').find("http://") != -1 or decrypt_data.get('Content', '').find("https://") != -1:
        response_content = await summary_url(decrypt_data.get('Content', ''))
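Admittedly, this check is crude and prone to false positives, but it is good enough for my needs right now. A real project would probably use something stricter and more standard, such as requiring the user to type a specific command to trigger the feature.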
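As a minimal sketch of a slightly stricter check, the first URL can be pulled out of the message with a regular expression instead of handing the whole message text to the summarizer. The names URL_PATTERN and extract_url are hypothetical and not part of the original code, and the pattern is a rough heuristic, not a full URL parser (for example, text glued to the URL without a space would be captured along with it).

```python
import re
from typing import Optional

# Rough heuristic: "http://" or "https://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def extract_url(content: str) -> Optional[str]:
    """Return the first URL found in a message, or None if there is none."""
    match = URL_PATTERN.search(content)
    if match is None:
        return None
    # Drop punctuation that commonly trails a URL inside a sentence.
    return match.group(0).rstrip(".,;:!?)")


url = extract_url("please summarize https://example.com/post?id=1, thanks")
```

With a helper like this, the handler could skip the summary step whenever extract_url returns None, and pass only the URL on otherwise.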
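1.2 Caveats
The same message can be delivered more than once as a retry, so I added a simple safeguard: a Messages dict stores every message received. If an incoming message is already in it, it is treated as a retry and the handler returns "success" right away. Only a message that is not yet in it counts as new and triggers the rest of the logic. This keeps the same request from being processed twice.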
    Messages = {}

    msg_id = decrypt_data.get('MsgId')
    if (msg_id in Messages.keys()):
        return "success"
    Messages[msg_id] = decrypt_data
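A plain dict like the one above grows for as long as the service runs. The following is a sketch of one possible variation, not the original implementation: it remembers only the most recent message IDs and evicts the oldest once a limit is reached. The class name SeenMessages, the method is_retry and the max_size parameter are all hypothetical.

```python
from collections import OrderedDict


class SeenMessages:
    """Remember the most recent message IDs, dropping the oldest when full."""

    def __init__(self, max_size: int = 1000) -> None:
        self._ids = OrderedDict()
        self._max_size = max_size

    def is_retry(self, msg_id: str) -> bool:
        """Return True if msg_id was seen before; otherwise record it."""
        if msg_id in self._ids:
            return True
        self._ids[msg_id] = None
        if len(self._ids) > self._max_size:
            self._ids.popitem(last=False)  # evict the oldest entry
        return False


seen = SeenMessages(max_size=2)
first_time = seen.is_retry("1001")   # new message, gets recorded
second_time = seen.is_retry("1001")  # same ID again, reported as a retry
```

The trade-off is that an ID evicted from the cache would be treated as new again, so the limit should be comfortably larger than the number of messages that can arrive within the retry window.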
For how to send messages to WeChat Work proactively, see the tutorial: 【Python+微信】【企业微信开发入坑指北】2. 如何利用企业微信API主动给用户发应用消息
    if decrypt_data.get('Content', '').find("http://") != -1 or decrypt_data.get('Content', '').find("https://") != -1:
        response_content = await summary_url(decrypt_data.get('Content', ''))
        weichat_app.send_text(content=response_content)
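Crawling and summarizing can take a while. One possible arrangement, sketched here under assumptions and not taken from the original code, is to answer the callback with "success" immediately and do the slow work in a background task that sends the result as an application message once it finishes. In this sketch summary_url and send_text are stand-ins for the real functions, and handle_message, summarize_and_send and background_tasks are hypothetical names.

```python
import asyncio

sent_messages = []        # stand-in for what weichat_app.send_text would deliver
background_tasks = set()  # keeps references so running tasks are not garbage-collected


async def summary_url(content: str) -> str:
    """Stand-in for the real crawl-and-summarize step."""
    await asyncio.sleep(0.01)
    return f"summary of: {content}"


def send_text(content: str) -> None:
    """Stand-in for weichat_app.send_text."""
    sent_messages.append(content)


async def summarize_and_send(content: str) -> None:
    response_content = await summary_url(content)
    send_text(response_content)


async def handle_message(decrypt_data: dict) -> str:
    content = decrypt_data.get("Content", "")
    if "http://" in content or "https://" in content:
        # Schedule the slow work and return without waiting for it.
        task = asyncio.create_task(summarize_and_send(content))
        background_tasks.add(task)
        task.add_done_callback(background_tasks.discard)
    return "success"


async def demo() -> str:
    reply = await handle_message({"Content": "https://example.com"})
    # Only the demo waits here, so the result can be inspected afterwards.
    await asyncio.gather(*background_tasks)
    return reply


reply = asyncio.run(demo())
```

In a real service the event loop keeps running between requests, so the final gather in demo would not be needed.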
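2. Backend handler (general-purpose crawler + summarization)
2.1 General-purpose crawler
Fetching the content of a web page is the crawler topic we covered earlier (【Python实用技能】爬虫升级之路:从专用爬虫到用AI Agent实现通用网络爬虫(适合小白)). To handle whatever URL a user sends, we need the advanced version: a general-purpose crawler implemented with an AI Agent.
2.1.1 Having the LLM write the crawler code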
    from metagpt.actions import Action
    from metagpt.logs import logger
    from metagpt.tools.web_browser_engine import WebBrowserEngine
    from metagpt.utils.common import CodeParser
    from metagpt.utils.parse_html import _get_soup


    def get_outline(page):
        soup = _get_soup(page.html)
        outline = []

        def process_element(element, depth):
            name = element.name
            if not name:
                return
            if name in ["script", "style"]:
                return

            element_info = {"name": element.name, "depth": depth}

            if name in ["svg"]:
                element_info["text"] = None
                outline.append(element_info)
                return

            element_info["text"] = element.string
            # Check if the element has an "id" attribute
            if "id" in element.attrs:
                element_info["id"] = element["id"]

            if "class" in element.attrs:
                element_info["class"] = element["class"]
            outline.append(element_info)
            for child in element.children:
                process_element(child, depth + 1)

        try:
            for element in soup.body.children:
                process_element(element, 1)
        except Exception:
            # For example, the page has no <body>; report an empty outline
            logger.error("get outline error")
            outline = []

        return outline


    PROMPT_TEMPLATE = """Please complete the web page crawler parse function to achieve the User Requirement. The parse \
    function should take a BeautifulSoup object as input, which corresponds to the HTML outline provided in the Context.

    ```python
    from bs4 import BeautifulSoup

    # only complete the parse function
    def parse(soup: BeautifulSoup):
        ...
        # Return the object that the user wants to retrieve, don't use print
    ```

    ## User Requirement
    {requirement}

    ## Context

    The outline of the html page to scrape is shown below:

    ```tree
    {outline}
    ```
    """


    class WriteCrawlerCode(Action):
        async def run(self, url, requirement):
            codes = {}
            codes[url] = await self._write_code(url, requirement)
            if codes[url] is None:
                return None
            # Return the url plus its crawler code in a fixed format
            return "\n".join(f"# {url}\n{code}" for url, code in codes.items())

        async def _write_code(self, url, query):
            page = await WebBrowserEngine().run(url)
            outline = get_outline(page)
            if len(outline) == 0:
                return None
            outline = "\n".join(
                f"{' '*i['depth']}{'.'.join([i['name'], *i.get('class', [])])}: {i['text'] if i['text'] else ''}"
                for i in outline
            )
            code_rsp = await self._aask(PROMPT_TEMPLATE.format(outline=outline, requirement=query))
            code = CodeParser.parse_code(block="", text=code_rsp)
            return code
2.1.2 Running the crawler code automatically
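    import sys
    from uuid import uuid4


    class RunCrawlerCode(Action):
        async def run(self, url, codes):
            # Take the code that follows the "# {url}" marker
            code, current = codes.rsplit(f"# {url}", maxsplit=1)
            name = uuid4().hex
            module = type(sys)(name)  # create an empty module object
            exec(current, module.__dict__)  # define parse() inside that module
            page = await WebBrowserEngine().run(url)
            data = getattr(module, "parse")(page.soup)
            print(data)
            return str(data)  # return the data as a string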
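The core trick in RunCrawlerCode is loading a string of source code into a throwaway module and then calling the parse function it defines. Here is a self-contained sketch of just that mechanism, with a hand-written string standing in for the LLM output and plain text standing in for the BeautifulSoup object. GENERATED_CODE and load_parse_function are hypothetical names.

```python
import sys
import types
from uuid import uuid4

# Stand-in for code written by the LLM; the real one takes a BeautifulSoup object.
GENERATED_CODE = '''
def parse(text):
    # Return the non-empty lines of the input, stripped of whitespace
    return [line.strip() for line in text.splitlines() if line.strip()]
'''


def load_parse_function(source: str):
    """Execute source inside a fresh module and return its parse function."""
    module = types.ModuleType(uuid4().hex)  # equivalent to type(sys)(name)
    exec(source, module.__dict__)
    return getattr(module, "parse")


parse = load_parse_function(GENERATED_CODE)
result = parse("Title\n\n  body text  \n")
```

Keep in mind that exec runs whatever the model produced with the full privileges of the service process, so generated code should be treated as untrusted input.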
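2.1.3 Defining the crawler engineer role
In the _act function, if the current action is WriteCrawlerCode, its result (the generated code) is added to memory. The action that follows reads it back with msg = self.rc.memory.get(k=1)[0].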
    from metagpt.roles import Role
    from metagpt.schema import Message


    class CrawlerEngineer(Role):
        ......

        def __init__(self, **kwargs) -> None:
            super().__init__(**kwargs)
            self.set_actions([WriteCrawlerCode, RunCrawlerCode])

        async def _think(self) -> None:
            ......

        async def _act(self) -> Message:
            """Perform an action as determined by the role.

            Returns:
                A message containing the result of the action.
            """
            todo = self.rc.todo
            if type(todo) is WriteCrawlerCode:
                resp = await todo.run(url=self.url, requirement=self.requirement)
                logger.info(resp)
                if resp is None:
                    return None
                # Store the generated code so the next action can fetch it
                self.rc.memory.add(Message(content=resp, role=self.profile))
                return resp

            msg = self.rc.memory.get(k=1)[0]
            resp = await todo.run(url=self.url, codes=msg.content)  # the return value must be a string
            logger.info(resp)
            return Message(content=resp, role=self.profile)  # resp must be a string; MetaGPT enforces this

        async def _react(self) -> Message:
            ......
2.1.4 Wrapping it all up for callers
    class CrawlerDataProvider():
        def __init__(self) -> None:
            pass

        async def run(self, url, requirement="Get all the text in the main body; if the body contains code, include the code as text too"):
            msg = "start"
            role = CrawlerEngineer(url = url, requirement = requirement)
            logger.info(msg)
            result = await role.run(msg)
            logger.info("\n=========================================\n")
            logger.info(result)
            return result
    import asyncio

    if __name__ == "__main__":
        url="https://mp.weixin.qq.com/s/2m8MrsCxf5boiH4Dzpphrg"
        requirement="Get the title and all the text in the main body; if the body contains code, include the code as text too"
        data_provider = CrawlerDataProvider()
        asyncio.run(data_provider.run(url, requirement))
2.2 Summarizing the text
    PROMPT_TEMPLATE = """Briefly summarize the following text:

    "{text}"

    Brief summary:"""


    class SummaryArticle():
        def __init__(self) -> None:
            pass

        def run(self, text):
            prompt = PROMPT_TEMPLATE.format(text = text)
            # openai_wrapper is defined elsewhere and is not shown in this article
            response_content = openai_wrapper.get_chat_completion(prompt)
            print("response content: ", response_content)
            return response_content
2.3 Putting it together
    import asyncio


    async def summary_url(url):
        text_provider = CrawlerDataProvider()
        text = await text_provider.run(url = url)
        summary = SummaryArticle()
        response_content = summary.run(text = text)
        return response_content


    if __name__ == "__main__":
        response = asyncio.run(summary_url(url = "https://mp.weixin.qq.com/s/L_gHW-_TIipmcyDcdQpZRA"))
        print(response)
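3. Pitfalls
(1) When the crawler code runs automatically, you may hit a "BeautifulSoup is not defined" error. The usual cause is that bs4 is missing from the runtime environment. If the error persists with bs4 installed, check that the generated code actually contains the line from bs4 import BeautifulSoup, since a missing import produces the same message. To install bs4:
    pip install bs4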
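(2) My server is very low-spec and runs extremely slowly, so fetching a page often times out or returns no content, which makes the whole thing quite unstable. Everything works fine on my own laptop. I have no fix for this yet.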
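This does not make a slow server any faster, but as a possible mitigation the fetch can be given a time limit and retried a few times, so the user at least gets a clear fallback reply instead of silence. The sketch below uses asyncio.wait_for from the standard library. run_with_retry, flaky_fetch and the timeout values are hypothetical, and flaky_fetch only simulates a slow first attempt.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_with_retry(
    make_call: Callable[[], Awaitable[str]],
    timeout_s: float,
    attempts: int,
) -> Optional[str]:
    """Call make_call up to `attempts` times, giving each try `timeout_s` seconds."""
    for _ in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # this attempt was too slow; try again
    return None  # every attempt timed out


calls = {"count": 0}


async def flaky_fetch() -> str:
    """Simulated page fetch: the first call is slow, later calls are fast."""
    calls["count"] += 1
    if calls["count"] == 1:
        await asyncio.sleep(0.5)
    return "page text"


result = asyncio.run(run_with_retry(flaky_fetch, timeout_s=0.05, attempts=3))
```

When run_with_retry returns None, the handler could reply with a short message asking the user to try again later.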
