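- Hi everyone, I'm 同学小张. I regularly share AI knowledge and hands-on examples.
- Likes and follows are welcome 👏. I keep learning and keep publishing practical content.
- Let's exchange ideas 💬 and improve together 💪.
- You can also find me on WeChat Official Accounts by searching 【同学小张】 🙏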
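There are many AI + search apps and plugins on the market, and I have long wanted to learn how they work under the hood. Today we will do exactly that and build our own AI search engine from scratch. The end result is a chat-style page with a sidebar of search results: type a keyword to get the result list and a model reply, then click any result to have the model summarize that page.
Enough talk, let's get started.
The code in this article is based on https://mp.weixin.qq.com/s/6F22Mls7zYJw5xJE-d6GrA. I modified the original code to work with the OpenAI API.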
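0. Framework
Let's sort out the framework first.
The server side uses Python with the Flask framework, and the front end is plain HTML. The HTML page is rendered through Flask's render_template function, a Flask helper that renders Jinja2 templates. Jinja2 is a template engine for Python that lets you use variables and Python-like expressions inside an HTML file.
The code is as follows: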
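    from flask import Flask, render_template, request, jsonify

    app = Flask(__name__)
    history = []  # in-memory chat history shared by all requests

    @app.route('/', methods=['GET'])
    def index():
        # Render templates/ai_search.html and pass the chat history to Jinja2
        chat_history = history
        return render_template('ai_search.html', history=chat_history)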
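In the code, the HTML page is named "ai_search.html".
Note that when rendering a page this way, the HTML file must be placed in a folder named templates. Otherwise Flask cannot find the file and raises an error (jinja2.exceptions.TemplateNotFound).
In other words, the project directory should look like this:

    (project root)
    ├── ai_search.py
    └── templates
        └── ai_search.html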
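1. Server side (Python + Flask)
The server side wraps each piece of functionality in a Flask endpoint and handles the incoming requests accordingly.
1.1 Search endpoint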
    @app.route('/search', methods=['GET', 'POST'])
    def search():
        # Read the keyword from the form body (POST) or the query string (GET)
        if request.method == 'POST':
            keyword = request.form['keyword']
        elif request.method == 'GET':
            keyword = request.args.get('keyword', '')
        else:
            keyword = ''
        results = crawl_pages(keyword)
        # Build one <li> per result; clicking it calls handleLinkClick(url) in the page
        output = ""
        for result in results:
            output += f"<li><a id='myID' href='javascript:void(0);' onclick='handleLinkClick(\"{result['url']}\")'>{result['title']}</a></li><br>"
        return output
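The Search endpoint receives the keyword entered by the user and then calls crawl_pages to fetch the search results, which it returns as a string of <li> elements.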
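Because the Search endpoint builds HTML by string concatenation, a title or URL that contains quotes or angle brackets could break the markup. As a minimal sketch (not part of the original code), the helper below, with the hypothetical name render_result_items, escapes both values using only the standard library. It assumes the same list of {'title', 'url'} dictionaries that crawl_pages returns.

```python
import html
import json


def render_result_items(results):
    """Build the sidebar <li> items from [{'title': ..., 'url': ...}, ...]."""
    items = []
    for result in results:
        # json.dumps produces a valid JavaScript string literal for the URL;
        # html.escape then makes it safe inside the single-quoted attribute.
        js_arg = html.escape(json.dumps(result["url"]), quote=True)
        title = html.escape(result["title"])
        items.append(
            "<li><a href='javascript:void(0);' "
            f"onclick='handleLinkClick({js_arg})'>{title}</a></li>"
        )
    return "".join(items)
```

In search(), returning render_result_items(results) could then stand in for the manual loop.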
1.1.1 crawl_pages function

    import mechanicalsoup
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse, parse_qs, quote

    def crawl_pages(query_text, page_num=2):
        browser = mechanicalsoup.Browser()
        # Percent-encode the keyword; Chinese characters, for example,
        # must be encoded before they can be used as a URL parameter
        query_text_encoded = quote(query_text)
        results = []
        for page_index in range(1, page_num + 1):
            url = f"https://search.cctv.com/search.php?qtext={query_text_encoded}&type=web&page={page_index}"
            page = browser.get(url)
            soup = BeautifulSoup(page.text, 'html.parser')
            # Result links are the <a> tags whose id starts with "web_content_"
            web_content_links = soup.find_all('a', id=lambda x: x and x.startswith('web_content_'))
            for link in web_content_links:
                # The real page address is carried in the "targetpage" query parameter
                target_page = parse_qs(urlparse(link['href']).query).get('targetpage', [None])[0]
                results.append({'title': link.text, 'url': target_page})
        return results
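This function sends the keyword to the search page of one fixed site, fetches the first two pages of results, and scrapes the title and URL of every result from them.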
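The URL handling inside crawl_pages can be tried on its own, without any network access. The sketch below mirrors the two steps: percent-encoding the keyword into the search URL, and reading the targetpage parameter back out of a result link. The helper names and sample_href are hypothetical, and the real link format on the CCTV results page may differ.

```python
from urllib.parse import parse_qs, quote, urlparse


def build_search_url(keyword, page_index):
    """Same URL pattern as crawl_pages, with the keyword percent-encoded."""
    return (
        "https://search.cctv.com/search.php"
        f"?qtext={quote(keyword)}&type=web&page={page_index}"
    )


def extract_target_page(href):
    """Return the decoded 'targetpage' query parameter, or None if absent."""
    query = parse_qs(urlparse(href).query)
    return query.get("targetpage", [None])[0]


# Hypothetical result link, for illustration only.
sample_href = (
    "https://search.example/link.php"
    "?targetpage=https%3A%2F%2Fnews.example%2Fa.shtml&point=web_content"
)

print(build_search_url("人工智能", 1))
print(extract_target_page(sample_href))
```

Note that parse_qs already decodes the percent-encoded value, so the extracted URL can be used directly.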
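(1) url = f"https://search.cctv.com/search.php?qtext={query_text_encoded}&type=web&page={page_index}" specifies which page to search. Requesting this link is equivalent to typing the keyword into the search box on the CCTV site.
(2) A simple scraper then extracts the URL and title of every result on that results page. For example, target_page = parse_qs(urlparse(link['href']).query).get('targetpage', [None])[0] extracts the URL, which is carried in the targetpage query parameter of each result link.
(3) You end up with a list of URLs. Once they are returned to the Search endpoint, the line output += f"<li><a id='myID' href='javascript:void(0);' onclick='handleLinkClick(\"{result['url']}\")'>{result['title']}</a></li><br>" assembles them into HTML that is inserted into the page for display. This is what fills the sidebar.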
1.2 generate-text endpoint

    @app.route('/generate-text', methods=['POST'])
    def generate_text_api():
        # The front end posts JSON of the form {"prompt": "..."}
        prompt = request.json['prompt']
        result = generate_text(prompt)
        return jsonify(result)
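This endpoint passes the user's keyword to the large language model as the prompt and lets the model reply based on it. There is no special processing in between. The one thing worth noting is history.append({"user": prompt, "bot": generated_text}), which adds the exchange to the chat history.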
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    history = []

    def get_openai_chat_completion(messages, temperature, model="gpt-3.5-turbo-1106"):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
        )
        return response

    def generate_text(prompt, temperature=0.5):
        messages = [
            {
                "role": "user",
                "content": prompt,
            }
        ]
        response = get_openai_chat_completion(messages=messages, temperature=temperature)
        generated_text = response.choices[0].message.content
        # Add the user input and the model output to the history
        history.append({"user": prompt, "bot": generated_text})
        return {"status": "success", "response": generated_text}
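As written, generate_text sends only the current prompt, so the model never sees history. The list is used purely for rendering the page. If you wanted follow-up questions to have context, one possible approach (an assumption on my part, not something the original code does) is to replay the last few turns as chat messages. build_messages and max_turns are hypothetical names.

```python
def build_messages(history, prompt, max_turns=3):
    """Turn the last few history entries plus the new prompt into chat messages."""
    # history[-0:] would return the whole list, so handle 0 explicitly.
    recent = history[-max_turns:] if max_turns > 0 else []
    messages = []
    for turn in recent:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["bot"]})
    messages.append({"role": "user", "content": prompt})
    return messages
```

Its return value could be passed as the messages argument of get_openai_chat_completion in place of the single-message list.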
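Note that the reply produced by generate-text has nothing to do with the search results: the model only sees the keyword itself.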
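1.3 page_content endpoint
This endpoint fetches the content of a page given its URL. It is a simple scraper that extracts the text and the image links from the page.

    @app.route('/page_content')
    def page_content():
        url = request.args.get('url', '')
        if not url:
            return '缺少 url 参数'  # "missing url parameter"
        browser = mechanicalsoup.Browser()
        page = browser.get(url)
        page.encoding = 'utf-8'  # force the page encoding to utf-8
        soup = BeautifulSoup(page.text, 'html.parser')
        all_text = ''
        all_images = []
        # Collect the text of all paragraph, heading, and span elements
        for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'span']):
            all_text += element.get_text() + ' '
        # Collect all image links; prefixing "https:" assumes
        # protocol-relative src values such as "//host/pic.jpg"
        for img in soup.find_all('img'):
            img_src = img.get('src')
            if img_src:
                all_images.append("https:" + img_src)
        # 文本内容 = text content, 图片链接 = image links
        return f"文本内容: {all_text}<br>图片链接: {', '.join(all_images)}"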
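The "https:" prefix in page_content only produces a valid link when src is protocol-relative (for example //host/pic.jpg). A more general option, sketched here as an assumption and not as the original behavior, is urllib.parse.urljoin, which resolves any kind of src against the page URL. The function name and the example.com addresses are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve each non-empty img src against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in srcs if src]


page_url = "https://news.example.com/2024/a.shtml"
srcs = [
    "//img.example.com/x.jpg",        # protocol-relative
    "https://cdn.example.com/y.png",  # already absolute
    "/static/z.gif",                  # root-relative
    "pic/w.jpg",                      # relative to the page
    None,                             # <img> without src
]
for resolved in absolute_image_urls(page_url, srcs):
    print(resolved)
```

Inside page_content, the call would be urljoin(url, img_src), with urljoin added to the urllib.parse import.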
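2. Front end (HTML)
2.1 What happens after the user enters a keyword
Let's first look at what the front-end code does when the user clicks the submit button. The key lines are these:

    inputForm.addEventListener('submit', async (event) => {
      // ......
      const aa = document.getElementById('listView');
      aa.innerHTML = await getA(userInput);            // fill the sidebar
      const response = await generateText(userInput);  // ask the model
      hideTypingAnimation(userMessage);
      // ......
    });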
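As you can see, when the user clicks submit, the getA function is called first:

    async function getA(prompt) {
      // encodeURIComponent keeps spaces, "&", and non-ASCII keywords intact in the query string
      const response = await fetch(SERVER_URL + `/search?keyword=${encodeURIComponent(prompt)}`, {
        method: 'GET',
        headers: { 'Content-Type': 'application/json' }
      });
      return await response.text();
    }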
getA calls the server's Search endpoint, which searches the fixed site for the keyword and returns the list of titles and URLs.
Right after that, generateText is called:

    async function generateText(prompt) {
      const response = await fetch(SERVER_URL + '/generate-text', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt })
      });
      return await response.json();
    }

generateText calls the server's generate-text endpoint so that the model produces a reply.
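To exercise the two endpoints without the browser, the same requests can be built in Python. This is only a sketch using the standard library. BASE_URL assumes Flask's default development address, and the helper names are hypothetical. Note that urlencode percent-encodes the keyword, which the query string needs for Chinese text or spaces.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:5000"  # assumption: Flask's default dev address


def search_request(keyword):
    """GET /search?keyword=... as sent by getA in the page."""
    return Request(f"{BASE_URL}/search?{urlencode({'keyword': keyword})}", method="GET")


def generate_text_request(prompt):
    """POST /generate-text with a JSON body, as sent by generateText."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return Request(
        f"{BASE_URL}/generate-text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the server running, passing either object to urllib.request.urlopen sends the request. The first returns the <li> markup for the sidebar and the second returns the JSON reply.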
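2.2 What happens after the user clicks a title in the sidebar
When the user clicks one of the titles in the sidebar, the following runs:

    async function handleLinkClick(link) {
      const content = await getPageContent(link);
      // ......
      const response = await generateText("总结内容:" + content);
      // ......
    }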
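First, getPageContent is called. It goes through the server's page_content endpoint to scrape all the text and image links at that URL.
Then generateText calls the server's generate-text endpoint with the prompt "总结内容:" ("Summarize the content:") followed by the scraped text, so the model returns a summary of the page, which is shown in the chat.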
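A scraped page can be much longer than the model's context window allows, in which case the summary request may be rejected. Here is a minimal sketch of one way to guard against that, assuming a simple character budget is good enough: collapse whitespace and cut the text before adding the prompt prefix. The function name and the 6000-character default are illustrative assumptions, not values from the original code.

```python
def build_summary_prompt(page_text, max_chars=6000, prefix="总结内容:"):
    """Collapse whitespace, cap the length, and prepend the summary instruction."""
    text = " ".join(page_text.split())  # newlines and runs of spaces become one space
    if len(text) > max_chars:
        text = text[:max_chars]         # assumption: a hard cut is acceptable
    return prefix + text
```

A helper like this could run on the server just before generate_text is called, so the front end stays unchanged.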
3. Complete code
3.1 ai_search.html

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="UTF-8">
      <meta name="viewport" content="width=device-width, initial-scale=1.0">
      <title>Chat with AI</title>
      <style>
        body {
          display: flex;
          flex-direction: column;
          height: 100vh;
          margin: 0;
          font-family: Arial, sans-serif;
        }
        .website-container {
          position: fixed;
          top: 0;
          right: 0;
          width: 350px;
          height: 100%;
          border: 1px solid #ccc;
          overflow-y: auto;
          background-color: #f9f9f9;
          padding: 10px;
        }
        .chat-container {
          height: 100%;
          width: 85%;
          overflow: hidden;
          overflow-y: auto;
          padding: 10px;
          margin-right: 220px; /* leave room for the right sidebar */
        }
        .chat-container::-webkit-scrollbar {
          display: none;
        }
        .avatar-user {
          width: 40px;
          height: 40px;
          background-color: #7fb8e7; /* user avatar color */
          border-radius: 50%; /* round avatar */
          margin-left: 10px; /* gap between avatar and message */
        }
        .avatar-bot {
          width: 40px;
          height: 40px;
          right: 0;
          background-color: #28a745; /* bot avatar color */
          border-radius: 50%; /* round avatar */
          margin-right: 10px; /* gap between avatar and message */
          object-fit: cover; /* keep the avatar from distorting */
        }
        .message {
          display: flex;
          align-items: center; /* vertically center message and avatar */
          margin-bottom: 1rem;
        }
        .message-text {
          padding: 10px;
          word-wrap: break-word;
          border-radius: 6px;
          max-width: 70%;
          margin: 100px;
        }
        .message-text-user {
          padding: 10px;
          border-radius: 6px;
          max-width: 70%;
          margin: 100px;
          word-wrap: break-word;
          background-color: #ececec;
        }
        .user-message {
          display: flex;
          justify-content: flex-end;
        }
        .bot-message .message-text {
          background-color: #2ea44f;
          color: white;
        }
        .input-container {
          position: fixed;
          bottom: 0;
          left: 0;
          width: calc(100% - 220px); /* account for the right sidebar */
          display: flex;
          align-items: center;
          background-color: #f9f9f9;
          padding: 10px;
        }
        .input-field {
          flex-grow: 1;
          padding: 0.75rem;
          border: 1px solid #d1d5da;
          border-radius: 6px;
          margin-right: 1rem;
        }
        .send-button {
          padding: 0.75rem 1rem;
          background-color: #2ea44f;
          color: white;
          border: none;
          border-radius: 6px;
          cursor: pointer;
        }
        .del-button {
          padding: 0.75rem 1rem;
          background-color: #aeaeae;
          color: white;
          border: none;
          margin-right: 10px;
          border-radius: 6px;
          cursor: pointer;
        }
        .send-button:disabled {
          opacity: 0.5;
          cursor: not-allowed;
        }
        .typing-indicator {
          position: absolute;
          margin-bottom: 50px;
          font-size: 0.8rem;
          color: #586069;
        }
        .typing:before,
        .typing:after {
          content: '';
          display: inline-block;
          width: 0.75rem;
          height: 0.75rem;
          border-radius: 50%;
          margin-right: 0.25rem;
          animation: typing 1s infinite;
        }
        @keyframes typing {
          0% { transform: scale(0); }
          50% { transform: scale(1); }
          100% { transform: scale(0); }
        }
        /* sidebar list styles */
        .listView {
          list-style-type: none;
          margin: 0;
          padding: 0;
        }
        .listView li {
          background-color: #f4f4f4;
          padding: 10px;
          margin-bottom: 5px;
          box-shadow: 2px 2px 5px rgba(0, 0, 0, 0.1);
          transition: box-shadow 0.3s ease;
        }
        .listView li:hover {
          box-shadow: 2px 2px 10px rgba(0, 0, 0, 0.2);
        }
        .listView li a {
          text-decoration: none;
          color: #333;
          display: block;
          transition: color 0.3s ease;
        }
        .listView li a:hover {
          color: #ff6600;
        }
      </style>
    </head>
    <body style="display: flex; flex-direction: column; height: 100vh;">
      <div id="website-container" class="website-container">
        <ul class="listView" id="listView"></ul>
      </div>
      <div style="height: 90%; width:80%; overflow-y: auto; display: flex; flex-direction: column;">
        <ul class="chat-container" id="chat-container">
          {% for item in history %}
            {% if loop.index == 1 %}
              <!-- The first message could get special treatment here -->
              <li class="message user-message">
                <div class="message-text-user">{{ item.user }}</div> <!-- user message goes here -->
                <div class="avatar-user"></div>
              </li>
              <li class="message bot-message">
                <div class="avatar-bot"></div>
                <div class="message-text">{{ item.bot }}</div> <!-- bot message goes here -->
              </li>
            {% else %}
              <!-- All other messages are handled normally -->
              <li class="message user-message">
                <div class="message-text-user">{{ item.user }}</div>
                <div class="avatar-user"></div>
              </li>
              <li class="message bot-message">
                <div class="avatar-bot"></div>
                <div class="message-text">{{ item.bot }}</div>
              </li>
            {% endif %}
          {% endfor %}
        </ul>
      </div>
      <!-- Button labels: "清除" = Clear, "搜索" = Search; placeholder: "You search, I find" -->
      <form class="input-container" id="input-form" method="POST"
            style="position: fixed; bottom: 0; left: 0; width: 65%;">
        <button type="button" class="del-button" id="del-button" style="width: 100px;" onclick='del()'>清除</button>
        <input type="text" placeholder="你负责搜,我负责找" class="input-field" id="input-field" name="prompt"
               autocomplete="off" style="width: calc(100% - 100px);">
        <button type="submit" class="send-button" id="send-button" disabled style="width: 100px;">搜索</button>
      </form>
      <script>
        const SERVER_URL = '';
        const inputForm = document.getElementById('input-form');
        const inputField = document.getElementById('input-field');
        const chatContainer = document.getElementById('chat-container');

        // Enable the submit button only when there is input
        inputField.addEventListener('input', () => {
          const userInput = inputField.value.trim();
          document.getElementById('send-button').disabled = !userInput;
        });

        inputForm.addEventListener('submit', async (event) => {
          event.preventDefault();
          const userInput = inputField.value.trim();
          const chatContainer = document.getElementById('chat-container');
          if (!userInput) {
            return;
          }
          const userMessage = createMessageElement(userInput, 'user-message', "message-text-user", "avatar-user");
          chatContainer.appendChild(userMessage);
          inputField.value = '';
          chatContainer.scrollTop = chatContainer.scrollHeight;
          inputField.disabled = true;
          document.getElementById('send-button').disabled = true;
          showTypingAnimation(userMessage);
          // Fill the sidebar with search results, then ask the model
          const aa = document.getElementById('listView');
          aa.innerHTML = await getA(userInput);
          const response = await generateText(userInput);
          hideTypingAnimation(userMessage);
          if (response.status === 'success') {
            const botResponse = createMessageElement(response.response, 'bot-message', "message-text", "avatar-bot");
            chatContainer.appendChild(botResponse);
            printMessageText(botResponse);
          } else {
            alert(response.message);
          }
          inputField.disabled = false;
          inputField.focus();
        });

        async function generateText(prompt) {
          const response = await fetch(SERVER_URL + '/generate-text', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ prompt })
          });
          return await response.json();
        }

        async function getA(prompt) {
          // encodeURIComponent keeps spaces, "&", and non-ASCII keywords intact
          const response = await fetch(SERVER_URL + `/search?keyword=${encodeURIComponent(prompt)}`, {
            method: 'GET',
            headers: { 'Content-Type': 'application/json' }
          });
          return await response.text();
        }

        function createMessageElement(text, className, name, bot) {
          const message = document.createElement('li');
          message.classList.add('message', className, 'typing');
          if (bot == "avatar-bot") {
            message.innerHTML = `
              <div class="${bot}"></div>
              <div class="${name}">${text}</div>
            `;
          } else {
            message.innerHTML = `
              <div class="${name}">${text}</div>
              <div class="${bot}"></div>
            `;
          }
          return message;
        }

        function showTypingAnimation(element) {
          const chatContainer = document.getElementById('chat-container');
          chatContainer.scrollTop = chatContainer.scrollHeight + 10;
          const rect = element.getBoundingClientRect();
          const topPosition = rect.top + window.scrollY + rect.height;
          const leftPosition = rect.left + window.scrollX;
          const typingIndicator = document.createElement('div');
          typingIndicator.classList.add('typing-indicator');
          typingIndicator.style.top = `${topPosition}px`;
          typingIndicator.style.left = `${leftPosition}px`;
          typingIndicator.innerHTML = '思考中...'; // "Thinking..."
          document.body.appendChild(typingIndicator);
        }

        function hideTypingAnimation(element) {
          const typingIndicator = document.querySelector('.typing-indicator');
          if (typingIndicator) {
            typingIndicator.remove();
          }
          element.classList.remove('typing');
        }

        // Print the bot message character by character
        function printMessageText(message) {
          const chatContainer = document.getElementById('chat-container');
          const text = message.querySelector('.message-text');
          const textContent = text.textContent;
          text.textContent = '';
          for (let i = 0; i < textContent.length; i++) {
            setTimeout(() => {
              text.textContent += textContent.charAt(i);
              chatContainer.scrollTop = chatContainer.scrollHeight;
            }, i * 10); // controls the printing speed
          }
        }

        async function handleLinkClick(link) {
          const content = await getPageContent(link);
          console.log(link);
          console.log(content);
          // "总结中:" = "Summarizing:"
          const userMessage = createMessageElement("总结中:" + link, 'user-message', "message-text-user", "avatar-user");
          const chatContainer = document.getElementById('chat-container');
          // Attach the message first so the typing indicator can be positioned below it
          chatContainer.appendChild(userMessage);
          showTypingAnimation(userMessage);
          const response = await generateText("总结内容:" + content);
          hideTypingAnimation(userMessage);
          if (response.status === 'success') {
            const botResponse = createMessageElement(response.response, 'bot-message', "message-text", "avatar-bot");
            chatContainer.appendChild(botResponse);
            printMessageText(botResponse);
          } else {
            alert(response.message);
          }
        }

        // Clear the server-side history, then reload the page
        async function del() {
          await fetch(SERVER_URL + `/clear`, { method: 'POST' });
          location.replace("/");
        }

        // Fetch the text and image links of a page through the server
        async function getPageContent(url) {
          const response = await fetch(SERVER_URL + `/page_content?url=${encodeURIComponent(url)}`, {
            method: 'GET'
          });
          return await response.text();
        }
      </script>
    </body>
    </html>
3.2 ai_search.py

    from flask import Flask, render_template, request, jsonify
    from http import HTTPStatus
    from openai import OpenAI
    import mechanicalsoup
    from bs4 import BeautifulSoup
    from flask_cors import CORS
    from urllib.parse import urlparse, parse_qs, quote

    app = Flask(__name__)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    CORS(app)

    history = []  # in-memory chat history shared by all requests


    def crawl_pages(query_text, page_num=2):
        browser = mechanicalsoup.Browser()
        query_text_encoded = quote(query_text)  # percent-encode the keyword for the URL
        results = []
        for page_index in range(1, page_num + 1):
            url = f"https://search.cctv.com/search.php?qtext={query_text_encoded}&type=web&page={page_index}"
            page = browser.get(url)
            soup = BeautifulSoup(page.text, 'html.parser')
            # Result links are the <a> tags whose id starts with "web_content_"
            web_content_links = soup.find_all('a', id=lambda x: x and x.startswith('web_content_'))
            for link in web_content_links:
                # The real page address is carried in the "targetpage" query parameter
                target_page = parse_qs(urlparse(link['href']).query).get('targetpage', [None])[0]
                results.append({'title': link.text, 'url': target_page})
        return results


    def get_openai_chat_completion(messages, temperature, model="gpt-3.5-turbo-1106"):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
        )
        return response


    def generate_text(prompt, temperature=0.5):
        messages = [
            {
                "role": "user",
                "content": prompt,
            }
        ]
        response = get_openai_chat_completion(messages=messages, temperature=temperature)
        generated_text = response.choices[0].message.content
        # Add the user input and the model output to the history
        history.append({"user": prompt, "bot": generated_text})
        return {"status": "success", "response": generated_text}


    @app.route('/', methods=['GET'])
    def index():
        chat_history = history
        return render_template('ai_search.html', history=chat_history)


    @app.route('/generate-text', methods=['POST'])
    def generate_text_api():
        prompt = request.json['prompt']
        result = generate_text(prompt)
        return jsonify(result)


    @app.route('/clear', methods=['POST'])
    def clear():
        global history
        history = []
        return '', HTTPStatus.NO_CONTENT


    @app.route('/search', methods=['GET', 'POST'])
    def search():
        if request.method == 'POST':
            keyword = request.form['keyword']
        elif request.method == 'GET':
            keyword = request.args.get('keyword', '')
        else:
            keyword = ''
        results = crawl_pages(keyword)
        # Build one <li> per result; clicking it calls handleLinkClick(url) in the page
        output = ""
        for result in results:
            output += f"<li><a id='myID' href='javascript:void(0);' onclick='handleLinkClick(\"{result['url']}\")'>{result['title']}</a></li><br>"
        return output


    @app.route('/page_content')
    def page_content():
        url = request.args.get('url', '')
        if not url:
            return '缺少 url 参数'  # "missing url parameter"
        browser = mechanicalsoup.Browser()
        page = browser.get(url)
        page.encoding = 'utf-8'  # force the page encoding to utf-8
        soup = BeautifulSoup(page.text, 'html.parser')
        all_text = ''
        all_images = []
        # Collect the text of all paragraph, heading, and span elements
        for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'span']):
            all_text += element.get_text() + ' '
        # Collect all image links; prefixing "https:" assumes
        # protocol-relative src values such as "//host/pic.jpg"
        for img in soup.find_all('img'):
            img_src = img.get('src')
            if img_src:
                all_images.append("https:" + img_src)
        # 文本内容 = text content, 图片链接 = image links
        return f"文本内容: {all_text}<br>图片链接: {', '.join(all_images)}"


    if __name__ == '__main__':
        app.run(debug=True)
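3.3 Running
Run ai_search.py and open the link printed in the terminal (Flask's development server listens on http://127.0.0.1:5000 by default).
3.4 Dependencies you may need to install

    pip install Flask -i https://pypi.tuna.tsinghua.edu.cn/simple
    pip install mechanicalsoup -i https://pypi.tuna.tsinghua.edu.cn/simple
    pip install Jinja2
    pip install openai flask-cors

The last line covers the openai and flask_cors imports in ai_search.py. OpenAI() also expects the OPENAI_API_KEY environment variable to be set.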
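3.5 The HTML must be loaded through Jinja2, not opened directly
If you open the HTML file directly in a browser, the page displays incorrectly: the Jinja2 tags such as {% for item in history %} are never processed and show up as raw text.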
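4. Summary
In this article we wrote an AI + search engine from scratch. The overall principle is fairly simple: the search is a fixed URL plus the keyword, and the titles and URLs scraped from the result pages are the search results. As for text summarization, there is not much to add here, since earlier articles covered it in detail with hands-on practice.
The example is simple but reasonably complete. It can serve as a quick start for similar projects, a base on which to build your own prototype quickly.
Try running it yourself. Along the way you will probably come up with ideas for improving it.
If you do, feel free to share them so we can discuss.
If you found this article helpful, please give it a like and a follow.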