可控细节的长文档摘要,探索开源LLM工具与实践

简介: 本文通过将文档分为几部分来解决这个问题,然后分段生成摘要。在对大语言模型进行多次查询后,可以重建完整的摘要。通过控制文本块的数量及其大小,我们最终可以控制输出中的细节级别。

 本文的想法来自今年OpenAI cookbook的一篇实践:summarizing_long_documents,目标是演示如何以可控的细节程度总结大型文档。

如果我们想让大语言模型总结一份长文档(例如 10k 或更多tokens),但是直接输入大语言模型往往会得到一个相对较短的摘要,该摘要与文档的长度并不成比例。例如,20k tokens的文档的摘要不会是 10k tokens的文档摘要的两倍长。本文通过将文档分为几部分来解决这个问题,然后分段生成摘要。在对大语言模型进行多次查询后,可以重建完整的摘要。通过控制文本块的数量及其大小,我们最终可以控制输出中的细节级别。

本文使用的工具和模型如下:

大语言模型:Qwen2的GGUF格式模型

工具1:Ollama,将大语言模型GGUF部署成OpenAI格式的API

工具2:transformers,使用transformers的新功能,直接加载GGUF格式模型的tokenizer,用于文档长度查询和分段。

最佳实践

运行Qwen2模型(详见《魔搭社区GGUF模型怎么玩!看这篇就够了》

复制模型路径,创建名为“ModelFile”的meta文件,内容如下:

FROM /mnt/workspace/qwen2-7b-instruct-q5_k_m.gguf
# set the temperature to 0.7 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""
# set the system message
SYSTEM """
You are a helpful assistant.
"""

image.gif

使用ollama create命令创建自定义模型并运行

ollama create myqwen2 --file ./ModelFile
ollama run myqwen2

image.gif

安装依赖&读取需要总结的文档

import os
from typing import List, Tuple, Optional
from openai import OpenAI
from transformers import AutoTokenizer
from tqdm import tqdm
# load doc
with open("data/artificial_intelligence_wikipedia.txt", "r") as file:
    artificial_intelligence_wikipedia_text = file.read()

image.gif

加载encoding并检查文档长度

HuggingFace的transformers 支持加载GGUF单文件格式,以便为 gguf 模型提供进一步的训练/微调功能,然后再将这些模型转换回生态系统gguf中使用ggml,GGUF文件通常包含配置属性,tokenizer,以及其他的属性,以及要加载到模型的所有tensor,参考文档:https://huggingface.co/docs/transformers/gguf 

目前支持的模型架构为:llama,mistral,qwen2

# load encoding and check the length of dataset
encoding = AutoTokenizer.from_pretrained("/mnt/workspace/cherry/",gguf_file="qwen2-7b-instruct-q5_k_m.gguf")
len(encoding.encode(artificial_intelligence_wikipedia_text))

image.gif

调用LLM的OpenAI格式的API

client = OpenAI(
    base_url = 'http://127.0.0.1:11434/v1',
    api_key='ollama', # required, but unused
)
def get_chat_completion(messages, model='myqwen2'):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

image.gif

文档拆解

我们定义了一些函数,将大文档分成较小的部分。

def tokenize(text: str) -> List[str]:    
    return encoding.encode(text)
# This function chunks a text into smaller pieces based on a maximum token count and a delimiter.
def chunk_on_delimiter(input_string: str,
                       max_tokens: int, delimiter: str) -> List[str]:
    chunks = input_string.split(delimiter)
    combined_chunks, _, dropped_chunk_count = combine_chunks_with_no_minimum(
        chunks, max_tokens, chunk_delimiter=delimiter, add_ellipsis_for_overflow=True
    )
    if dropped_chunk_count > 0:
        print(f"warning: {dropped_chunk_count} chunks were dropped due to overflow")
    combined_chunks = [f"{chunk}{delimiter}" for chunk in combined_chunks]
    return combined_chunks
# This function combines text chunks into larger blocks without exceeding a specified token count. It returns the combined text blocks, their original indices, and the count of chunks dropped due to overflow.
def combine_chunks_with_no_minimum(
        chunks: List[str],
        max_tokens: int,
        chunk_delimiter="\n\n",
        header: Optional[str] = None,
        add_ellipsis_for_overflow=False,
) -> Tuple[List[str], List[int]]:
    dropped_chunk_count = 0
    output = []  # list to hold the final combined chunks
    output_indices = []  # list to hold the indices of the final combined chunks
    candidate = (
        [] if header is None else [header]
    )  # list to hold the current combined chunk candidate
    candidate_indices = []
    for chunk_i, chunk in enumerate(chunks):
        chunk_with_header = [chunk] if header is None else [header, chunk]
        if len(tokenize(chunk_delimiter.join(chunk_with_header))) > max_tokens:
            print(f"warning: chunk overflow")
            if (
                    add_ellipsis_for_overflow
                    and len(tokenize(chunk_delimiter.join(candidate + ["..."]))) <= max_tokens
            ):
                candidate.append("...")
                dropped_chunk_count += 1
            continue  # this case would break downstream assumptions
        # estimate token count with the current chunk added
        extended_candidate_token_count = len(tokenize(chunk_delimiter.join(candidate + [chunk])))
        # If the token count exceeds max_tokens, add the current candidate to output and start a new candidate
        if extended_candidate_token_count > max_tokens:
            output.append(chunk_delimiter.join(candidate))
            output_indices.append(candidate_indices)
            candidate = chunk_with_header  # re-initialize candidate
            candidate_indices = [chunk_i]
        # otherwise keep extending the candidate
        else:
            candidate.append(chunk)
            candidate_indices.append(chunk_i)
    # add the remaining candidate to output if it's not empty
    if (header is not None and len(candidate) > 1) or (header is None and len(candidate) > 0):
        output.append(chunk_delimiter.join(candidate))
        output_indices.append(candidate_indices)
    return output, output_indices, dropped_chunk_count

image.gif

摘要函数

现在我们可以定义一个实用程序来以可控的细节级别总结文本(注意参数detail)。

该函数首先根据可控参数在最小和最大块数之间进行插值来确定块数detail。然后,它将文本拆分成块并对每个块进行总结。

def summarize(text: str,
              detail: float = 0,
              model: str = 'myqwen2',
              additional_instructions: Optional[str] = None,
              minimum_chunk_size: Optional[int] = 500,
              chunk_delimiter: str = "\n",
              summarize_recursively=False,
              verbose=False):
    """
    Summarizes a given text by splitting it into chunks, each of which is summarized individually. 
    The level of detail in the summary can be adjusted, and the process can optionally be made recursive.
    Parameters:
    - text (str): The text to be summarized.
    - detail (float, optional): A value between 0 and 1 indicating the desired level of detail in the summary.
      0 leads to a higher level summary, and 1 results in a more detailed summary. Defaults to 0.
    - model (str, optional): The model to use for generating summaries. Defaults to 'gpt-3.5-turbo'.
    - additional_instructions (Optional[str], optional): Additional instructions to provide to the model for customizing summaries.
    - minimum_chunk_size (Optional[int], optional): The minimum size for text chunks. Defaults to 500.
    - chunk_delimiter (str, optional): The delimiter used to split the text into chunks. Defaults to ".".
    - summarize_recursively (bool, optional): If True, summaries are generated recursively, using previous summaries for context.
    - verbose (bool, optional): If True, prints detailed information about the chunking process.
    Returns:
    - str: The final compiled summary of the text.
    The function first determines the number of chunks by interpolating between a minimum and a maximum chunk count based on the `detail` parameter. 
    It then splits the text into chunks and summarizes each chunk. If `summarize_recursively` is True, each summary is based on the previous summaries, 
    adding more context to the summarization process. The function returns a compiled summary of all chunks.
    """
    # check detail is set correctly
    assert 0 <= detail <= 1
    # interpolate the number of chunks based to get specified level of detail
    max_chunks = len(chunk_on_delimiter(text, minimum_chunk_size, chunk_delimiter))
    min_chunks = 1
    num_chunks = int(min_chunks + detail * (max_chunks - min_chunks))
    # adjust chunk_size based on interpolated number of chunks
    document_length = len(tokenize(text))
    chunk_size = max(minimum_chunk_size, document_length // num_chunks)
    text_chunks = chunk_on_delimiter(text, chunk_size, chunk_delimiter)
    if verbose:
        print(f"Splitting the text into {len(text_chunks)} chunks to be summarized.")
        print(f"Chunk lengths are {[len(tokenize(x)) for x in text_chunks]}")
    # set system message
    system_message_content = "Rewrite this text in summarized form."
    if additional_instructions is not None:
        system_message_content += f"\n\n{additional_instructions}"
    accumulated_summaries = []
    for chunk in tqdm(text_chunks):
        if summarize_recursively and accumulated_summaries:
            # Creating a structured prompt for recursive summarization
            accumulated_summaries_string = '\n\n'.join(accumulated_summaries)
            user_message_content = f"Previous summaries:\n\n{accumulated_summaries_string}\n\nText to summarize next:\n\n{chunk}"
        else:
            # Directly passing the chunk for summarization without recursive context
            user_message_content = chunk
        # Constructing messages based on whether recursive summarization is applied
        messages = [
            {"role": "system", "content": system_message_content},
            {"role": "user", "content": user_message_content}
        ]
        # Assuming this function gets the completion and works as expected
        response = get_chat_completion(messages, model=model)
        accumulated_summaries.append(response)
    # Compile final summary from partial summaries
    final_summary = '\n\n'.join(accumulated_summaries)
    return final_summary

image.gif

 

现在,我们可以使用此实用程序生成具有不同详细程度的摘要。通过detail从 0 增加到 1,我们可以逐渐获得更长的底层文档摘要。参数值越高,detail摘要越详细,因为实用程序首先将文档拆分为更多块。然后对每个块进行汇总,最终摘要是所有块摘要的串联。

summary_with_detail_0 = summarize(artificial_intelligence_wikipedia_text, detail=0, verbose=True)


image.gif

summary_with_detail_pt25 = summarize(artificial_intelligence_wikipedia_text, detail=0.25, verbose=True)


image.gif

此实用程序还允许传递附加指令。

summary_with_additional_instructions = summarize(artificial_intelligence_wikipedia_text, detail=0.1,
                                                 additional_instructions="Write in point form and focus on numerical data.")
print(summary_with_additional_instructions)

image.gif

最后,请注意,该实用程序允许递归汇总,其中每个摘要都基于先前的摘要,从而为汇总过程添加更多上下文。可以通过将参数设置summarize_recursively为 True 来启用此功能。这在计算上更昂贵,但可以提高组合摘要的一致性和连贯性。

recursive_summary = summarize(artificial_intelligence_wikipedia_text, detail=0.1, summarize_recursively=True)
print(recursive_summary)

image.gif


相关文章
|
2月前
|
Kubernetes 搜索推荐 API
|
2月前
|
机器学习/深度学习 人工智能 自然语言处理
【LLM】能够运行在移动端的轻量级大语言模型Gemma实践
【4月更文挑战第12天】可以运行在移动端的开源大语言模型Gemma模型介绍
166 0
|
15天前
|
人工智能 自然语言处理 算法
LLM主流开源代表模型(二)
随着ChatGPT迅速火爆,引发了大模型的时代变革,国内外各大公司也快速跟进生成式AI市场,近百款大模型发布及应用。
|
15天前
|
机器学习/深度学习 人工智能 自然语言处理
LLM主流开源代表模型(一)
随着ChatGPT迅速火爆,引发了大模型的时代变革,国内外各大公司也快速跟进生成式AI市场,近百款大模型发布及应用。
|
10天前
|
存储 人工智能 机器人
LangChain结合LLM做私有化文档搜索
我们知道LLM(大语言模型)的底模是基于已经过期的公开数据训练出来的,对于新的知识或者私有化的数据LLM一般无法作答,此时LLM会出现“幻觉”。针对“幻觉”问题,一般的解决方案是采用RAG做检索增强。
LangChain结合LLM做私有化文档搜索
|
2月前
|
机器学习/深度学习 缓存 算法
LLM 大模型学习必知必会系列(十二):VLLM性能飞跃部署实践:从推理加速到高效部署的全方位优化[更多内容:XInference/FastChat等框架]
LLM 大模型学习必知必会系列(十二):VLLM性能飞跃部署实践:从推理加速到高效部署的全方位优化[更多内容:XInference/FastChat等框架]
LLM 大模型学习必知必会系列(十二):VLLM性能飞跃部署实践:从推理加速到高效部署的全方位优化[更多内容:XInference/FastChat等框架]
|
2月前
|
机器学习/深度学习 人工智能
【LangChain系列】第九篇:LLM 应用评估简介及实践
【5月更文挑战第23天】本文探讨了如何评估复杂且精密的语言模型(LLMs)应用。通过创建QA应用程序,如使用GPT-3.5-Turbo模型,然后构建测试数据,包括手动创建和使用LLM生成示例。接着,通过手动评估、调试及LLM辅助评估来衡量性能。手动评估借助langchain.debug工具提供执行细节,而QAEvalChain则利用LLM的语义理解能力进行评分。这些方法有助于优化和提升LLM应用程序的准确性和效率。
339 8
|
2月前
|
缓存 人工智能 自然语言处理
LLM 大模型学习必知必会系列(三):LLM和多模态模型高效推理实践
LLM 大模型学习必知必会系列(三):LLM和多模态模型高效推理实践
LLM 大模型学习必知必会系列(三):LLM和多模态模型高效推理实践
|
1月前
|
人工智能 自然语言处理 算法
分享几个.NET开源的AI和LLM相关项目框架
分享几个.NET开源的AI和LLM相关项目框架
|
2月前
|
人工智能 自然语言处理 算法
分享几个.NET开源的AI和LLM相关项目框架
分享几个.NET开源的AI和LLM相关项目框架