【LangChain系列】第十篇:数据保护简介及实践

本文涉及的产品
交互式建模 PAI-DSW,每月250计算时 3个月
模型在线服务 PAI-EAS,A10/V100等 500元 1个月
模型训练 PAI-DLC,100CU*H 3个月
简介: 【5月更文挑战第24天】本文探讨了在使用大型语言模型时保护个人数据的重要性,特别是涉及敏感信息如PII(个人可识别信息)的情况。为了降低数据泄露风险,文章介绍了数据匿名化的概念,通过在数据进入LLM前替换敏感信息。重点讲解了Microsoft的Presidio库,它提供了一个可定制的文本匿名化工具。此外,文章还展示了如何结合LangChain库创建一个安全的匿名化流水线,包括初始化匿名器、添加自定义识别器和操作符,以及在问答系统中集成匿名化流程。通过这种方式,可以在利用LLMs的同时保护数据隐私。

[toc]


今天看来,数据隐私非常重要,尤其是在使用大型语言模型(LLMs)和敏感信息时。公司和个人经常需要使用私人数据,比如个人可识别信息(PII),用于他们的LLM应用程序中。然而,数据泄露和隐私侵犯的风险是一个持续的威胁,这使得实施数据保护措施变得必要。

在这篇博客文章中,我们将探讨保护私人数据的解决方案,用于构建使用LLMs的问答系统。我们将深入探讨数据匿名化的概念,这涉及在将数据提供给LLM之前,用合成数据或占位符替换敏感信息。通过使用微软的LangChain和Presidio库,我们可以为我们的LLM应用程序创建一个安全且可定制的匿名化流水线。

一、什么是PII

个人可识别信息(PII)指的是任何可以用来识别、联系或定位个人的数据。PII可分为两种类型:
image.png

  • 关联信息:这包括直接标识符,如电子邮件地址、电话号码、社会安全号码和护照号码。
  • 可关联信息:这指的是间接信息,可以与其他数据结合以识别个人。例如,年龄、职业和位置的组合可能唯一地识别某人。

保护个人可识别信息免受潜在数据泄露或未经授权访问是至关重要的,因为后果可能严重,包括身份盗用、金融诈骗和法律责任。

二、如何保护数据

image.png

在使用OpenAI或Anthropic等外部API时,我们的数据可能存在泄霏或存储一定时间(如30天)的风险。即使我们托管自己的LLM实例,仍然存在数据泄露或在训练过程中模型记住敏感信息的风险。为了避免这些风险,我们有两个主要选择:

托管自己的LLM:这使我们能够将数据保留在本地,但可能成本高昂,并且可用模型可能无法与GPT-4o或其他最先进的LLM的性能匹配。
image.png

在将数据提供给LLM之前进行匿名化:通过用占位符或合成数据替换敏感信息,我们可以在使用外部LLM或API时保护我们的隐私数据。在本博客文章中,我们将专注于第二个选项:使用LangChain和Presidio进行数据匿名化。

三、Presidio介绍

Presidio是Microsoft开源的一个库,为文本数据提供了强大且可定制的匿名化工具。它由两个主要组件组成:
image.png

  • 分析器:此组件使用内置模式、正则表达式和命名实体识别模型,识别和识别文本中的PII实体。

  • 匿名化器:此组件用占位符、标记或合成数据替换识别的PII实体。

Presidio提供高度可定制性,允许我们添加自定义识别器和运算符来处理特定数据格式或要求。

四、使用LangChain和Presidio进行匿名化

让我们深入代码,探索如何将Presidio与LangChain集成,创建一个具有数据匿名化功能的安全问答系统。

image.png

1.初始化匿名器

首先,我们需要从LangChain初始化PresidioReversibleAnonymizer,它提供了在匿名化后恢复原始数据的功能。

from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer
import re

anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=False,
)

2.匿名化数据

现在,我们可以通过用占位符或标记替换已识别的PII实体来对文本进行匿名化处理:

from langchain_core.documents import Document

def print_colored_pii(string):
    colored_string = re.sub(
        r"(<[^>]*>)", lambda m: "\033[31m" + m.group(1) + "\033[0m", string
    )
    print(colored_string)

document_content = """Date: October 19, 2021
 Witness: John Doe
 Subject: Testimony Regarding the Loss of Wallet

 Testimony Content:

 Hello Officer,

 My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.

 Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.

 Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532.

 What's more, I had my polish identity card there, with the number ABC123456.

 I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.

 In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, johndoe@example.com.

 Please consider this information to be highly confidential and respect my privacy.

 The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, support@bankname.com.
 My representative there is Victoria Cherry (her business phone: 987-654-3210).

 Thank you for your assistance,

 John Doe"""

documents = [Document(page_content=document_content)]

anonymized_text = anonymizer.anonymize(document_content)
print_colored_pii(anonymized_text)
Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number <CREDIT_CARD>, which is registered under my name and linked to my bank account, <IBAN_CODE>.

Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>. 

What's more, I had my polish identity card there, with the number ABC123456.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at <DATE_TIME_2>.

In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or through my personal email, <EMAIL_ADDRESS>.

Please consider this information to be highly confidential

这将输出带有标记的匿名文本,如、等,替换敏感信息。

from pprint import pprint
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},
 'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021', '<DATE_TIME_2>': '9:30 AM'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',
                   '<EMAIL_ADDRESS_2>': 'support@bankname.com'},
 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
 'LOCATION': {'<LOCATION>': 'Kilmarnock'},
 'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
 'UK_NHS': {'<UK_NHS>': '987-654-3210'},
 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
 'US_SSN': {'<US_SSN>': '602-76-4532'}}

3.添加自动识别器

由于 Presidio 默认情况下可能无法识别某些 PII 实体,我们可以添加自定义识别器来处理特定的数据格式。例如,让我们为波兰身份证号码和时间格式创建识别器:

from presidio_analyzer import Pattern, PatternRecognizer

polish_id_pattern = Pattern(
    name="polish_id_pattern",
    regex="[A-Z]{3}\d{6}",
    score=1,
)
time_pattern = Pattern(
    name="time_pattern",
    regex="(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)",
    score=1,
)

polish_id_recognizer = PatternRecognizer(
    supported_entity="POLISH_ID", patterns=[polish_id_pattern]
)
time_recognizer = PatternRecognizer(supported_entity="TIME", patterns=[time_pattern])

anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)

# Note that our anonymization instance remembers previously detected and anonymized values, 
# including those that were not detected correctly (e.g., "9:30 AM" taken as DATE_TIME). So it's worth removing this value, or resetting the entire mapping now that our recognizers have been updated:
anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))
Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number <CREDIT_CARD>, which is registered under my name and linked to my bank account, <IBAN_CODE>.

Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>. 

What's more, I had my polish identity card there, with the number <POLISH_ID>.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at <TIME>.

In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or through my personal email, <EMAIL_ADDRESS>.

Please consider this information to be highly confidential and respect my privacy. 

The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, <EMAIL_ADDRESS_2>.
My representative there is <PERSON_2> (her business phone: <UK_NHS>).

Thank you for your assistance,

<PERSON>
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},
 'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',
                   '<EMAIL_ADDRESS_2>': 'support@bankname.com'},
 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
 'LOCATION': {'<LOCATION>': 'Kilmarnock'},
 'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
 'POLISH_ID': {'<POLISH_ID>': 'ABC123456'},
 'TIME': {'<TIME>': '9:30 AM'},
 'UK_NHS': {'<UK_NHS>': '987-654-3210'},
 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
 'US_SSN': {'<US_SSN>': '602-76-4532'}}

我们的新识别器工作正常。匿名器已将时间和波兰身份证实体替换为<TIME><POLISH_ID>标记,并且去匿名器映射也已相应更新。

4. 综合原始值

现在,当正确检测到所有 PII 值时,我们可以继续下一步,即用合成值替换原始值。为此,我们需要设置add_default_faker_operators=True(或者只是删除此参数,因为它True默认设置为):

anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=True,
    # Faker seed is used here to make sure the same fake data is generated for the test purposes
    # In production, it is recommended to remove the faker_seed parameter (it will default to None)
    faker_seed=42,
)

anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)

print_colored_pii(anonymizer.anonymize(document_content))
Date: 1986-04-18
Witness: Brian Cox DVM
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is Brian Cox DVM and on 1986-04-18, my wallet was stolen in the vicinity of New Rita during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number 6584801845146275, which is registered under my name and linked to my bank account, GB78GSWK37672423884969.

Additionally, the wallet had a driver's license - DL No: 781802744 issued to my name. It also houses my Social Security Number, 687-35-1170. 

What's more, I had my polish identity card there, with the number [31m<POLISH_ID>[0m.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at [31m<TIME>[0m.

In case any information arises regarding my wallet, please reach out to me on my phone number, 7344131647, or through my personal email, jamesmichael@example.com.

Please consider this information to be highly confidential and respect my privacy. 

The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, blakeerik@example.com.
My representative there is Cristian Santos (her business phone: 2812140441).

Thank you for your assistance,

Brian Cox DVM

如您所见,几乎所有值都已被替换为合成值。唯一的例外是波兰身份证号码和时间,默认的伪造者操作员不支持这些值。我们可以向匿名器添加新的操作员,这将生成随机数据。

5.添加自定义运算符(可选)

虽然使用占位符或标记是一种有效的方法,但通常最好用合成数据替换 PII 实体,以提高 LLM 的性能。我们可以向匿名器添加自定义运算符,以生成特定实体类型的合成数据:

from faker import Faker
from presidio_anonymizer.entities import OperatorConfig

fake = Faker()

def fake_polish_id(_=None):
    """ Example output: 'VTC592627'""" 
    return fake.bothify(text="???######").upper() 

def fake_time(_=None):
    """ Example output: '03:14 PM'""" 
    return fake.time(pattern="%I:%M %p")

new_operators = {
   
   
    "POLISH_ID": OperatorConfig("custom", {
   
   "lambda": fake_polish_id}),
    "TIME": OperatorConfig("custom", {
   
   "lambda": fake_time}),
}

anonymizer.add_operators(new_operators)

现在,当我们将文本匿名化时,波兰 ID 和时间实体将被替换为由我们的自定义操作员生成的合成数据。

anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))
Date: 1974-12-26
Witness: Jimmy Murillo
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is Jimmy Murillo and on 1974-12-26, my wallet was stolen in the vicinity of South Dianeshire during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number 213108121913614, which is registered under my name and linked to my bank account, GB17DBUR01326773602606.

Additionally, the wallet had a driver's license - DL No: 532311310 issued to my name. It also houses my Social Security Number, 690-84-1613. 

What's more, I had my polish identity card there, with the number UFB745084.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at 11:54 AM.

In case any information arises regarding my wallet, please reach out to me on my phone number, 876.931.1656, or through my personal email, briannasmith@example.net.

Please consider this information to be highly confidential and respect my privacy. 

The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, samuel87@example.org.
My representative there is Joshua Blair (her business phone: 3361388464).

Thank you for your assistance,

Jimmy Murillo
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'213108121913614': '4111 1111 1111 1111'},
 'DATE_TIME': {'1974-12-26': 'October 19, 2021'},
 'EMAIL_ADDRESS': {'briannasmith@example.net': 'johndoe@example.com',
                   'samuel87@example.org': 'support@bankname.com'},
 'IBAN_CODE': {'GB17DBUR01326773602606': 'PL61109010140000071219812874'},
 'LOCATION': {'South Dianeshire': 'Kilmarnock'},
 'PERSON': {'Jimmy Murillo': 'John Doe', 'Joshua Blair': 'Victoria Cherry'},
 'PHONE_NUMBER': {'876.931.1656': '999-888-7777'},
 'POLISH_ID': {'UFB745084': 'ABC123456'},
 'TIME': {'11:54 AM': '9:30 AM'},
 'UK_NHS': {'3361388464': '987-654-3210'},
 'US_DRIVER_LICENSE': {'532311310': '999000680'},
 'US_SSN': {'690-84-1613': '602-76-4532'}}

现在所有值都替换为合成值。请注意,去匿名器映射已相应更新。

6. 建立问答系统

设置匿名器后,我们可以使用 LangChain 将其集成到问答系统中:

from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and anonymize the data
documents = [Document(page_content=document_content)]
for doc in documents:
    doc.page_content = anonymizer.anonymize(doc.page_content)

# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

# Index the chunks (using OpenAI embeddings, since the data is already anonymized)
embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_documents(chunks, embeddings)
retriever = docsearch.as_retriever()

# Create the anonymizer chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI

template = """Answer the question based only on the following context: {context}
Question: {anonymized_question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0.3)
_inputs = RunnableParallel(
    question=RunnablePassthrough(),
    anonymized_question=RunnableLambda(anonymizer.anonymize),
)
anonymizer_chain = (
    _inputs
    | {
   
   
        "context": itemgetter("anonymized_question") | retriever,
        "anonymized_question": itemgetter("anonymized_question"),
    }
    | prompt
    | model
    | StrOutputParser()
)
anonymizer_chain.invoke("Where did the theft of the wallet occur, at what time, and who was it stolen from?")
# 'The theft of the wallet occurred in the vicinity of New Rita during a bike trip. It was stolen from Brian Cox DVM. The time of the theft was 02:22 AM.'

# Add deanonymization step to the chain
chain_with_deanonymization = anonymizer_chain | RunnableLambda(anonymizer.deanonymize)

print(chain_with_deanonymization.invoke("Where did the theft of the wallet occur, at what time, and who was it stolen from?"))
# The theft of the wallet occurred in the vicinity of Kilmarnock during a bike trip. It was stolen from John Doe. The time of the theft was 9:30 AM.
print(chain_with_deanonymization.invoke("What was the content of the wallet in detail?"))
# The content of the wallet included a credit card with the number 4111 1111 1111 1111, registered under the name of John Doe and linked to the bank account PL61109010140000071219812874. It also contained a driver's license with the number 999000680 issued to John Doe, as well as his Social Security Number 602-76-4532. Additionally, the wallet had a Polish identity card with the number ABC123456.
print(chain_with_deanonymization.invoke("Whose phone number is it: 999-888-7777?"))
# The phone number 999-888-7777 belongs to John Doe.

五、局部嵌入

如果您希望以原始形式对数据进行索引或使用自定义嵌入,则可以采用替代方法,即在索引后对上下文进行匿名化:

from operator import itemgetter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_core.prompts import format_document
from langchain_core.prompts.prompt import PromptTemplate

model_name = "BAAI/bge-base-en-v1.5"
encode_kwargs = {
   
   "normalize_embeddings": True}
local_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, encode_kwargs=encode_kwargs, query_instruction="Represent this sentence for searching relevant passages:",
)

documents = [Document(page_content=document_content)]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

docsearch = FAISS.from_documents(chunks, local_embeddings)
retriever = docsearch.as_retriever()

DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")

def _combine_documents(docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

# Anonymize the context after retrieval
chain_with_deanonymization = (
    RunnableParallel({
   
   "question": RunnablePassthrough()})
    | {
   
   
        "context": itemgetter("question")
        | retriever
        | _combine_documents
        | anonymizer.anonymize,
        "anonymized_question": lambda x: anonymizer.anonymize(x["question"]),
    }
    | prompt
    | model
    | StrOutputParser()
    | RunnableLambda(anonymizer.deanonymize)
)

该方法从矢量数据库中检索原始上下文,即时将其匿名化,然后执行与之前相同的匿名化和去匿名化步骤。

小结

在这篇博文中,我们探讨了在使用 LLM 构建问答系统时保护私人数据的解决方案。通过使用 LangChain 和 Presidio 库,我们可以创建一个安全且可定制的匿名化管道,在将敏感信息输入到 LLM 之前,用占位符或合成数据替换敏感信息。我们介绍了 PII、数据匿名化和 Presidio 库的基本概念。然后我们详细介绍了实现细节,包括初始化匿名器、添加自定义识别器和运算符,以及使用 LangChain 将匿名化过程集成到问答系统中。通过采用这种方法,我们可以确保我们的私人数据受到保护,同时仍然受益于最先进的法学硕士和外部 API 的功能。

小编是一名热爱人工智能的专栏作者,致力于分享人工智能领域的最新知识、技术和趋势。这里,你将能够了解到人工智能的最新应用和创新,探讨人工智能对未来社会的影响,以及探索人工智能背后的科学原理和技术实现。欢迎大家点赞,评论,收藏,让我们一起探索人工智能的奥秘,共同见证科技的进步!

相关实践学习
阿里云百炼xAnalyticDB PostgreSQL构建AIGC应用
通过该实验体验在阿里云百炼中构建企业专属知识库构建及应用全流程。同时体验使用ADB-PG向量检索引擎提供专属安全存储,保障企业数据隐私安全。
AnalyticDB PostgreSQL 企业智能数据中台:一站式管理数据服务资产
企业在数据仓库之上可构建丰富的数据服务用以支持数据应用及业务场景;ADB PG推出全新企业智能数据平台,用以帮助用户一站式的管理企业数据服务资产,包括创建, 管理,探索, 监控等; 助力企业在现有平台之上快速构建起数据服务资产体系
目录
相关文章
|
2天前
|
数据采集 供应链 搜索推荐
商业案例 I AllData数据中台商业版落地实践
杭州奥零数据科技有限公司成立于2023年,专注于数据中台业务,维护开源项目AllData并提供商业版解决方案。AllData提供数据集成、存储、开发、治理及BI展示等一站式服务,支持AI大模型应用,助力企业高效利用数据价值。
|
1月前
|
JSON 数据可视化 NoSQL
基于LLM Graph Transformer的知识图谱构建技术研究:LangChain框架下转换机制实践
本文介绍了LangChain的LLM Graph Transformer框架,探讨了文本到图谱转换的双模式实现机制。基于工具的模式利用结构化输出和函数调用,简化了提示工程并支持属性提取;基于提示的模式则为不支持工具调用的模型提供了备选方案。通过精确定义图谱模式(包括节点类型、关系类型及其约束),显著提升了提取结果的一致性和可靠性。LLM Graph Transformer为非结构化数据的结构化表示提供了可靠的技术方案,支持RAG应用和复杂查询处理。
122 2
基于LLM Graph Transformer的知识图谱构建技术研究:LangChain框架下转换机制实践
|
4月前
|
自然语言处理 搜索推荐 机器人
langchain 简介
langchain 简介
147 1
|
4月前
|
机器学习/深度学习 自然语言处理 搜索推荐
LangChain在个性化内容生成中的实践
【8月更文第3天】随着人工智能技术的发展,个性化内容生成已经成为许多应用的核心竞争力。LangChain 是一种开源框架,旨在简化语言模型的应用开发,尤其是针对自然语言处理任务。本文将探讨 LangChain 如何帮助开发者根据用户的偏好生成定制化的内容,从挑战到实践策略,再到具体的案例分析和技术实现。
276 1
|
7月前
|
机器学习/深度学习 人工智能
【LangChain系列】第九篇:LLM 应用评估简介及实践
【5月更文挑战第23天】本文探讨了如何评估复杂且精密的语言模型(LLMs)应用。通过创建QA应用程序,如使用GPT-3.5-Turbo模型,然后构建测试数据,包括手动创建和使用LLM生成示例。接着,通过手动评估、调试及LLM辅助评估来衡量性能。手动评估借助langchain.debug工具提供执行细节,而QAEvalChain则利用LLM的语义理解能力进行评分。这些方法有助于优化和提升LLM应用程序的准确性和效率。
586 8
|
7月前
|
存储 机器学习/深度学习 人工智能
【LangChain系列】第八篇:文档问答简介及实践
【5月更文挑战第22天】本文探讨了如何使用大型语言模型(LLM)进行文档问答,通过结合LLM与外部数据源提高灵活性。 LangChain库被介绍为简化这一过程的工具,它涵盖了嵌入、向量存储和不同类型的检索问答链,如Stuff、Map-reduce、Refine和Map-rerank。文章通过示例展示了如何使用LLM从CSV文件中提取信息并以Markdown格式展示
320 2
|
7月前
|
机器学习/深度学习 自然语言处理 数据挖掘
【LangChain系列】第七篇:工作流(链)简介及实践
【5月更文挑战第21天】LangChain是一个框架,利用“链”的概念将复杂的任务分解为可管理的部分,便于构建智能应用。数据科学家可以通过组合不同组件来处理和分析非结构化数据。示例中展示了如何使用LLMChain结合OpenAI的GPT-3.5-turbo模型,创建提示模板以生成公司名称和描述。顺序链(SimpleSequentialChain和SequentialChain)则允许按顺序执行多个步骤,处理多个输入和输出
1030 1
|
7月前
|
存储 人工智能 搜索推荐
【LangChain系列】第六篇:内存管理简介及实践
【5月更文挑战第20天】【LangChain系列】第六篇:内存管理简介及实践
209 0
【LangChain系列】第六篇:内存管理简介及实践
|
7月前
|
Shell Android开发
Android系统 adb shell push/pull 禁止特定文件
Android系统 adb shell push/pull 禁止特定文件
603 1
|
7月前
|
Android开发 Python
Python封装ADB获取Android设备wifi地址的方法
Python封装ADB获取Android设备wifi地址的方法
166 0

热门文章

最新文章

下一篇
DataWorks