【LangChain系列】第十篇:数据保护简介及实践

本文涉及的产品
模型在线服务 PAI-EAS,A10/V100等 500元 1个月
交互式建模 PAI-DSW,5000CU*H 3个月
模型训练 PAI-DLC,5000CU*H 3个月
简介: 【5月更文挑战第24天】本文探讨了在使用大型语言模型时保护个人数据的重要性,特别是涉及敏感信息如PII(个人可识别信息)的情况。为了降低数据泄露风险,文章介绍了数据匿名化的概念,通过在数据进入LLM前替换敏感信息。重点讲解了Microsoft的Presidio库,它提供了一个可定制的文本匿名化工具。此外,文章还展示了如何结合LangChain库创建一个安全的匿名化流水线,包括初始化匿名器、添加自定义识别器和操作符,以及在问答系统中集成匿名化流程。通过这种方式,可以在利用LLMs的同时保护数据隐私。

[toc]


今天看来,数据隐私非常重要,尤其是在使用大型语言模型(LLMs)和敏感信息时。公司和个人经常需要使用私人数据,比如个人可识别信息(PII),用于他们的LLM应用程序中。然而,数据泄露和隐私侵犯的风险是一个持续的威胁,这使得实施数据保护措施变得必要。

在这篇博客文章中,我们将探讨保护私人数据的解决方案,用于构建使用LLMs的问答系统。我们将深入探讨数据匿名化的概念,这涉及在将数据提供给LLM之前,用合成数据或占位符替换敏感信息。通过使用微软的LangChain和Presidio库,我们可以为我们的LLM应用程序创建一个安全且可定制的匿名化流水线。

一、什么是PII

个人可识别信息(PII)指的是任何可以用来识别、联系或定位个人的数据。PII可分为两种类型:
image.png

  • 关联信息:这包括直接标识符,如电子邮件地址、电话号码、社会安全号码和护照号码。
  • 可关联信息:这指的是间接信息,可以与其他数据结合以识别个人。例如,年龄、职业和位置的组合可能唯一地识别某人。

保护个人可识别信息免受潜在数据泄露或未经授权访问是至关重要的,因为后果可能严重,包括身份盗用、金融诈骗和法律责任。

二、如何保护数据

image.png

在使用OpenAI或Anthropic等外部API时,我们的数据可能存在泄霏或存储一定时间(如30天)的风险。即使我们托管自己的LLM实例,仍然存在数据泄露或在训练过程中模型记住敏感信息的风险。为了避免这些风险,我们有两个主要选择:

托管自己的LLM:这使我们能够将数据保留在本地,但可能成本高昂,并且可用模型可能无法与GPT-4o或其他最先进的LLM的性能匹配。
image.png

在将数据提供给LLM之前进行匿名化:通过用占位符或合成数据替换敏感信息,我们可以在使用外部LLM或API时保护我们的隐私数据。在本博客文章中,我们将专注于第二个选项:使用LangChain和Presidio进行数据匿名化。

三、Presidio介绍

Presidio是Microsoft开源的一个库,为文本数据提供了强大且可定制的匿名化工具。它由两个主要组件组成:
image.png

  • 分析器:此组件使用内置模式、正则表达式和命名实体识别模型,识别和识别文本中的PII实体。

  • 匿名化器:此组件用占位符、标记或合成数据替换识别的PII实体。

Presidio提供高度可定制性,允许我们添加自定义识别器和运算符来处理特定数据格式或要求。

四、使用LangChain和Presidio进行匿名化

让我们深入代码,探索如何将Presidio与LangChain集成,创建一个具有数据匿名化功能的安全问答系统。

image.png

1.初始化匿名器

首先,我们需要从LangChain初始化PresidioReversibleAnonymizer,它提供了在匿名化后恢复原始数据的功能。

from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer
import re

anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=False,
)

2.匿名化数据

现在,我们可以通过用占位符或标记替换已识别的PII实体来对文本进行匿名化处理:

from langchain_core.documents import Document

def print_colored_pii(string):
    colored_string = re.sub(
        r"(<[^>]*>)", lambda m: "\033[31m" + m.group(1) + "\033[0m", string
    )
    print(colored_string)

document_content = """Date: October 19, 2021
 Witness: John Doe
 Subject: Testimony Regarding the Loss of Wallet

 Testimony Content:

 Hello Officer,

 My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.

 Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.

 Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532.

 What's more, I had my polish identity card there, with the number ABC123456.

 I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.

 In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, johndoe@example.com.

 Please consider this information to be highly confidential and respect my privacy.

 The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, support@bankname.com.
 My representative there is Victoria Cherry (her business phone: 987-654-3210).

 Thank you for your assistance,

 John Doe"""

documents = [Document(page_content=document_content)]

anonymized_text = anonymizer.anonymize(document_content)
print_colored_pii(anonymized_text)
Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number <CREDIT_CARD>, which is registered under my name and linked to my bank account, <IBAN_CODE>.

Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>. 

What's more, I had my polish identity card there, with the number ABC123456.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at <DATE_TIME_2>.

In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or through my personal email, <EMAIL_ADDRESS>.

Please consider this information to be highly confidential

这将输出带有标记的匿名文本,如、等,替换敏感信息。

from pprint import pprint
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},
 'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021', '<DATE_TIME_2>': '9:30 AM'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',
                   '<EMAIL_ADDRESS_2>': 'support@bankname.com'},
 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
 'LOCATION': {'<LOCATION>': 'Kilmarnock'},
 'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
 'UK_NHS': {'<UK_NHS>': '987-654-3210'},
 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
 'US_SSN': {'<US_SSN>': '602-76-4532'}}

3.添加自动识别器

由于 Presidio 默认情况下可能无法识别某些 PII 实体,我们可以添加自定义识别器来处理特定的数据格式。例如,让我们为波兰身份证号码和时间格式创建识别器:

from presidio_analyzer import Pattern, PatternRecognizer

polish_id_pattern = Pattern(
    name="polish_id_pattern",
    regex="[A-Z]{3}\d{6}",
    score=1,
)
time_pattern = Pattern(
    name="time_pattern",
    regex="(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)",
    score=1,
)

polish_id_recognizer = PatternRecognizer(
    supported_entity="POLISH_ID", patterns=[polish_id_pattern]
)
time_recognizer = PatternRecognizer(supported_entity="TIME", patterns=[time_pattern])

anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)

# Note that our anonymization instance remembers previously detected and anonymized values, 
# including those that were not detected correctly (e.g., "9:30 AM" taken as DATE_TIME). So it's worth removing this value, or resetting the entire mapping now that our recognizers have been updated:
anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))
Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number <CREDIT_CARD>, which is registered under my name and linked to my bank account, <IBAN_CODE>.

Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>. 

What's more, I had my polish identity card there, with the number <POLISH_ID>.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at <TIME>.

In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or through my personal email, <EMAIL_ADDRESS>.

Please consider this information to be highly confidential and respect my privacy. 

The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, <EMAIL_ADDRESS_2>.
My representative there is <PERSON_2> (her business phone: <UK_NHS>).

Thank you for your assistance,

<PERSON>
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},
 'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',
                   '<EMAIL_ADDRESS_2>': 'support@bankname.com'},
 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
 'LOCATION': {'<LOCATION>': 'Kilmarnock'},
 'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
 'POLISH_ID': {'<POLISH_ID>': 'ABC123456'},
 'TIME': {'<TIME>': '9:30 AM'},
 'UK_NHS': {'<UK_NHS>': '987-654-3210'},
 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
 'US_SSN': {'<US_SSN>': '602-76-4532'}}

我们的新识别器工作正常。匿名器已将时间和波兰身份证实体替换为<TIME><POLISH_ID>标记,并且去匿名器映射也已相应更新。

4. 综合原始值

现在,当正确检测到所有 PII 值时,我们可以继续下一步,即用合成值替换原始值。为此,我们需要设置add_default_faker_operators=True(或者只是删除此参数,因为它True默认设置为):

anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=True,
    # Faker seed is used here to make sure the same fake data is generated for the test purposes
    # In production, it is recommended to remove the faker_seed parameter (it will default to None)
    faker_seed=42,
)

anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)

print_colored_pii(anonymizer.anonymize(document_content))
Date: 1986-04-18
Witness: Brian Cox DVM
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is Brian Cox DVM and on 1986-04-18, my wallet was stolen in the vicinity of New Rita during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number 6584801845146275, which is registered under my name and linked to my bank account, GB78GSWK37672423884969.

Additionally, the wallet had a driver's license - DL No: 781802744 issued to my name. It also houses my Social Security Number, 687-35-1170. 

What's more, I had my polish identity card there, with the number [31m<POLISH_ID>[0m.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at [31m<TIME>[0m.

In case any information arises regarding my wallet, please reach out to me on my phone number, 7344131647, or through my personal email, jamesmichael@example.com.

Please consider this information to be highly confidential and respect my privacy. 

The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, blakeerik@example.com.
My representative there is Cristian Santos (her business phone: 2812140441).

Thank you for your assistance,

Brian Cox DVM

如您所见,几乎所有值都已被替换为合成值。唯一的例外是波兰身份证号码和时间,默认的伪造者操作员不支持这些值。我们可以向匿名器添加新的操作员,这将生成随机数据。

5.添加自定义运算符(可选)

虽然使用占位符或标记是一种有效的方法,但通常最好用合成数据替换 PII 实体,以提高 LLM 的性能。我们可以向匿名器添加自定义运算符,以生成特定实体类型的合成数据:

from faker import Faker
from presidio_anonymizer.entities import OperatorConfig

fake = Faker()

def fake_polish_id(_=None):
    """ Example output: 'VTC592627'""" 
    return fake.bothify(text="???######").upper() 

def fake_time(_=None):
    """ Example output: '03:14 PM'""" 
    return fake.time(pattern="%I:%M %p")

new_operators = {
   
   
    "POLISH_ID": OperatorConfig("custom", {
   
   "lambda": fake_polish_id}),
    "TIME": OperatorConfig("custom", {
   
   "lambda": fake_time}),
}

anonymizer.add_operators(new_operators)

现在,当我们将文本匿名化时,波兰 ID 和时间实体将被替换为由我们的自定义操作员生成的合成数据。

anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))
Date: 1974-12-26
Witness: Jimmy Murillo
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is Jimmy Murillo and on 1974-12-26, my wallet was stolen in the vicinity of South Dianeshire during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number 213108121913614, which is registered under my name and linked to my bank account, GB17DBUR01326773602606.

Additionally, the wallet had a driver's license - DL No: 532311310 issued to my name. It also houses my Social Security Number, 690-84-1613. 

What's more, I had my polish identity card there, with the number UFB745084.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at 11:54 AM.

In case any information arises regarding my wallet, please reach out to me on my phone number, 876.931.1656, or through my personal email, briannasmith@example.net.

Please consider this information to be highly confidential and respect my privacy. 

The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, samuel87@example.org.
My representative there is Joshua Blair (her business phone: 3361388464).

Thank you for your assistance,

Jimmy Murillo
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'213108121913614': '4111 1111 1111 1111'},
 'DATE_TIME': {'1974-12-26': 'October 19, 2021'},
 'EMAIL_ADDRESS': {'briannasmith@example.net': 'johndoe@example.com',
                   'samuel87@example.org': 'support@bankname.com'},
 'IBAN_CODE': {'GB17DBUR01326773602606': 'PL61109010140000071219812874'},
 'LOCATION': {'South Dianeshire': 'Kilmarnock'},
 'PERSON': {'Jimmy Murillo': 'John Doe', 'Joshua Blair': 'Victoria Cherry'},
 'PHONE_NUMBER': {'876.931.1656': '999-888-7777'},
 'POLISH_ID': {'UFB745084': 'ABC123456'},
 'TIME': {'11:54 AM': '9:30 AM'},
 'UK_NHS': {'3361388464': '987-654-3210'},
 'US_DRIVER_LICENSE': {'532311310': '999000680'},
 'US_SSN': {'690-84-1613': '602-76-4532'}}

现在所有值都替换为合成值。请注意,去匿名器映射已相应更新。

6. 建立问答系统

设置匿名器后,我们可以使用 LangChain 将其集成到问答系统中:

from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and anonymize the data
documents = [Document(page_content=document_content)]
for doc in documents:
    doc.page_content = anonymizer.anonymize(doc.page_content)

# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

# Index the chunks (using OpenAI embeddings, since the data is already anonymized)
embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_documents(chunks, embeddings)
retriever = docsearch.as_retriever()

# Create the anonymizer chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI

template = """Answer the question based only on the following context: {context}
Question: {anonymized_question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0.3)
_inputs = RunnableParallel(
    question=RunnablePassthrough(),
    anonymized_question=RunnableLambda(anonymizer.anonymize),
)
anonymizer_chain = (
    _inputs
    | {
   
   
        "context": itemgetter("anonymized_question") | retriever,
        "anonymized_question": itemgetter("anonymized_question"),
    }
    | prompt
    | model
    | StrOutputParser()
)
anonymizer_chain.invoke("Where did the theft of the wallet occur, at what time, and who was it stolen from?")
# 'The theft of the wallet occurred in the vicinity of New Rita during a bike trip. It was stolen from Brian Cox DVM. The time of the theft was 02:22 AM.'

# Add deanonymization step to the chain
chain_with_deanonymization = anonymizer_chain | RunnableLambda(anonymizer.deanonymize)

print(chain_with_deanonymization.invoke("Where did the theft of the wallet occur, at what time, and who was it stolen from?"))
# The theft of the wallet occurred in the vicinity of Kilmarnock during a bike trip. It was stolen from John Doe. The time of the theft was 9:30 AM.
print(chain_with_deanonymization.invoke("What was the content of the wallet in detail?"))
# The content of the wallet included a credit card with the number 4111 1111 1111 1111, registered under the name of John Doe and linked to the bank account PL61109010140000071219812874. It also contained a driver's license with the number 999000680 issued to John Doe, as well as his Social Security Number 602-76-4532. Additionally, the wallet had a Polish identity card with the number ABC123456.
print(chain_with_deanonymization.invoke("Whose phone number is it: 999-888-7777?"))
# The phone number 999-888-7777 belongs to John Doe.

五、局部嵌入

如果您希望以原始形式对数据进行索引或使用自定义嵌入,则可以采用替代方法,即在索引后对上下文进行匿名化:

from operator import itemgetter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_core.prompts import format_document
from langchain_core.prompts.prompt import PromptTemplate

model_name = "BAAI/bge-base-en-v1.5"
encode_kwargs = {
   
   "normalize_embeddings": True}
local_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, encode_kwargs=encode_kwargs, query_instruction="Represent this sentence for searching relevant passages:",
)

documents = [Document(page_content=document_content)]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

docsearch = FAISS.from_documents(chunks, local_embeddings)
retriever = docsearch.as_retriever()

DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")

def _combine_documents(docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

# Anonymize the context after retrieval
chain_with_deanonymization = (
    RunnableParallel({
   
   "question": RunnablePassthrough()})
    | {
   
   
        "context": itemgetter("question")
        | retriever
        | _combine_documents
        | anonymizer.anonymize,
        "anonymized_question": lambda x: anonymizer.anonymize(x["question"]),
    }
    | prompt
    | model
    | StrOutputParser()
    | RunnableLambda(anonymizer.deanonymize)
)

该方法从矢量数据库中检索原始上下文,即时将其匿名化,然后执行与之前相同的匿名化和去匿名化步骤。

小结

在这篇博文中,我们探讨了在使用 LLM 构建问答系统时保护私人数据的解决方案。通过使用 LangChain 和 Presidio 库,我们可以创建一个安全且可定制的匿名化管道,在将敏感信息输入到 LLM 之前,用占位符或合成数据替换敏感信息。我们介绍了 PII、数据匿名化和 Presidio 库的基本概念。然后我们详细介绍了实现细节,包括初始化匿名器、添加自定义识别器和运算符,以及使用 LangChain 将匿名化过程集成到问答系统中。通过采用这种方法,我们可以确保我们的私人数据受到保护,同时仍然受益于最先进的法学硕士和外部 API 的功能。

小编是一名热爱人工智能的专栏作者,致力于分享人工智能领域的最新知识、技术和趋势。这里,你将能够了解到人工智能的最新应用和创新,探讨人工智能对未来社会的影响,以及探索人工智能背后的科学原理和技术实现。欢迎大家点赞,评论,收藏,让我们一起探索人工智能的奥秘,共同见证科技的进步!

相关实践学习
阿里云百炼xAnalyticDB PostgreSQL构建AIGC应用
通过该实验体验在阿里云百炼中构建企业专属知识库构建及应用全流程。同时体验使用ADB-PG向量检索引擎提供专属安全存储,保障企业数据隐私安全。
AnalyticDB PostgreSQL 企业智能数据中台:一站式管理数据服务资产
企业在数据仓库之上可构建丰富的数据服务用以支持数据应用及业务场景;ADB PG推出全新企业智能数据平台,用以帮助用户一站式的管理企业数据服务资产,包括创建, 管理,探索, 监控等; 助力企业在现有平台之上快速构建起数据服务资产体系
目录
相关文章
|
2月前
|
机器学习/深度学习 人工智能
【LangChain系列】第九篇:LLM 应用评估简介及实践
【5月更文挑战第23天】本文探讨了如何评估复杂且精密的语言模型(LLMs)应用。通过创建QA应用程序,如使用GPT-3.5-Turbo模型,然后构建测试数据,包括手动创建和使用LLM生成示例。接着,通过手动评估、调试及LLM辅助评估来衡量性能。手动评估借助langchain.debug工具提供执行细节,而QAEvalChain则利用LLM的语义理解能力进行评分。这些方法有助于优化和提升LLM应用程序的准确性和效率。
358 8
|
2月前
|
存储 机器学习/深度学习 人工智能
【LangChain系列】第八篇:文档问答简介及实践
【5月更文挑战第22天】本文探讨了如何使用大型语言模型(LLM)进行文档问答,通过结合LLM与外部数据源提高灵活性。 LangChain库被介绍为简化这一过程的工具,它涵盖了嵌入、向量存储和不同类型的检索问答链,如Stuff、Map-reduce、Refine和Map-rerank。文章通过示例展示了如何使用LLM从CSV文件中提取信息并以Markdown格式展示
119 2
|
2月前
|
机器学习/深度学习 自然语言处理 数据挖掘
【LangChain系列】第七篇:工作流(链)简介及实践
【5月更文挑战第21天】LangChain是一个框架,利用“链”的概念将复杂的任务分解为可管理的部分,便于构建智能应用。数据科学家可以通过组合不同组件来处理和分析非结构化数据。示例中展示了如何使用LLMChain结合OpenAI的GPT-3.5-turbo模型,创建提示模板以生成公司名称和描述。顺序链(SimpleSequentialChain和SequentialChain)则允许按顺序执行多个步骤,处理多个输入和输出
417 1
|
2月前
|
存储 人工智能 搜索推荐
【LangChain系列】第六篇:内存管理简介及实践
【5月更文挑战第20天】【LangChain系列】第六篇:内存管理简介及实践
100 0
【LangChain系列】第六篇:内存管理简介及实践
|
2月前
|
存储 机器学习/深度学习 人工智能
【LangChain系列】第一篇:文档加载简介及实践
【5月更文挑战第14天】 LangChain提供80多种文档加载器,简化了从PDF、网站、YouTube视频和Notion等多来源加载与标准化数据的过程。这些加载器将不同格式的数据转化为标准文档对象,便于机器学习工作流程中的数据处理。文中介绍了非结构化、专有和结构化数据的加载示例,包括PDF、YouTube视频、网站和Notion数据库的加载方法。通过LangChain,用户能轻松集成和交互各类数据源,加速智能应用的开发。
199 1
|
2月前
|
机器学习/深度学习 人工智能 自然语言处理
【LangChain系列】第五篇:大语言模型中的提示词,模型及输出简介及实践
【5月更文挑战第19天】LangChain是一个Python库,简化了与大型语言模型(LLM)如GPT-3.5-turbo的交互。通过ChatOpenAI类,开发者可以创建确定性输出的应用。提示词是指导LLM执行任务的关键,ChatPromptTemplate允许创建可重用的提示模板。输出解析器如StructuredOutputParser将模型的响应转化为结构化数据,便于应用处理。LangChain提供可重用性、一致性、可扩展性,并有一系列预建功能。它使得利用LLM构建复杂、直观的应用变得更加容易。
150 0
|
2月前
|
存储 人工智能 数据库
【LangChain系列】第四篇:向量数据库与嵌入简介及实践
【5月更文挑战第18天】 本文介绍了构建聊天机器人和语义搜索的关键组件——向量存储和嵌入。首先,文章描述了工作流程,包括文档拆分、生成嵌入和存储在向量数据库中。接着,通过Python代码展示了如何设置环境并处理文档,以及如何创建和比较文本嵌入。向量存储部分,文章使用Chroma存储嵌入,并进行了相似性检索的演示。最后,讨论了故障模式,如重复文档和未捕获结构化信息的问题。整个博文中,作者强调了在实际应用中解决这些问题的重要性。
237 0
|
2月前
|
人工智能 自然语言处理 API
【LangChain系列】第三篇:Agent代理简介及实践
【5月更文挑战第17天】LangChain代理利用大型语言模型(LLM)作为推理引擎,结合各种工具和数据库,处理复杂任务和决策。这些代理能理解和生成人类语言,访问外部信息,并结合LLM进行推理。文章介绍了如何通过LangChain构建代理,包括集成DuckDuckGo搜索和维基百科,以及创建Python REPL工具执行编程任务。此外,还展示了如何构建自定义工具,如获取当前日期的示例,强调了LangChain的灵活性和可扩展性,为LLM的应用开辟了新途径。
184 0
|
2月前
|
测试技术 API 数据库
【LangChain系列】第二篇:文档拆分简介及实践
【5月更文挑战第15天】 本文介绍了LangChain中文档拆分的重要性及工作原理。文档拆分有助于保持语义内容的完整性,对于依赖上下文的任务尤其关键。LangChain提供了多种拆分器,如CharacterTextSplitter、RecursiveCharacterTextSplitter和TokenTextSplitter,分别适用于不同场景。MarkdownHeaderTextSplitter则能根据Markdown标题结构进行拆分,保留文档结构。通过实例展示了如何使用这些拆分器,强调了选择合适拆分器对提升下游任务性能和准确性的影响。
205 0
|
2月前
|
Shell Android开发
Android系统 adb shell push/pull 禁止特定文件
Android系统 adb shell push/pull 禁止特定文件
149 1