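Data privacy matters more than ever, especially when large language models (LLMs) are applied to sensitive information. Companies and individuals often need to use private data, such as personally identifiable information (PII), in their LLM applications. Data leaks and privacy violations remain a constant threat, however, so data protection measures are a necessity.
In this post we look at one way to protect private data when building a question-answering system on top of an LLM. We focus on data anonymization: replacing sensitive information with synthetic data or placeholders before it is handed to the LLM. Using LangChain together with Microsoft's Presidio library, we can build a secure and customizable anonymization pipeline for an LLM application.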
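I. What Is PII
Personally identifiable information (PII) is any data that can be used to identify, contact, or locate an individual. PII falls into two types:
- Linked information: direct identifiers such as email addresses, phone numbers, Social Security numbers, and passport numbers.
- Linkable information: indirect information that can identify an individual when combined with other data. For example, a combination of age, occupation, and location may uniquely identify someone.
Protecting PII against data leaks and unauthorized access is essential, because the consequences can be severe: identity theft, financial fraud, and legal liability.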
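To make the idea of linkable information concrete, here is a minimal sketch using only the Python standard library. The records and the helper name `unique_combinations` are hypothetical and serve only as illustration: no single attribute isolates anyone, yet the combination of all three does.

```python
from collections import Counter

# Hypothetical toy records: no names or ID numbers, only indirect attributes.
records = [
    {"age": 34, "occupation": "nurse", "city": "Leeds"},
    {"age": 34, "occupation": "nurse", "city": "Leeds"},
    {"age": 34, "occupation": "teacher", "city": "Leeds"},
    {"age": 51, "occupation": "nurse", "city": "York"},
    {"age": 51, "occupation": "teacher", "city": "York"},
]


def unique_combinations(rows, keys):
    """Return the attribute combinations that occur exactly once in rows."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# No single attribute singles anyone out in this data set ...
print(unique_combinations(records, ["city"]))
# ... but three of the five people are unique once the attributes are combined.
print(unique_combinations(records, ["age", "occupation", "city"]))
```

This is why an anonymization pipeline usually has to consider indirect attributes as well as obvious identifiers.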
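II. How to Protect Data
When we call external APIs such as OpenAI or Anthropic, our data may be leaked or retained for a period of time (for example, 30 days). Even if we host our own LLM instance, the data can still leak, and the model can memorize sensitive information during training. To limit these risks, we have two main options:
- Host our own LLM: this keeps the data on our own infrastructure, but it can be expensive, and the available models may not match the performance of GPT-4o or other state-of-the-art LLMs.
- Anonymize the data before handing it to the LLM: by replacing sensitive information with placeholders or synthetic data, we can protect private data while still using an external LLM or API.
This post focuses on the second option: data anonymization with LangChain and Presidio.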
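III. Introducing Presidio
Presidio is an open-source library from Microsoft that provides powerful, customizable anonymization tools for text data. It has two main components:
- Analyzer: detects PII entities in text using built-in patterns, regular expressions, and named-entity recognition models.
- Anonymizer: replaces the detected PII entities with placeholders, tags, or synthetic data.
Presidio is highly customizable: we can add custom recognizers and operators to handle specific data formats or requirements.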
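The split between detecting and replacing can be sketched with the standard library alone. This is not Presidio's implementation: the `Finding` class, the two regular expressions, and the function names are assumptions made for illustration, and a real analyzer combines many more rules with an NER model.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    entity_type: str
    start: int
    end: int


# Illustrative patterns only (assumed for this sketch).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def analyze(text):
    """Stage 1: locate PII spans without modifying the text."""
    findings = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(Finding(entity_type, match.start(), match.end()))
    return findings


def anonymize(text, findings):
    """Stage 2: replace each span, right to left so earlier offsets stay valid."""
    for finding in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:finding.start] + f"<{finding.entity_type}>" + text[finding.end:]
    return text


sample = "Call 999-888-7777 or write to johndoe@example.com."
print(anonymize(sample, analyze(sample)))
# prints: Call <PHONE_NUMBER> or write to <EMAIL_ADDRESS>.
```

Keeping the two stages separate is what makes customization straightforward: new detectors only add findings, and new replacement strategies only change what is written into each span.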
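IV. Anonymization with LangChain and Presidio
Let's turn to the code and see how Presidio integrates with LangChain to build a question-answering system with data anonymization.
1. Initializing the anonymizer
First, we initialize LangChain's PresidioReversibleAnonymizer, which can restore the original data after anonymization. Setting add_default_faker_operators=False makes it emit tags such as <PERSON> instead of synthetic values.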
import re

from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=False,
)
2. Anonymizing the data
Now we can anonymize a text by replacing each detected PII entity with a placeholder tag:
from langchain_core.documents import Document

def print_colored_pii(string):
    """Print the text with <TAG> placeholders highlighted in red (ANSI escape codes)."""
    colored_string = re.sub(
        r"(<[^>]*>)", lambda m: "\033[31m" + m.group(1) + "\033[0m", string
    )
    print(colored_string)

document_content = """Date: October 19, 2021
Witness: John Doe
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.
Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532.
What's more, I had my polish identity card there, with the number ABC123456.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.
In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, johndoe@example.com.
Please consider this information to be highly confidential and respect my privacy.
The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, support@bankname.com.
My representative there is Victoria Cherry (her business phone: 987-654-3210).
Thank you for your assistance,
John Doe"""
documents = [Document(page_content=document_content)]
anonymized_text = anonymizer.anonymize(document_content)
print_colored_pii(anonymized_text)
Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number <CREDIT_CARD>, which is registered under my name and linked to my bank account, <IBAN_CODE>.
Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>.
What's more, I had my polish identity card there, with the number ABC123456.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at <DATE_TIME_2>.
In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or through my personal email, <EMAIL_ADDRESS>.
Please consider this information to be highly confidential
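This prints the anonymized text, with tags such as <PERSON> and <DATE_TIME> in place of the sensitive values. The anonymizer also records which original value each tag stands for, and we can inspect that mapping: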
from pprint import pprint
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},
 'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021', '<DATE_TIME_2>': '9:30 AM'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',
                   '<EMAIL_ADDRESS_2>': 'support@bankname.com'},
 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
 'LOCATION': {'<LOCATION>': 'Kilmarnock'},
 'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
 'UK_NHS': {'<UK_NHS>': '987-654-3210'},
 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
 'US_SSN': {'<US_SSN>': '602-76-4532'}}
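The mapping printed above suggests how reversibility works: each distinct value gets its own numbered tag, and the tag-to-value table is kept for later. Below is a toy stand-in written with the standard library. The class `ReversibleTagger` and its methods are hypothetical and are not the LangChain implementation; they only mirror the shape of the mapping.

```python
from collections import defaultdict


class ReversibleTagger:
    """Toy stand-in for a reversible anonymizer (assumed design, for illustration)."""

    def __init__(self):
        # entity type -> {tag: original value}, the same shape as the mapping above
        self.mapping = defaultdict(dict)

    def tag_for(self, entity_type, value):
        """Return the existing tag for value, or create the next numbered one."""
        known = self.mapping[entity_type]
        for tag, original in known.items():
            if original == value:
                return tag
        n = len(known) + 1
        tag = f"<{entity_type}>" if n == 1 else f"<{entity_type}_{n}>"
        known[tag] = value
        return tag

    def deanonymize(self, text):
        """Substitute every known tag with its original value."""
        for known in self.mapping.values():
            for tag, original in known.items():
                text = text.replace(tag, original)
        return text


tagger = ReversibleTagger()
a = tagger.tag_for("PERSON", "John Doe")
b = tagger.tag_for("PERSON", "Victoria Cherry")
c = tagger.tag_for("PERSON", "John Doe")  # a repeated value reuses its tag
print(a, b, c)
print(tagger.deanonymize(f"{b} represents {a}."))
```

Reusing the same tag for a repeated value matters for question answering, because it lets the model see that two mentions refer to the same person.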
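3. Adding custom recognizers
Presidio does not recognize every PII entity out of the box, so we can add custom recognizers for specific data formats. In the output above, the Polish identity card number was left untouched and the time 9:30 AM was labeled DATE_TIME. Let's create recognizers for Polish ID numbers and for times: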
from presidio_analyzer import Pattern, PatternRecognizer

# Raw strings keep regex escapes such as \d intact.
polish_id_pattern = Pattern(
    name="polish_id_pattern",
    regex=r"[A-Z]{3}\d{6}",
    score=1,
)
time_pattern = Pattern(
    name="time_pattern",
    regex=r"(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)",
    score=1,
)
polish_id_recognizer = PatternRecognizer(
    supported_entity="POLISH_ID", patterns=[polish_id_pattern]
)
time_recognizer = PatternRecognizer(supported_entity="TIME", patterns=[time_pattern])
anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)
# Note that our anonymization instance remembers previously detected and anonymized values,
# including those that were not detected correctly (e.g., "9:30 AM" taken as DATE_TIME). So it's worth removing this value, or resetting the entire mapping now that our recognizers have been updated:
anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))
Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number <CREDIT_CARD>, which is registered under my name and linked to my bank account, <IBAN_CODE>.
Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>.
What's more, I had my polish identity card there, with the number <POLISH_ID>.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at <TIME>.
In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or through my personal email, <EMAIL_ADDRESS>.
Please consider this information to be highly confidential and respect my privacy.
The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, <EMAIL_ADDRESS_2>.
My representative there is <PERSON_2> (her business phone: <UK_NHS>).
Thank you for your assistance,
<PERSON>
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},
 'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',
                   '<EMAIL_ADDRESS_2>': 'support@bankname.com'},
 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
 'LOCATION': {'<LOCATION>': 'Kilmarnock'},
 'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
 'POLISH_ID': {'<POLISH_ID>': 'ABC123456'},
 'TIME': {'<TIME>': '9:30 AM'},
 'UK_NHS': {'<UK_NHS>': '987-654-3210'},
 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
 'US_SSN': {'<US_SSN>': '602-76-4532'}}
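Our new recognizers work: the anonymizer now replaces the time and the Polish ID with <TIME> and <POLISH_ID> tags, and the deanonymizer mapping has been updated accordingly.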
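Before registering a pattern, it can help to try it on sample strings with Python's `re` module. The sketch below checks the two expressions used above and also shows a limitation: without word boundaries they fire inside longer tokens. The `strict_*` variants are a hypothetical hardening and are not part of the original recognizers.

```python
import re

# The two patterns from the recognizers above.
polish_id_re = re.compile(r"[A-Z]{3}\d{6}")
time_re = re.compile(r"(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)")

# Both match the values in the sample document.
assert polish_id_re.search("with the number ABC123456.").group() == "ABC123456"
assert time_re.search("It was stolen at 9:30 AM.").group() == "9:30 AM"

# Without boundaries, a pattern also matches inside a longer token.
print(polish_id_re.search("ref XABC1234567").group())  # ABC123456
print(time_re.search("at 19:30 AM").group())           # 9:30 AM

# Hypothetical stricter variants with word boundaries (an assumption, not from the post).
strict_id_re = re.compile(r"\b[A-Z]{3}\d{6}\b")
strict_time_re = re.compile(r"\b(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)\b")
print(strict_id_re.search("ref XABC1234567"))  # None
print(strict_time_re.search("at 19:30 AM"))    # None
```

Whether the stricter form is appropriate depends on the data; for free-form documents, a looser pattern with a lower score may be the safer trade-off.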
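4. Replacing original values with synthetic ones
Now that all PII values are detected correctly, we can move on to the next step: replacing the original values with synthetic ones. To do that, we set add_default_faker_operators=True (or simply omit the parameter, since True is its default):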
anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=True,
    # Faker seed is used here to make sure the same fake data is generated for the test purposes
    # In production, it is recommended to remove the faker_seed parameter (it will default to None)
    faker_seed=42,
)
anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)
print_colored_pii(anonymizer.anonymize(document_content))
Date: 1986-04-18
Witness: Brian Cox DVM
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is Brian Cox DVM and on 1986-04-18, my wallet was stolen in the vicinity of New Rita during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number 6584801845146275, which is registered under my name and linked to my bank account, GB78GSWK37672423884969.
Additionally, the wallet had a driver's license - DL No: 781802744 issued to my name. It also houses my Social Security Number, 687-35-1170.
What's more, I had my polish identity card there, with the number <POLISH_ID>.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at <TIME>.
In case any information arises regarding my wallet, please reach out to me on my phone number, 7344131647, or through my personal email, jamesmichael@example.com.
Please consider this information to be highly confidential and respect my privacy.
The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, blakeerik@example.com.
My representative there is Cristian Santos (her business phone: 2812140441).
Thank you for your assistance,
Brian Cox DVM
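As you can see, almost all values have been replaced with synthetic ones. The exceptions are the Polish ID number and the time, which the default Faker operators do not support. We can add new operators to the anonymizer that generate random data for them.
5. Adding custom operators (optional)
Placeholder tags are an effective approach, but replacing PII entities with synthetic data is often preferable, because realistic-looking text can help the LLM perform better. We can add custom operators to the anonymizer that generate synthetic data for specific entity types: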
from faker import Faker
from presidio_anonymizer.entities import OperatorConfig

fake = Faker()

def fake_polish_id(_=None):
    """Example output: 'VTC592627'"""
    return fake.bothify(text="???######").upper()

def fake_time(_=None):
    """Example output: '03:14 PM'"""
    return fake.time(pattern="%I:%M %p")

# Map each custom entity type to the callable that produces its replacement.
new_operators = {
    "POLISH_ID": OperatorConfig("custom", {"lambda": fake_polish_id}),
    "TIME": OperatorConfig("custom", {"lambda": fake_time}),
}
anonymizer.add_operators(new_operators)
Now, when we anonymize the text, the Polish ID and the time are replaced with synthetic data generated by our custom operators:
anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))
Date: 1974-12-26
Witness: Jimmy Murillo
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is Jimmy Murillo and on 1974-12-26, my wallet was stolen in the vicinity of South Dianeshire during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number 213108121913614, which is registered under my name and linked to my bank account, GB17DBUR01326773602606.
Additionally, the wallet had a driver's license - DL No: 532311310 issued to my name. It also houses my Social Security Number, 690-84-1613.
What's more, I had my polish identity card there, with the number UFB745084.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at 11:54 AM.
In case any information arises regarding my wallet, please reach out to me on my phone number, 876.931.1656, or through my personal email, briannasmith@example.net.
Please consider this information to be highly confidential and respect my privacy.
The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, samuel87@example.org.
My representative there is Joshua Blair (her business phone: 3361388464).
Thank you for your assistance,
Jimmy Murillo
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'213108121913614': '4111 1111 1111 1111'},
 'DATE_TIME': {'1974-12-26': 'October 19, 2021'},
 'EMAIL_ADDRESS': {'briannasmith@example.net': 'johndoe@example.com',
                   'samuel87@example.org': 'support@bankname.com'},
 'IBAN_CODE': {'GB17DBUR01326773602606': 'PL61109010140000071219812874'},
 'LOCATION': {'South Dianeshire': 'Kilmarnock'},
 'PERSON': {'Jimmy Murillo': 'John Doe', 'Joshua Blair': 'Victoria Cherry'},
 'PHONE_NUMBER': {'876.931.1656': '999-888-7777'},
 'POLISH_ID': {'UFB745084': 'ABC123456'},
 'TIME': {'11:54 AM': '9:30 AM'},
 'UK_NHS': {'3361388464': '987-654-3210'},
 'US_DRIVER_LICENSE': {'532311310': '999000680'},
 'US_SSN': {'690-84-1613': '602-76-4532'}}
All values are now replaced with synthetic ones, and the deanonymizer mapping has been updated accordingly.
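A custom operator is just a callable that returns a replacement string, and a seed makes the generated values repeatable for tests. The sketch below reproduces both ideas with the standard library instead of Faker. The factory `make_generators` and its inner functions are hypothetical names, and Python's `random` module is swapped in for Faker's `bothify` and `time`.

```python
import random
import string


def make_generators(seed=None):
    """Build synthetic-value generators; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)

    def fake_id(_=None):
        # Three uppercase letters followed by six digits, the shape of 'VTC592627'.
        letters = "".join(rng.choices(string.ascii_uppercase, k=3))
        digits = "".join(rng.choices(string.digits, k=6))
        return letters + digits

    def fake_time(_=None):
        # 12-hour clock with zero-padded hour and minute, the shape of '03:14 PM'.
        hour = rng.randint(1, 12)
        minute = rng.randint(0, 59)
        return f"{hour:02d}:{minute:02d} {rng.choice(['AM', 'PM'])}"

    return fake_id, fake_time


id_a, time_a = make_generators(seed=42)
id_b, time_b = make_generators(seed=42)
assert id_a() == id_b()      # same seed, same synthetic ID
assert time_a() == time_b()  # same seed, same synthetic time
print(id_a(), time_a())
```

As the comment in the anonymizer setup notes, a fixed seed is for testing; in production the seed is better left unset so that synthetic values are not predictable.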
6. Building the question-answering system
With the anonymizer set up, we can use LangChain to integrate it into a question-answering system:
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load and anonymize the data
documents = [Document(page_content=document_content)]
for doc in documents:
    doc.page_content = anonymizer.anonymize(doc.page_content)
# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
# Index the chunks (using OpenAI embeddings, since the data is already anonymized)
embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_documents(chunks, embeddings)
retriever = docsearch.as_retriever()
# Create the anonymizer chain
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI

template = """Answer the question based only on the following context: {context}
Question: {anonymized_question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0.3)
# Keep the raw question and its anonymized form side by side.
_inputs = RunnableParallel(
    question=RunnablePassthrough(),
    anonymized_question=RunnableLambda(anonymizer.anonymize),
)
# Only the anonymized question reaches the retriever, the prompt, and the model.
anonymizer_chain = (
    _inputs
    | {
        "context": itemgetter("anonymized_question") | retriever,
        "anonymized_question": itemgetter("anonymized_question"),
    }
    | prompt
    | model
    | StrOutputParser()
)
anonymizer_chain.invoke("Where did the theft of the wallet occur, at what time, and who was it stolen from?")
# 'The theft of the wallet occurred in the vicinity of New Rita during a bike trip. It was stolen from Brian Cox DVM. The time of the theft was 02:22 AM.'
# Add deanonymization step to the chain
chain_with_deanonymization = anonymizer_chain | RunnableLambda(anonymizer.deanonymize)
print(chain_with_deanonymization.invoke("Where did the theft of the wallet occur, at what time, and who was it stolen from?"))
# The theft of the wallet occurred in the vicinity of Kilmarnock during a bike trip. It was stolen from John Doe. The time of the theft was 9:30 AM.
print(chain_with_deanonymization.invoke("What was the content of the wallet in detail?"))
# The content of the wallet included a credit card with the number 4111 1111 1111 1111, registered under the name of John Doe and linked to the bank account PL61109010140000071219812874. It also contained a driver's license with the number 999000680 issued to John Doe, as well as his Social Security Number 602-76-4532. Additionally, the wallet had a Polish identity card with the number ABC123456.
print(chain_with_deanonymization.invoke("Whose phone number is it: 999-888-7777?"))
# The phone number 999-888-7777 belongs to John Doe.
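The essential property of this chain is that the model only ever sees synthetic values, while the user only ever sees original ones. The following sketch isolates that round trip with the standard library. `stub_model` is a hypothetical stand-in for the remote LLM, and the flat `mapping` dictionary is a simplified slice of the deanonymizer mapping shown earlier.

```python
def anonymize(text, mapping):
    """Replace each original value with its synthetic stand-in."""
    for fake, original in mapping.items():
        text = text.replace(original, fake)
    return text


def deanonymize(text, mapping):
    """Reverse direction: put the original values back."""
    for fake, original in mapping.items():
        text = text.replace(fake, original)
    return text


def stub_model(prompt):
    """Hypothetical stand-in for the remote LLM; it records what it was sent."""
    stub_model.seen.append(prompt)
    return "The wallet was stolen from Jimmy Murillo in South Dianeshire."


stub_model.seen = []

# synthetic value -> original value
mapping = {"Jimmy Murillo": "John Doe", "South Dianeshire": "Kilmarnock"}


def ask(question):
    """Anonymize on the way out, deanonymize on the way back."""
    answer = stub_model(anonymize(question, mapping))
    return deanonymize(answer, mapping)


print(ask("Was John Doe's wallet stolen in Kilmarnock?"))
# prints: The wallet was stolen from John Doe in Kilmarnock.
print(stub_model.seen[0])
# prints: Was Jimmy Murillo's wallet stolen in South Dianeshire?
```

Plain string replacement is enough for a sketch, but it is exact-match only: if the model rephrases a synthetic value (for example, "J. Murillo"), that value will not be mapped back.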
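V. Local Embeddings
If you prefer to index the data in its original form, or to use custom embeddings, there is an alternative approach: anonymize the context after retrieval instead of before indexing. Because the raw text is what gets embedded in this setup, the example uses an embedding model that runs locally: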
from operator import itemgetter

from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_core.prompts import format_document
from langchain_core.prompts.prompt import PromptTemplate

model_name = "BAAI/bge-base-en-v1.5"
encode_kwargs = {"normalize_embeddings": True}
local_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this sentence for searching relevant passages:",
)
# Index the raw (non-anonymized) document with the local embedding model
documents = [Document(page_content=document_content)]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
docsearch = FAISS.from_documents(chunks, local_embeddings)
retriever = docsearch.as_retriever()
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")

def _combine_documents(docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

# Anonymize the context after retrieval
chain_with_deanonymization = (
    RunnableParallel({"question": RunnablePassthrough()})
    | {
        "context": itemgetter("question")
        | retriever
        | _combine_documents
        | anonymizer.anonymize,
        "anonymized_question": lambda x: anonymizer.anonymize(x["question"]),
    }
    | prompt
    | model
    | StrOutputParser()
    | RunnableLambda(anonymizer.deanonymize)
)
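This approach retrieves the original context from the vector store, anonymizes it on the fly, and then runs the same prompt and deanonymization steps as before.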
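The ordering is the key difference from the previous section: the search runs on raw text with the raw question, and anonymization happens only on what was retrieved. Here is a toy illustration with the standard library. The word-overlap `retrieve` function is a hypothetical stand-in for the vector store, and the two-entry `mapping` is assumed for the example.

```python
import re

# Raw chunks, indexed as-is (a stand-in for the local vector store).
chunks = [
    "My name is John Doe and my wallet was stolen near Kilmarnock.",
    "The bank has been informed about the stolen credit card.",
]
mapping = {"John Doe": "<PERSON>", "Kilmarnock": "<LOCATION>"}  # original -> tag


def tokens(text):
    """Lowercased set of words, used for the toy similarity score."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question):
    """Toy retriever: return the raw chunk sharing the most words with the question."""
    return max(chunks, key=lambda chunk: len(tokens(chunk) & tokens(question)))


def anonymize(text):
    for original, tag in mapping.items():
        text = text.replace(original, tag)
    return text


question = "Where was John Doe's wallet stolen?"
context = anonymize(retrieve(question))  # search with raw text, then anonymize the hit
print(context)
# prints: My name is <PERSON> and my wallet was stolen near <LOCATION>.
print(anonymize(question))
# prints: Where was <PERSON>'s wallet stolen?
```

Only the last two strings would be placed into the prompt; the raw chunks and the raw question stay on the local side.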
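Summary
In this post we explored a way to protect private data when building a question-answering system with an LLM. Using LangChain and the Presidio library, we can create a secure and customizable anonymization pipeline that replaces sensitive information with placeholders or synthetic data before it reaches the LLM. We covered the basic concepts of PII, data anonymization, and Presidio, then walked through the implementation: initializing the anonymizer, adding custom recognizers and operators, and integrating the anonymization step into a question-answering system with LangChain. With this approach we can keep private data protected while still benefiting from state-of-the-art LLMs and external APIs.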