LangChain · February 21, 2026 · 9 min read

How to Protect PII in LangChain Pipelines

Add PII protection to any LangChain chain in two lines of code. Covers tokenization, RAG document transformers, compliance policies, and EU data residency with langchain-blindfold.

LangChain makes it remarkably easy to build AI-powered applications. Chain a prompt template to an LLM, add a retriever, wire up an agent — and you have a working pipeline in minutes. But every chain that processes user input carries a hidden risk: the personal data in that input gets sent directly to your LLM provider.

Names, email addresses, phone numbers, social security numbers — anything a user types into your chatbot, RAG pipeline, or agent workflow ends up in an API call to OpenAI, Anthropic, or whichever provider you use. That is a compliance problem under GDPR, HIPAA, and most other data protection frameworks. It is also a trust problem: your users expect you to handle their data responsibly.

This article shows you how to add PII protection to any LangChain chain using langchain-blindfold. You will learn how to wrap chains with automatic tokenization, protect documents in RAG pipelines, and apply compliance policies — all without changing your existing LangChain code.

The Problem with Unprotected Chains

Consider a typical LangChain chain that answers customer questions. This is the kind of code you will find in most tutorials:

python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful customer support assistant."),
    ("human", "{input}"),
])

llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm | StrOutputParser()

# User input goes straight to OpenAI
response = chain.invoke({
    "input": "My name is Sarah Chen, my email is sarah.chen@acme.com, "
             "and my account number is 4829-1038-2847. Why was I charged twice?"
})

When this chain runs, the entire user message — including Sarah's name, email, and account number — is sent to OpenAI's API in plaintext. OpenAI now has that data in their logs, even if only temporarily. You have no control over what happens to it.

This pattern repeats across every LangChain application that takes user input: chatbots, support agents, RAG pipelines over customer documents, summarization tools. If PII goes in, PII goes out to the provider.

Adding PII Protection with langchain-blindfold

The langchain-blindfold package integrates Blindfold's PII detection directly into LangChain's Runnable interface. Install it alongside the Blindfold Python SDK:

bash
pip install langchain-blindfold blindfold-sdk

Set your API key as an environment variable:

bash
export BLINDFOLD_API_KEY="your-api-key"

Now wrap your chain with PII protection using blindfold_protect():

python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_blindfold import blindfold_protect

# Create a paired tokenizer and detokenizer
tokenize, detokenize = blindfold_protect(policy="basic")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful customer support assistant."),
    ("human", "{input}"),
])

llm = ChatOpenAI(model="gpt-4o")

# Wrap the chain: tokenize input, run the chain, detokenize output
chain = tokenize | prompt | llm | StrOutputParser() | detokenize

# PII is now protected automatically
response = chain.invoke(
    "My name is Sarah Chen, my email is sarah.chen@acme.com, "
    "and my account number is 4829-1038-2847. Why was I charged twice?"
)

print(response)
# The response mentions "Sarah Chen", her email, and her account number
# normally, but OpenAI only ever saw <Person_1>, <Email Address_1>,
# and <Credit Card Number_1>.

Two lines of code. That is all it takes. The blindfold_protect() function returns a paired tokenizer and detokenizer. You place the tokenizer at the start of your chain and the detokenizer at the end. Everything in between — the prompt, the LLM call, the output parser — works exactly as before, but with tokenized text instead of real PII.

How It Works

Both BlindfoldTokenizer and BlindfoldDetokenizer are LangChain Runnable objects. They implement invoke() and can be composed with the pipe operator just like any other LangChain component. Here is what happens at each step of the chain:

  1. Tokenize. The BlindfoldTokenizer sends the input text to the Blindfold API. PII entities are detected and replaced with tokens like <Person_1>, <Email Address_1>, <Phone Number_1>. The mapping between tokens and original values is stored internally.
  2. Prompt + LLM. The tokenized text flows through your prompt template and into the LLM. The model sees tokens, not real data. It reasons about the text normally and uses the tokens in its response.
  3. Detokenize. The BlindfoldDetokenizer reads the stored mapping and replaces every token in the LLM's output with the original value. This is a local string replacement — no API call needed. The mapping is cleared after use.
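The three steps can be sketched in plain Python with a stubbed LLM. This is a hypothetical illustration only: the real PII detection runs in the Blindfold API, and the hard-coded mapping below stands in for it.

```python
# Illustrative round trip: tokenize -> LLM (stub) -> detokenize.
mapping = {}

def tokenize_text(text: str) -> str:
    # Stand-in for API-based detection: pretend "Sarah Chen" was detected
    mapping["<Person_1>"] = "Sarah Chen"
    return text.replace("Sarah Chen", "<Person_1>")

def stub_llm(text: str) -> str:
    # The model sees only tokens and echoes them in its answer
    return "Hello <Person_1>, I can help with that."

def detokenize_text(text: str) -> str:
    # Local string replacement, then clear the mapping
    for token, original in mapping.items():
        text = text.replace(token, original)
    mapping.clear()
    return text

answer = detokenize_text(stub_llm(tokenize_text("My name is Sarah Chen.")))
print(answer)  # Hello Sarah Chen, I can help with that.
```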

Here is a concrete example of what the LLM actually sees versus what the user sees:

What the user sends

python
"My name is Sarah Chen, my email is sarah.chen@acme.com, and my account number is 4829-1038-2847. Why was I charged twice?"

What the LLM receives (after tokenization)

python
"My name is <Person_1>, my email is <Email Address_1>, and my account number is <Credit Card Number_1>. Why was I charged twice?"

What the LLM responds

python
"I'm sorry to hear about the double charge, <Person_1>. I've looked into your account and can see the duplicate transaction. I'll process a refund to the card ending in <Credit Card Number_1> and send confirmation to <Email Address_1>."

What the user sees (after detokenization)

python
"I'm sorry to hear about the double charge, Sarah Chen. I've looked into your account and can see the duplicate transaction. I'll process a refund to the card ending in 4829-1038-2847 and send confirmation to sarah.chen@acme.com."

Key point: The LLM never sees any real personal data. It works entirely with tokens, which it treats as opaque placeholders. The quality of the response is unaffected because the model still understands the structure and context of the message.

Using the Runnables Directly

The blindfold_protect() helper is the easiest way to get started, but you can also instantiate BlindfoldTokenizer and BlindfoldDetokenizer directly for more control:

python
from langchain_blindfold import BlindfoldTokenizer, BlindfoldDetokenizer

# Fine-grained control over tokenizer settings
tokenizer = BlindfoldTokenizer(
    api_key="your-api-key",
    region="eu",
    policy="gdpr_eu",
    score_threshold=0.5,
)

detokenizer = BlindfoldDetokenizer(tokenizer=tokenizer)

# Use them in a chain
chain = tokenizer | prompt | llm | StrOutputParser() | detokenizer

# Or invoke them independently
tokenized_text = tokenizer.invoke("Contact me at john@example.com")
print(tokenized_text)
# "Contact me at <Email Address_1>"

# Check the stored mapping
print(tokenizer.get_mapping())
# {"<Email Address_1>": "john@example.com"}

The BlindfoldDetokenizer takes a reference to the tokenizer and reads its mapping to perform the reverse replacement. This is done locally — no additional API call is made for detokenization. The mapping is automatically cleared after each detokenization to prevent data leaking between requests.

Protecting Documents in RAG Pipelines

Chains are not the only place where PII leaks. In Retrieval-Augmented Generation (RAG) pipelines, documents loaded from databases, PDFs, or APIs often contain personal data. When those documents are embedded and stored in a vector store, the PII gets baked into your index. When they are retrieved and injected into prompts, the PII goes to the LLM.

The BlindfoldPIITransformer solves this by protecting PII in LangChain Document objects before they go into the vector store. It implements LangChain's BaseDocumentTransformer interface, so it drops in anywhere you use document transformers today.

python
from langchain_blindfold import BlindfoldPIITransformer
from langchain_core.documents import Document

# Create a transformer that tokenizes PII in documents
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
    policy="basic",
)

# Your documents from any loader
docs = [
    Document(
        page_content="Customer Sarah Chen (sarah.chen@acme.com) reported "
                     "an outage on Feb 10. Her phone: +1-555-234-5678.",
        metadata={"source": "tickets.csv", "row": 42},
    ),
    Document(
        page_content="Order #9912 for James Rivera, 742 Evergreen Terrace, "
                     "Springfield IL 62704. Payment via card ending 4821.",
        metadata={"source": "orders.csv", "row": 108},
    ),
]

# Transform: PII in page_content is replaced with tokens
safe_docs = transformer.transform_documents(docs)

for doc in safe_docs:
    print(doc.page_content)
    print(doc.metadata)
    print()

# Output:
# Customer <Person_1> (<Email Address_1>) reported an outage on Feb 10.
# Her phone: <Phone Number_1>.
# {"source": "tickets.csv", "row": 42, "blindfold_mapping": {"<Person_1>": "Sarah Chen", ...}}
#
# Order #9912 for <Person_1>, <Address_1>.
# Payment via card ending <Credit Card Number_1>.
# {"source": "orders.csv", "row": 108, "blindfold_mapping": {"<Person_1>": "James Rivera", ...}}

When you use the tokenize method, the mapping between tokens and original values is stored in each document's metadata["blindfold_mapping"]. This means you can reverse the tokenization later when you retrieve documents and need to display original values to authorized users.
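A minimal sketch of that reversal, using plain strings and dictionaries in place of LangChain Document objects (the helper name restore_pii is hypothetical, not part of the package):

```python
# Restore original values for an authorized viewer using the mapping
# stored alongside each document in metadata["blindfold_mapping"].
def restore_pii(page_content: str, metadata: dict) -> str:
    for token, original in metadata.get("blindfold_mapping", {}).items():
        page_content = page_content.replace(token, original)
    return page_content

restored = restore_pii(
    "Customer <Person_1> (<Email Address_1>) reported an outage.",
    {"blindfold_mapping": {
        "<Person_1>": "Sarah Chen",
        "<Email Address_1>": "sarah.chen@acme.com",
    }},
)
print(restored)
# Customer Sarah Chen (sarah.chen@acme.com) reported an outage.
```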

Using it in a RAG ingestion pipeline

Here is how BlindfoldPIITransformer fits into a typical RAG ingestion flow:

python
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_blindfold import BlindfoldPIITransformer

# 1. Load documents
loader = CSVLoader("support_tickets.csv")
docs = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)

# 3. Protect PII before indexing
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
    policy="basic",
)
safe_chunks = transformer.transform_documents(chunks)

# 4. Index into vector store — no PII in the embeddings
vectorstore = Chroma.from_documents(
    documents=safe_chunks,
    embedding=OpenAIEmbeddings(),
)

Now your vector store contains tokenized text. When documents are retrieved during a query, the context injected into the LLM prompt is already PII-free. You get accurate semantic search (the embeddings still capture the meaning of the text) without any personal data leaking into your LLM calls.

Other protection methods

The pii_method parameter controls how PII is handled. Depending on your use case, you might not need the mapping at all:

  • tokenize — Replace with reversible tokens. Mapping stored in metadata.
  • redact — Replace with [REDACTED]. Irreversible. Good for analytics pipelines where you never need the originals.
  • mask — Partially mask values (e.g., S***h C**n). Useful when you need to keep some context visible.
  • synthesize — Replace with realistic fake data. Great for testing and development environments.
  • encrypt — Replace with encrypted values. Reversible only with your key.
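To make the mask example concrete, here is a hypothetical sketch of that partial-masking rule in plain Python. Blindfold's actual masking logic may differ; this only reproduces the pattern shown above.

```python
# Illustrative masking: keep the first and last character of each word,
# replace the middle with asterisks.
def mask_value(value: str) -> str:
    return " ".join(
        w if len(w) <= 2 else w[0] + "*" * (len(w) - 2) + w[-1]
        for w in value.split()
    )

print(mask_value("Sarah Chen"))  # S***h C**n
```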

Compliance Policies

Different regulations require you to detect different types of personal data. Blindfold ships with pre-configured detection policies that target the entity types relevant to specific compliance frameworks. You set the policy once, and the right entities are detected automatically.

  • basic — General-purpose PII protection. Key entity types: Person, Email Address, Phone Number, Address, Credit Card Number.
  • gdpr_eu — EU data protection (GDPR). Key entity types: Person, Email Address, Phone Number, Address, Iban Code, Date Of Birth, National ID.
  • hipaa_us — US healthcare (HIPAA Safe Harbor). Covers all 18 HIPAA identifiers: Person, SSN, Medical Record Number, Date Of Birth, Address, Phone Number, and more.
  • pci_dss — Payment card data protection. Key entity types: Credit Card Number, CVV, Expiration Date, Cardholder Name.
  • strict — Maximum detection coverage. All supported entity types with the lowest threshold.

Using policies in your LangChain chain is straightforward — just pass the policy name to blindfold_protect() or to the individual runnables:

python
from langchain_blindfold import blindfold_protect

# For a healthcare chatbot
tokenize, detokenize = blindfold_protect(policy="hipaa_us", region="us")
chain = tokenize | prompt | llm | StrOutputParser() | detokenize

# For an EU customer support agent
tokenize, detokenize = blindfold_protect(policy="gdpr_eu", region="eu")
chain = tokenize | prompt | llm | StrOutputParser() | detokenize

# For a payment processing assistant
tokenize, detokenize = blindfold_protect(policy="pci_dss")
chain = tokenize | prompt | llm | StrOutputParser() | detokenize

EU and US Data Residency

Where your PII is processed matters. GDPR restricts cross-border transfers of EU personal data. HIPAA compliance officers typically require US-only processing. Blindfold offers regional API endpoints so you can guarantee data residency:

  • region="eu" — Routes all API calls to eu-api.blindfold.dev. PII is processed and stored exclusively within the EU.
  • region="us" — Routes all API calls to us-api.blindfold.dev. PII is processed and stored exclusively within the US.
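The routing above can be sketched as a simple lookup. This is a hypothetical illustration based on the hostnames listed above; the SDK resolves the endpoint internally.

```python
# Hypothetical mapping from region setting to regional API endpoint.
ENDPOINTS = {
    "eu": "https://eu-api.blindfold.dev",
    "us": "https://us-api.blindfold.dev",
}

def base_url(region: str) -> str:
    if region not in ENDPOINTS:
        raise ValueError(f"unknown region: {region!r}")
    return ENDPOINTS[region]

print(base_url("eu"))  # https://eu-api.blindfold.dev
```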

Set the region when creating the tokenizer or the document transformer:

python
# Chain with EU data residency
tokenize, detokenize = blindfold_protect(
    policy="gdpr_eu",
    region="eu",
)

# Document transformer with EU data residency
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
    policy="gdpr_eu",
    region="eu",
)

Important: The region setting controls where Blindfold processes the PII detection and tokenization. The tokenized text that you then send to your LLM provider contains no personal data, so the LLM call itself does not constitute a restricted data transfer — even if the provider operates from a different jurisdiction.

Complete Example: Protected RAG Chatbot

Here is a full example that puts everything together — a RAG chatbot with PII protection at both the ingestion and query layers:

python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_chroma import Chroma
from langchain_blindfold import blindfold_protect, BlindfoldPIITransformer

# --- Ingestion: protect documents before indexing ---
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
    policy="basic",
)

# Assume 'raw_docs' is loaded from your data source
safe_docs = transformer.transform_documents(raw_docs)

vectorstore = Chroma.from_documents(
    documents=safe_docs,
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

# --- Query: protect user input, retrieve, answer ---
tokenize, detokenize = blindfold_protect(policy="basic")

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the context below.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-4o")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    tokenize
    | {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
    | detokenize
)

# Ask a question with PII — it's protected end-to-end
answer = chain.invoke(
    "What was the resolution for Sarah Chen's outage ticket?"
)
print(answer)

In this pipeline, PII is protected at two levels. The documents in the vector store have already been tokenized during ingestion. And the user's question is tokenized before it reaches the retriever and LLM. The final answer is detokenized so the user sees natural language with real names and details. At no point does your LLM provider see any real personal data.

Multi-Language Support

Blindfold's PII detection works across 18+ languages, including English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Czech, Slovak, Danish, Swedish, Norwegian, and Romanian. This is especially useful if your LangChain application serves international users or processes multilingual documents. No language parameter is needed — detection is automatic.

Try It Yourself

Getting started takes less than a minute:

bash
pip install langchain-blindfold blindfold-sdk

The free tier includes 1M characters per month, which is enough to test thoroughly and run lightweight production workloads. No credit card required.

Already using LangChain? Adding PII protection does not require restructuring your chains. The tokenizer and detokenizer are standard LangChain Runnables that compose with the pipe operator. Your existing prompt templates, LLM configurations, and output parsers stay exactly the same.
