How to Protect PII in LangChain Pipelines
Add PII protection to any LangChain chain in two lines of code. Covers tokenization, RAG document transformers, compliance policies, and EU data residency with langchain-blindfold.
LangChain makes it remarkably easy to build AI-powered applications. Chain a prompt template to an LLM, add a retriever, wire up an agent — and you have a working pipeline in minutes. But every chain that processes user input carries a hidden risk: the personal data in that input gets sent directly to your LLM provider.
Names, email addresses, phone numbers, social security numbers — anything a user types into your chatbot, RAG pipeline, or agent workflow ends up in an API call to OpenAI, Anthropic, or whichever provider you use. That is a compliance problem under GDPR, HIPAA, and most other data protection frameworks. It is also a trust problem: your users expect you to handle their data responsibly.
This article shows you how to add PII protection to any LangChain chain using langchain-blindfold. You will learn how to wrap chains with automatic tokenization, protect documents in RAG pipelines, and apply compliance policies — all without changing your existing LangChain code.
The Problem with Unprotected Chains
Consider a typical LangChain chain that answers customer questions. This is the kind of code you will find in most tutorials:
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful customer support assistant."),
    ("human", "{input}"),
])

llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm | StrOutputParser()

# User input goes straight to OpenAI
response = chain.invoke({
    "input": "My name is Sarah Chen, my email is sarah.chen@acme.com, "
             "and my account number is 4829-1038-2847. Why was I charged twice?"
})
```
When this chain runs, the entire user message — including Sarah's name, email, and account number — is sent to OpenAI's API in plaintext. OpenAI now has that data in their logs, even if only temporarily. You have no control over what happens to it.
This pattern repeats across every LangChain application that takes user input: chatbots, support agents, RAG pipelines over customer documents, summarization tools. If PII goes in, PII goes out to the provider.
Adding PII Protection with langchain-blindfold
The langchain-blindfold package integrates Blindfold's PII detection directly into LangChain's Runnable interface. Install it alongside the Blindfold Python SDK:
```bash
pip install langchain-blindfold blindfold-sdk
```
Set your API key as an environment variable:
```bash
export BLINDFOLD_API_KEY="your-api-key"
```
Now wrap your chain with PII protection using blindfold_protect():
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_blindfold import blindfold_protect

# Create a paired tokenizer and detokenizer
tokenize, detokenize = blindfold_protect(policy="basic")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful customer support assistant."),
    ("human", "{input}"),
])

llm = ChatOpenAI(model="gpt-4o")

# Wrap the chain: tokenize input, run the chain, detokenize output
chain = tokenize | prompt | llm | StrOutputParser() | detokenize

# PII is now protected automatically
response = chain.invoke(
    "My name is Sarah Chen, my email is sarah.chen@acme.com, "
    "and my account number is 4829-1038-2847. Why was I charged twice?"
)
print(response)
# The response mentions "Sarah Chen" and her email normally,
# but OpenAI only ever saw <Person_1> and <Email Address_1>.
```
Two lines of code. That is all it takes. The blindfold_protect() function returns a paired tokenizer and detokenizer. You place the tokenizer at the start of your chain and the detokenizer at the end. Everything in between — the prompt, the LLM call, the output parser — works exactly as before, but with tokenized text instead of real PII.
How It Works
Both BlindfoldTokenizer and BlindfoldDetokenizer are LangChain Runnable objects. They implement invoke() and can be composed with the pipe operator just like any other LangChain component. Here is what happens at each step of the chain:
- **Tokenize.** The `BlindfoldTokenizer` sends the input text to the Blindfold API. PII entities are detected and replaced with tokens like `<Person_1>`, `<Email Address_1>`, `<Phone Number_1>`. The mapping between tokens and original values is stored internally.
- **Prompt + LLM.** The tokenized text flows through your prompt template and into the LLM. The model sees tokens, not real data. It reasons about the text normally and uses the tokens in its response.
- **Detokenize.** The `BlindfoldDetokenizer` reads the stored mapping and replaces every token in the LLM's output with the original value. This is a local string replacement; no API call is needed. The mapping is cleared after use.
Here is a concrete example of what the LLM actually sees versus what the user sees:
**What the user sends**

> "My name is Sarah Chen, my email is sarah.chen@acme.com, and my account number is 4829-1038-2847. Why was I charged twice?"

**What the LLM receives (after tokenization)**

> "My name is <Person_1>, my email is <Email Address_1>, and my account number is <Credit Card Number_1>. Why was I charged twice?"

**What the LLM responds**

> "I'm sorry to hear about the double charge, <Person_1>. I've looked into your account and can see the duplicate transaction. I'll process a refund to the card ending in <Credit Card Number_1> and send confirmation to <Email Address_1>."

**What the user sees (after detokenization)**

> "I'm sorry to hear about the double charge, Sarah Chen. I've looked into your account and can see the duplicate transaction. I'll process a refund to the card ending in 4829-1038-2847 and send confirmation to sarah.chen@acme.com."

**Key point:** The LLM never sees any real personal data. It works entirely with tokens, which it treats as opaque placeholders. The quality of the response is unaffected because the model still understands the structure and context of the message.
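The mechanics of this round trip can be sketched in plain Python. The sketch below is illustrative only: `tokenize`, `fake_llm`, and `detokenize` are stand-ins I wrote for this example (the real detection runs server-side in the Blindfold API, and the real components are LangChain Runnables), but the token mapping lifecycle works the same way.

```python
# Illustrative mock of the tokenize -> LLM -> detokenize round trip.
# NOT the SDK implementation; it only shows the data flow.

mapping = {}  # token -> original value, held between the two steps

def tokenize(text: str) -> str:
    """Stand-in for the API call that detects and replaces PII."""
    mapping["<Person_1>"] = "Sarah Chen"
    mapping["<Email Address_1>"] = "sarah.chen@acme.com"
    out = text
    for token, original in mapping.items():
        out = out.replace(original, token)
    return out

def fake_llm(prompt: str) -> str:
    """Stand-in for the model: it reuses the tokens it was given."""
    return "Thanks, <Person_1>! We'll email <Email Address_1> shortly."

def detokenize(text: str) -> str:
    """Local string replacement using the stored mapping, then clear it."""
    out = text
    for token, original in mapping.items():
        out = out.replace(token, original)
    mapping.clear()  # prevent leakage between requests
    return out

user_input = "My name is Sarah Chen, my email is sarah.chen@acme.com."
safe = tokenize(user_input)
answer = detokenize(fake_llm(safe))

print(safe)    # My name is <Person_1>, my email is <Email Address_1>.
print(answer)  # Thanks, Sarah Chen! We'll email sarah.chen@acme.com shortly.
```

Note that the mapping is emptied after detokenization, so nothing persists between requests.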
Using the Runnables Directly
The blindfold_protect() helper is the easiest way to get started, but you can also instantiate BlindfoldTokenizer and BlindfoldDetokenizer directly for more control:
```python
from langchain_blindfold import BlindfoldTokenizer, BlindfoldDetokenizer

# Fine-grained control over tokenizer settings
tokenizer = BlindfoldTokenizer(
    api_key="your-api-key",
    region="eu",
    policy="gdpr_eu",
    score_threshold=0.5,
)
detokenizer = BlindfoldDetokenizer(tokenizer=tokenizer)

# Use them in a chain
chain = tokenizer | prompt | llm | StrOutputParser() | detokenizer

# Or invoke them independently
tokenized_text = tokenizer.invoke("Contact me at john@example.com")
print(tokenized_text)
# "Contact me at <Email Address_1>"

# Check the stored mapping
print(tokenizer.get_mapping())
# {"<Email Address_1>": "john@example.com"}
```
The BlindfoldDetokenizer takes a reference to the tokenizer and reads its mapping to perform the reverse replacement. This is done locally — no additional API call is made for detokenization. The mapping is automatically cleared after each detokenization to prevent data leaking between requests.
Protecting Documents in RAG Pipelines
Chains are not the only place where PII leaks. In Retrieval-Augmented Generation (RAG) pipelines, documents loaded from databases, PDFs, or APIs often contain personal data. When those documents are embedded and stored in a vector store, the PII gets baked into your index. When they are retrieved and injected into prompts, the PII goes to the LLM.
The BlindfoldPIITransformer solves this by protecting PII in LangChain Document objects before they go into the vector store. It implements LangChain's BaseDocumentTransformer interface, so it drops in anywhere you use document transformers today.
```python
from langchain_blindfold import BlindfoldPIITransformer
from langchain_core.documents import Document

# Create a transformer that tokenizes PII in documents
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
    policy="basic",
)

# Your documents from any loader
docs = [
    Document(
        page_content="Customer Sarah Chen (sarah.chen@acme.com) reported "
                     "an outage on Feb 10. Her phone: +1-555-234-5678.",
        metadata={"source": "tickets.csv", "row": 42},
    ),
    Document(
        page_content="Order #9912 for James Rivera, 742 Evergreen Terrace, "
                     "Springfield IL 62704. Payment via card ending 4821.",
        metadata={"source": "orders.csv", "row": 108},
    ),
]

# Transform: PII in page_content is replaced with tokens
safe_docs = transformer.transform_documents(docs)

for doc in safe_docs:
    print(doc.page_content)
    print(doc.metadata)
    print()

# Output:
# Customer <Person_1> (<Email Address_1>) reported an outage on Feb 10.
# Her phone: <Phone Number_1>.
# {"source": "tickets.csv", "row": 42, "blindfold_mapping": {"<Person_1>": "Sarah Chen", ...}}
#
# Order #9912 for <Person_1>, <Address_1>.
# Payment via card ending <Credit Card Number_1>.
# {"source": "orders.csv", "row": 108, "blindfold_mapping": {"<Person_1>": "James Rivera", ...}}
```
When you use the tokenize method, the mapping between tokens and original values is stored in each document's metadata["blindfold_mapping"]. This means you can reverse the tokenization later when you retrieve documents and need to display original values to authorized users.
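Because the mapping travels with each document, reversing it for an authorized viewer is a local string replacement over `metadata["blindfold_mapping"]`. Here is a minimal sketch of that step. `reveal_pii` is a hypothetical helper (not part of the package), and `Doc` is a plain stand-in for `langchain_core.documents.Document` so the sketch runs on its own:

```python
from dataclasses import dataclass, field

# Plain stand-in for langchain_core.documents.Document, so this
# sketch runs without LangChain installed.
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

def reveal_pii(doc: Doc) -> str:
    """Hypothetical helper: restore original values for an authorized
    viewer by replaying the mapping stored at ingestion time."""
    text = doc.page_content
    for token, original in doc.metadata.get("blindfold_mapping", {}).items():
        text = text.replace(token, original)
    return text

retrieved = Doc(
    page_content="Customer <Person_1> (<Email Address_1>) reported an outage.",
    metadata={
        "source": "tickets.csv",
        "blindfold_mapping": {
            "<Person_1>": "Sarah Chen",
            "<Email Address_1>": "sarah.chen@acme.com",
        },
    },
)

print(reveal_pii(retrieved))
# Customer Sarah Chen (sarah.chen@acme.com) reported an outage.
```

Gate a helper like this behind your own authorization checks; unauthorized paths should only ever see the tokenized `page_content`.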
Using it in a RAG ingestion pipeline
Here is how BlindfoldPIITransformer fits into a typical RAG ingestion flow:
```python
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_blindfold import BlindfoldPIITransformer

# 1. Load documents
loader = CSVLoader("support_tickets.csv")
docs = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)

# 3. Protect PII before indexing
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
    policy="basic",
)
safe_chunks = transformer.transform_documents(chunks)

# 4. Index into vector store — no PII in the embeddings
vectorstore = Chroma.from_documents(
    documents=safe_chunks,
    embedding=OpenAIEmbeddings(),
)
```
Now your vector store contains tokenized text. When documents are retrieved during a query, the context injected into the LLM prompt is already PII-free. You get accurate semantic search (the embeddings still capture the meaning of the text) without any personal data leaking into your LLM calls.
Other protection methods
The pii_method parameter controls how PII is handled. Depending on your use case, you might not need the mapping at all:
- `tokenize` — Replace with reversible tokens. Mapping stored in metadata.
- `redact` — Replace with `[REDACTED]`. Irreversible. Good for analytics pipelines where you never need the originals.
- `mask` — Partially mask values (e.g., `S***h C**n`). Useful when you need to keep some context visible.
- `synthesize` — Replace with realistic fake data. Great for testing and development environments.
- `encrypt` — Replace with encrypted values. Reversible only with your key.
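To make the output shapes concrete, here is a plain-Python illustration of what the redacted and masked forms of a value look like. This is only an approximation of the formats described above (the real transformation happens server-side in the Blindfold API, and its exact masking rules may differ):

```python
# Illustration of output shapes only -- not the SDK's implementation.
original = "Sarah Chen"

# redact: irreversible placeholder
redacted = "[REDACTED]"

# mask: keep the first and last character of each word, star the rest
# (approximates the "S***h C**n" style shown above)
masked = " ".join(
    w[0] + "*" * (len(w) - 2) + w[-1] if len(w) > 2 else w
    for w in original.split()
)
print(masked)  # S***h C**n
```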
Compliance Policies
Different regulations require you to detect different types of personal data. Blindfold ships with pre-configured detection policies that target the entity types relevant to specific compliance frameworks. You set the policy once, and the right entities are detected automatically.
| Policy | Use case | Key entity types |
|---|---|---|
| `basic` | General-purpose PII protection | Person, Email Address, Phone Number, Address, Credit Card Number |
| `gdpr_eu` | EU data protection (GDPR) | Person, Email Address, Phone Number, Address, Iban Code, Date Of Birth, National ID |
| `hipaa_us` | US healthcare (HIPAA Safe Harbor) | All 18 HIPAA identifiers: Person, SSN, Medical Record Number, Date Of Birth, Address, Phone Number, and more |
| `pci_dss` | Payment card data protection | Credit Card Number, CVV, Expiration Date, Cardholder Name |
| `strict` | Maximum detection coverage | All supported entity types with the lowest threshold |
Using policies in your LangChain chain is straightforward — just pass the policy name to blindfold_protect() or to the individual runnables:
```python
from langchain_blindfold import blindfold_protect

# For a healthcare chatbot
tokenize, detokenize = blindfold_protect(policy="hipaa_us", region="us")
chain = tokenize | prompt | llm | StrOutputParser() | detokenize

# For an EU customer support agent
tokenize, detokenize = blindfold_protect(policy="gdpr_eu", region="eu")
chain = tokenize | prompt | llm | StrOutputParser() | detokenize

# For a payment processing assistant
tokenize, detokenize = blindfold_protect(policy="pci_dss")
chain = tokenize | prompt | llm | StrOutputParser() | detokenize
```
EU and US Data Residency
Where your PII is processed matters. GDPR restricts cross-border transfers of EU personal data. HIPAA compliance officers typically require US-only processing. Blindfold offers regional API endpoints so you can guarantee data residency:
- `region="eu"` — Routes all API calls to `eu-api.blindfold.dev`. PII is processed and stored exclusively within the EU.
- `region="us"` — Routes all API calls to `us-api.blindfold.dev`. PII is processed and stored exclusively within the US.
Set the region when creating the tokenizer or the document transformer:
```python
# Chain with EU data residency
tokenize, detokenize = blindfold_protect(
    policy="gdpr_eu",
    region="eu",
)

# Document transformer with EU data residency
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
    policy="gdpr_eu",
    region="eu",
)
```
Important: The region setting controls where Blindfold processes the PII detection and tokenization. The tokenized text that you then send to your LLM provider contains no personal data, so the LLM call itself does not constitute a restricted data transfer — even if the provider operates from a different jurisdiction.
Complete Example: Protected RAG Chatbot
Here is a full example that puts everything together — a RAG chatbot with PII protection at both the ingestion and query layers:
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_chroma import Chroma
from langchain_blindfold import blindfold_protect, BlindfoldPIITransformer

# --- Ingestion: protect documents before indexing ---
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
    policy="basic",
)

# Assume 'raw_docs' is loaded from your data source
safe_docs = transformer.transform_documents(raw_docs)

vectorstore = Chroma.from_documents(
    documents=safe_docs,
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

# --- Query: protect user input, retrieve, answer ---
tokenize, detokenize = blindfold_protect(policy="basic")

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the context below.\n\nContext:\n{context}"),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4o")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    tokenize
    | {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
    | detokenize
)

# Ask a question with PII — it's protected end-to-end
answer = chain.invoke(
    "What was the resolution for Sarah Chen's outage ticket?"
)
print(answer)
```
In this pipeline, PII is protected at two levels. The documents in the vector store have already been tokenized during ingestion. And the user's question is tokenized before it reaches the retriever and LLM. The final answer is detokenized so the user sees natural language with real names and details. At no point does your LLM provider see any real personal data.
Multi-Language Support
Blindfold's PII detection works across 18+ languages, including English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Czech, Slovak, Danish, Swedish, Norwegian, and Romanian. This is especially useful if your LangChain application serves international users or processes multilingual documents. No language parameter is needed — detection is automatic.
Try It Yourself
Getting started takes less than a minute:
```bash
pip install langchain-blindfold blindfold-sdk
```
The free tier includes 1M characters per month, which is enough to test thoroughly and run lightweight production workloads. No credit card required.
- **GitHub:** blindfold-dev/langchain-blindfold — Source code, API reference, and examples
- **PyPI:** langchain-blindfold — Install with `pip install langchain-blindfold`
- **Blindfold Documentation** — Full API docs, SDK guides, and compliance resources
- **Sign up for free** — Get your API key and start protecting PII in your LangChain pipelines
Already using LangChain? Adding PII protection does not require restructuring your chains. The tokenizer and detokenizer are standard LangChain Runnables that compose with the pipe operator. Your existing prompt templates, LLM configurations, and output parsers stay exactly the same.