Protecting PII in LangChain RAG Pipelines
Use BlindfoldPIITransformer and blindfold_protect() to add PII protection to LangChain RAG pipelines. Covers document ingestion, retrieval chains, and compliance policy recommendations.
LangChain is the most popular framework for building RAG pipelines. It makes it easy to load documents, split them, embed them, and retrieve them at query time. The langchain-blindfold package provides native integration through BlindfoldPIITransformer for selectively redacting documents at ingestion time.
The key insight for RAG pipelines is that not all PII needs the same treatment. Names should stay in documents so users can search for them by name, while contact details like email addresses and phone numbers should be removed. At query time, an explicit retrieve-then-tokenize flow protects both the retrieved context and the user's question in a single tokenization call before they reach the LLM.
This post walks through the corrected architecture: selective redaction at ingestion with the entities parameter, and a retrieve-then-tokenize pattern at query time that keeps token numbering consistent across context and question.
The Document Ingestion Pipeline
A typical LangChain RAG ingestion pipeline has four steps: load documents, split them into chunks, embed them, and store in a vector store. With Blindfold, you add one step between splitting and embedding — selectively redact contact information while keeping names intact for searchability:
```python
from langchain_blindfold import BlindfoldPIITransformer
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Load documents
loader = CSVLoader("support_tickets.csv")
docs = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)

# 3. Selectively redact contact info — keep names for search
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    policy="basic",
    entities=["email address", "phone number"],
)
safe_chunks = transformer.transform_documents(chunks)

# 4. Index into vector store — names preserved, contact info removed
vectorstore = FAISS.from_documents(
    documents=safe_chunks,
    embedding=OpenAIEmbeddings(),
)
```
The entities parameter tells Blindfold to only redact the specified entity types. By targeting email addresses and phone numbers, names like "Sarah Chen" stay in the document text and the resulting embeddings. This means that when a user asks "What was Sarah Chen's issue?", the vector search finds the right chunks because the name still appears.
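The effect on retrieval can be sketched with a toy keyword scorer. This is a crude stand-in for embedding similarity (real pipelines use vector search), but the failure mode is the same: once names are redacted out of the stored text, a name-based query has nothing left to match.

```python
import re

def words(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))

def score(query, doc):
    """Toy relevance: count words shared between query and document."""
    return len(words(query) & words(doc))

query = "What was Sarah Chen's issue?"
name_kept = "Customer Sarah Chen ([EMAIL_ADDRESS]) reported a billing error."
name_redacted = "Customer [PERSON] ([EMAIL_ADDRESS]) reported a billing error."

print(score(query, name_kept))      # 2 ("sarah" and "chen" still match)
print(score(query, name_redacted))  # 0 (nothing in the query matches)
```

With full redaction, the chunk about Sarah Chen scores no better than any other chunk for her name, which is exactly why selective redaction keeps names in place.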
The BlindfoldPIITransformer implements LangChain's BaseDocumentTransformer interface. It processes each document's page_content, redacts only the targeted entities, and preserves all existing metadata. It drops in anywhere you use document transformers today.
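That contract is small enough to sketch with plain Python. The `Document` dataclass and the regex below are illustrative stand-ins (not LangChain's class or Blindfold's detection logic); the point is the shape of `transform_documents`: rewrite `page_content`, leave metadata untouched.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class ToyRedactTransformer:
    """Illustrates the transform_documents contract. Real PII
    detection is far more sophisticated than this single regex."""

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def transform_documents(self, docs):
        return [
            Document(
                page_content=self.EMAIL.sub("[EMAIL_ADDRESS]", d.page_content),
                metadata=dict(d.metadata),  # preserved, just copied
            )
            for d in docs
        ]

docs = [Document("Sarah Chen (sarah@acme.com) reported a bug.", {"row": 42})]
safe = ToyRedactTransformer().transform_documents(docs)
print(safe[0].page_content)  # Sarah Chen ([EMAIL_ADDRESS]) reported a bug.
print(safe[0].metadata)      # {'row': 42}
```

Because the real transformer honors the same interface, anything that consumes a list of `Document` objects downstream (splitters, vector stores) is unaffected by the extra step.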
Tokenize vs. Redact for Documents
The pii_method parameter controls how PII is handled. The two most common options for RAG are:
Selective Redact (recommended for most cases)
Permanently removes only the specified entity types. Pass the entities parameter to target contact information while keeping names in the text. This is the best balance for RAG — names remain searchable, but emails and phone numbers are stripped from the vector store with no way to recover them.
```python
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    entities=["email address", "phone number"],
)
safe_docs = transformer.transform_documents(docs)
# page_content: "Customer Sarah Chen ([EMAIL_ADDRESS]) reported..."
# metadata: {"source": "tickets.csv", "row": 42} (no mapping)
```
Tokenize (when you need reversibility)
Replaces PII with reversible tokens and stores the mapping in document metadata. Useful when you need to restore original values later — for example, showing authorized users the real data from retrieved documents.
```python
transformer = BlindfoldPIITransformer(pii_method="tokenize")
safe_docs = transformer.transform_documents(docs)
# page_content: "Customer <Person_1> (<Email Address_1>) reported..."
# metadata: {
#     "source": "tickets.csv",
#     "row": 42,
#     "blindfold_mapping": {"<Person_1>": "Sarah Chen", ...}
# }
```
Recommendation: Use selective redact with the entities parameter for most RAG pipelines. It keeps names searchable in the vector store while permanently stripping contact information. No mapping to manage, no way to reverse the transformation for the redacted entities.
The Retrieval Chain
For the query side, use an explicit retrieve-then-tokenize flow. The key idea: search the vector store with the original question (so names match), then tokenize the retrieved context and question together in a single call before sending them to the LLM.
```python
from blindfold import Blindfold
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

bf = Blindfold()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

question = "What was Sarah Chen's billing issue?"

# 1. Search with the original question — names match in the vector store
#    (retriever comes from the ingestion pipeline above)
docs = retriever.invoke(question)
context = format_docs(docs)

# 2. Tokenize context + question together in a single call
#    This ensures consistent token numbering (e.g. <Person_1> = same person everywhere)
combined = f"Context:\n{context}\n\nQuestion: {question}"
result = bf.tokenize(combined)
safe_text = result.tokenized_text
mapping = result.mapping

# 3. Split back and send to LLM
safe_context, safe_question = safe_text.split("\n\nQuestion: ", 1)
safe_context = safe_context.removeprefix("Context:\n")

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using this context:\n{context}"),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm | StrOutputParser()
tokenized_answer = chain.invoke({"context": safe_context, "question": safe_question})

# 4. Detokenize to restore original names
answer = bf.detokenize(tokenized_answer, mapping)
```
Here is what happens at each step:
- Retrieve with original question. The user's question goes to the retriever as-is. Because names were preserved during ingestion, "Sarah Chen" matches the vector store content and the right chunks are returned.
- Tokenize context + question together. The retrieved context and the question are combined into a single string and tokenized in one call. This is critical — if you tokenized them separately, each call would produce independent token numbering (both might use `<Person_1>` for different people). A single call ensures that `<Person_1>` refers to the same person across both the context and the question.
- LLM call. The LLM receives tokenized context and a tokenized question. It generates a response using tokens as placeholders, never seeing real PII.
- Detokenize. The stored mapping is used to replace tokens with original values in the LLM's response, producing a natural-language answer for the user.
Why a single tokenize call? Each call to `tokenize()` starts numbering tokens from 1. If you tokenize the context and question separately, the context might map `<Person_1>` to "Sarah Chen" while the question maps `<Person_1>` to "James Rivera". The LLM would then confuse the two people. Combining them into one string before tokenizing avoids this entirely.
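The collision is easy to reproduce with a toy tokenizer. This is not the Blindfold implementation, only its per-call numbering scheme: it replaces known names with numbered tokens, restarting from 1 on every call.

```python
NAMES = ["Sarah Chen", "James Rivera"]

def toy_tokenize(text):
    """Replace known names with <Person_N> tokens, numbering from 1
    on every call. Mimics only the numbering, not real PII detection."""
    mapping, n = {}, 0
    for name in NAMES:
        if name in text:
            n += 1
            token = f"<Person_{n}>"
            text = text.replace(name, token)
            mapping[token] = name
    return text, mapping

def toy_detokenize(text, mapping):
    """Restore original values: plain string replacement."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

# Separate calls: both strings get <Person_1>, for different people
_, ctx_map = toy_tokenize("Sarah Chen reported a billing error.")
_, q_map = toy_tokenize("James Rivera asked about timeouts.")
print(ctx_map)  # {'<Person_1>': 'Sarah Chen'}
print(q_map)    # {'<Person_1>': 'James Rivera'}  <- collision

# One call over the combined string: numbering stays consistent
combined, mapping = toy_tokenize(
    "Context: Sarah Chen reported a billing error.\n\n"
    "Question: What did James Rivera ask?"
)
print(mapping)  # {'<Person_1>': 'Sarah Chen', '<Person_2>': 'James Rivera'}
print(toy_detokenize("<Person_1> had a billing error.", mapping))
# Sarah Chen had a billing error.
```

The same logic explains the detokenize step: with one shared mapping, token replacement in the LLM's answer is unambiguous.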
Complete RAG Pipeline
Here is the full end-to-end example combining ingestion and query:
```python
from blindfold import Blindfold
from langchain_blindfold import BlindfoldPIITransformer
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# --- Ingestion: selectively redact contact info, keep names ---
docs = [
    Document(page_content="Customer Sarah Chen (sarah@acme.com) reported a billing error."),
    Document(page_content="James Rivera had API timeouts. Root cause: DNS misconfiguration."),
    Document(page_content="Maria Garcia (maria@example.es) requested a GDPR data export."),
]
splitter = RecursiveCharacterTextSplitter(chunk_size=500)
chunks = splitter.split_documents(docs)

transformer = BlindfoldPIITransformer(
    pii_method="redact",
    policy="basic",
    entities=["email address", "phone number"],
)
safe_chunks = transformer.transform_documents(chunks)
# "Customer Sarah Chen ([EMAIL_ADDRESS]) reported a billing error."

vectorstore = FAISS.from_documents(safe_chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# --- Query: retrieve then tokenize ---
bf = Blindfold()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

question = "What was Sarah Chen's issue?"

# 1. Search with original question — "Sarah Chen" matches the vector store
docs = retriever.invoke(question)
context = format_docs(docs)

# 2. Tokenize context + question together (consistent token numbering)
combined = f"Context:\n{context}\n\nQuestion: {question}"
result = bf.tokenize(combined)
safe_context, safe_question = result.tokenized_text.split("\n\nQuestion: ", 1)
safe_context = safe_context.removeprefix("Context:\n")

# 3. LLM call with tokenized inputs
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using this context:\n{context}"),
    ("human", "{question}"),
])
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
tokenized_answer = chain.invoke({"context": safe_context, "question": safe_question})

# 4. Detokenize to restore original names
answer = bf.detokenize(tokenized_answer, result.mapping)
print(answer)
# "Sarah Chen reported a billing error on her account."
```
Security Trade-offs
There are three approaches to handling PII at ingestion time, each with different trade-offs between privacy, searchability, and complexity:
Selective Redaction (recommended)
Keep names in the documents for searchability, redact contact information like email addresses and phone numbers. Users can search by name and get accurate retrieval results. Contact details are permanently removed with no way to recover them.
```python
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    entities=["email address", "phone number"],
)
# "Sarah Chen ([EMAIL_ADDRESS]) reported..." — name stays, email gone
```
Full Redaction
Redact all PII including names. This provides the strongest privacy guarantee — no personal data of any kind in the vector store. The trade-off is that name-based searches no longer work. Users would need to search by topic or keywords instead.
```python
transformer = BlindfoldPIITransformer(
    pii_method="redact",
)
# "[PERSON] ([EMAIL_ADDRESS]) reported..." — all PII removed
```
Tokenize with Stored Mapping (advanced)
Replace all PII with reversible tokens and store the mapping. This gives full privacy in the vector store (no real PII in embeddings) while allowing authorized processes to restore original values when needed. The trade-off is added complexity — you need to manage and secure the token mappings, and name-based vector search will not work because the embeddings contain tokens rather than real names.
```python
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
)
# "<Person_1> (<Email Address_1>) reported..." — reversible, mapping in metadata
```
Which to choose? For most RAG pipelines, selective redaction is the best starting point. It preserves the search experience users expect (asking about a person by name) while removing the contact details that create the most compliance risk. Move to full redaction only if your compliance requirements prohibit storing names in the vector store.
Policy Recommendations
Match the compliance policy to your RAG use case:
| Use case | Policy | Region | Key entities |
|---|---|---|---|
| General knowledge base | basic | — | Names, emails, phones, addresses, credit cards |
| EU customer data | gdpr_eu | eu | Names, IBANs, national IDs, DOB, addresses |
| Healthcare documents | hipaa_us | us | All 18 HIPAA identifiers (SSN, MRN, DOB, etc.) |
| Payment records | pci_dss | — | Credit cards, CVVs, expiration dates |
```python
# EU customer support RAG — selective redaction + tokenize at query time
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    policy="gdpr_eu",
    region="eu",
    entities=["email address", "phone number", "IBAN"],
)
bf = Blindfold(policy="gdpr_eu", region="eu")

# Healthcare RAG — full redaction (HIPAA requires removing all 18 identifiers)
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    policy="hipaa_us",
    region="us",
)
bf = Blindfold(policy="hipaa_us", region="us")
```
Try It Yourself
Get started with PII-safe LangChain RAG:
```bash
pip install langchain-blindfold blindfold-sdk langchain-openai faiss-cpu
```
- LangChain RAG Cookbook Example (Python) — complete, runnable example with FAISS
- LangChain RAG Cookbook Example (TypeScript) — same pattern in LangChain.js
- PyPI: langchain-blindfold — install with `pip install langchain-blindfold`
- RAG Pipeline Protection Guide — full documentation with Python, JavaScript, and LangChain examples
- Sign up for free — 500K characters per month, no credit card required
Already using LangChain for RAG? Adding PII protection requires minimal changes. The BlindfoldPIITransformer is a standard document transformer that slots into your existing ingestion pipeline. At query time, the retrieve-then-tokenize pattern adds a few lines around your existing retriever and LLM calls. Your loaders, splitters, embeddings, and retriever configurations stay exactly the same.
Start protecting sensitive data
Free plan includes 500K characters/month. No credit card required.