LangChain · February 25, 2026 · 9 min read

Protecting PII in LangChain RAG Pipelines

Use BlindfoldPIITransformer and blindfold_protect() to add PII protection to LangChain RAG pipelines. Covers document ingestion, retrieval chains, and compliance policy recommendations.

LangChain is the most popular framework for building RAG pipelines. It makes it easy to load documents, split them, embed them, and retrieve them at query time. The langchain-blindfold package provides native integration through BlindfoldPIITransformer for selectively redacting documents at ingestion time.

The key insight for RAG pipelines is that not all PII needs the same treatment. Names should stay in documents so users can search for them by name, while contact details like email addresses and phone numbers should be removed. At query time, an explicit retrieve-then-tokenize flow protects both the retrieved context and the user's question in a single tokenization call before they reach the LLM.

This post walks through the full architecture: selective redaction at ingestion with the entities parameter, and a retrieve-then-tokenize pattern at query time that keeps token numbering consistent across context and question.

The Document Ingestion Pipeline

A typical LangChain RAG ingestion pipeline has four steps: load documents, split them into chunks, embed them, and store in a vector store. With Blindfold, you add one step between splitting and embedding — selectively redact contact information while keeping names intact for searchability:

python
from langchain_blindfold import BlindfoldPIITransformer
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Load documents
loader = CSVLoader("support_tickets.csv")
docs = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)

# 3. Selectively redact contact info — keep names for search
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    policy="basic",
    entities=["email address", "phone number"],
)
safe_chunks = transformer.transform_documents(chunks)

# 4. Index into vector store — names preserved, contact info removed
vectorstore = FAISS.from_documents(
    documents=safe_chunks,
    embedding=OpenAIEmbeddings(),
)

The entities parameter tells Blindfold to only redact the specified entity types. By targeting email addresses and phone numbers, names like "Sarah Chen" stay in the document text and the resulting embeddings. This means that when a user asks "What was Sarah Chen's issue?", the vector search finds the right chunks because the name still appears.

The BlindfoldPIITransformer implements LangChain's BaseDocumentTransformer interface. It processes each document's page_content, redacts only the targeted entities, and preserves all existing metadata. It drops in anywhere you use document transformers today.
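To make the drop-in claim concrete, here is a toy transformer with the same shape. This is an illustrative sketch, not the Blindfold implementation: it uses a plain dataclass stand-in for LangChain's Document and a naive email regex, and `MaskEmailTransformer` is a hypothetical name.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class MaskEmailTransformer:
    """Toy transformer with the BaseDocumentTransformer shape:
    rewrites page_content, leaves metadata untouched."""
    def transform_documents(self, documents):
        return [
            Document(
                page_content=re.sub(r"\S+@\S+", "[EMAIL_ADDRESS]", d.page_content),
                metadata=d.metadata,  # preserved as-is
            )
            for d in documents
        ]

docs = [Document("Contact sarah@acme.com for billing.", {"source": "tickets.csv"})]
out = MaskEmailTransformer().transform_documents(docs)
print(out[0].page_content)  # Contact [EMAIL_ADDRESS] for billing.
print(out[0].metadata)      # {'source': 'tickets.csv'}
```

Because the real transformer follows this same contract, anything that consumes a list of documents downstream (splitters, vector stores) works unchanged.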

Tokenize vs. Redact for Documents

The pii_method parameter controls how PII is handled. The two most common options for RAG are:

Selective Redact (recommended for most cases)

Permanently removes only the specified entity types. Pass the entities parameter to target contact information while keeping names in the text. This is the best balance for RAG — names remain searchable, but emails and phone numbers are stripped from the vector store with no way to recover them.

python
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    entities=["email address", "phone number"],
)
safe_docs = transformer.transform_documents(docs)

# page_content: "Customer Sarah Chen ([EMAIL_ADDRESS]) reported..."
# metadata: {"source": "tickets.csv", "row": 42}  (no mapping)

Tokenize (when you need reversibility)

Replaces PII with reversible tokens and stores the mapping in document metadata. Useful when you need to restore original values later — for example, showing authorized users the real data from retrieved documents.

python
transformer = BlindfoldPIITransformer(pii_method="tokenize")
safe_docs = transformer.transform_documents(docs)

# page_content: "Customer <Person_1> (<Email Address_1>) reported..."
# metadata: {
#   "source": "tickets.csv",
#   "row": 42,
#   "blindfold_mapping": {"<Person_1>": "Sarah Chen", ...}
# }

Recommendation: Use selective redact with the entities parameter for most RAG pipelines. It keeps names searchable in the vector store while permanently stripping contact information. No mapping to manage, no way to reverse the transformation for the redacted entities.

The Retrieval Chain

For the query side, use an explicit retrieve-then-tokenize flow. The key idea: search the vector store with the original question (so names match), then tokenize the retrieved context and question together in a single call before sending them to the LLM.

python
from blindfold import Blindfold
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

bf = Blindfold()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

question = "What was Sarah Chen's billing issue?"

# 1. Search with the original question — names match in the vector store
#    (`retriever` comes from the ingestion step, e.g. vectorstore.as_retriever())
docs = retriever.invoke(question)
context = format_docs(docs)

# 2. Tokenize context + question together in a single call
#    This ensures consistent token numbering (e.g. <Person_1> = same person everywhere)
combined = f"Context:\n{context}\n\nQuestion: {question}"
result = bf.tokenize(combined)
safe_text = result.tokenized_text
mapping = result.mapping

# 3. Split back and send to LLM
safe_context, safe_question = safe_text.split("\n\nQuestion: ", 1)
safe_context = safe_context.removeprefix("Context:\n")

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using this context:\n{context}"),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm | StrOutputParser()
tokenized_answer = chain.invoke({"context": safe_context, "question": safe_question})

# 4. Detokenize to restore original names
answer = bf.detokenize(tokenized_answer, mapping)

Here is what happens at each step:

  1. Retrieve with original question. The user's question goes to the retriever as-is. Because names were preserved during ingestion, "Sarah Chen" matches the vector store content and the right chunks are returned.
  2. Tokenize context + question together. The retrieved context and the question are combined into a single string and tokenized in one call. This is critical — if you tokenized them separately, each call would produce independent token numbering (both might use <Person_1> for different people). A single call ensures that <Person_1> refers to the same person across both the context and the question.
  3. LLM call. The LLM receives tokenized context and a tokenized question. It generates a response using tokens as placeholders, never seeing real PII.
  4. Detokenize. The stored mapping is used to replace tokens with original values in the LLM's response, producing a natural-language answer for the user.

Why a single tokenize call? Each call to tokenize() starts numbering tokens from 1. If you tokenize the context and question separately, the context might map <Person_1> to "Sarah Chen" while the question maps <Person_1> to "James Rivera". The LLM would then confuse the two people. Combining them into one string before tokenizing avoids this entirely.
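The collision is easy to demonstrate with a toy stand-in tokenizer. `toy_tokenize` below is hypothetical (not the Blindfold SDK), but it mimics the relevant behavior: each call numbers person tokens from 1.

```python
def toy_tokenize(text, names):
    """Toy stand-in for bf.tokenize(): numbers person tokens from 1 per call."""
    mapping = {}
    for i, name in enumerate(names, start=1):
        token = f"<Person_{i}>"
        text = text.replace(name, token)
        mapping[token] = name
    return text, mapping

# Separate calls: each starts numbering at 1, so <Person_1> collides
ctx, ctx_map = toy_tokenize("Sarah Chen reported a billing error.", ["Sarah Chen"])
q, q_map = toy_tokenize("What did James Rivera report?", ["James Rivera"])
print(ctx_map)  # {'<Person_1>': 'Sarah Chen'}
print(q_map)    # {'<Person_1>': 'James Rivera'} — same token, different person!

# Single call over the combined string: numbering stays consistent
combined, combined_map = toy_tokenize(
    "Sarah Chen reported a billing error.\n\nWhat did James Rivera report?",
    ["Sarah Chen", "James Rivera"],
)
print(combined_map)  # {'<Person_1>': 'Sarah Chen', '<Person_2>': 'James Rivera'}
```

With the separate calls, merging the two mappings for detokenization would silently drop one person; the combined call produces one unambiguous mapping.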

Complete RAG Pipeline

Here is the full end-to-end example combining ingestion and query:

python
from blindfold import Blindfold
from langchain_blindfold import BlindfoldPIITransformer
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# --- Ingestion: selectively redact contact info, keep names ---
docs = [
    Document(page_content="Customer Sarah Chen (sarah@acme.com) reported a billing error."),
    Document(page_content="James Rivera had API timeouts. Root cause: DNS misconfiguration."),
    Document(page_content="Maria Garcia (maria@example.es) requested a GDPR data export."),
]

splitter = RecursiveCharacterTextSplitter(chunk_size=500)
chunks = splitter.split_documents(docs)

transformer = BlindfoldPIITransformer(
    pii_method="redact",
    policy="basic",
    entities=["email address", "phone number"],
)
safe_chunks = transformer.transform_documents(chunks)
# "Customer Sarah Chen ([EMAIL_ADDRESS]) reported a billing error."

vectorstore = FAISS.from_documents(safe_chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# --- Query: retrieve then tokenize ---
bf = Blindfold()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

question = "What was Sarah Chen's issue?"

# 1. Search with original question — "Sarah Chen" matches the vector store
docs = retriever.invoke(question)
context = format_docs(docs)

# 2. Tokenize context + question together (consistent token numbering)
combined = f"Context:\n{context}\n\nQuestion: {question}"
result = bf.tokenize(combined)
safe_context, safe_question = result.tokenized_text.split("\n\nQuestion: ", 1)
safe_context = safe_context.removeprefix("Context:\n")

# 3. LLM call with tokenized inputs
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using this context:\n{context}"),
    ("human", "{question}"),
])
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
tokenized_answer = chain.invoke({"context": safe_context, "question": safe_question})

# 4. Detokenize to restore original names
answer = bf.detokenize(tokenized_answer, result.mapping)
print(answer)
# "Sarah Chen reported a billing error on her account."

Security Trade-offs

There are three approaches to handling PII at ingestion time, each with different trade-offs between privacy, searchability, and complexity:

Selective Redaction (recommended)

Keep names in the documents for searchability, redact contact information like email addresses and phone numbers. Users can search by name and get accurate retrieval results. Contact details are permanently removed with no way to recover them.

python
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    entities=["email address", "phone number"],
)
# "Sarah Chen ([EMAIL_ADDRESS]) reported..." — name stays, email gone

Full Redaction

Redact all PII including names. This provides the strongest privacy guarantee — no personal data of any kind in the vector store. The trade-off is that name-based searches no longer work. Users would need to search by topic or keywords instead.

python
transformer = BlindfoldPIITransformer(
    pii_method="redact",
)
# "[PERSON] ([EMAIL_ADDRESS]) reported..." — all PII removed

Tokenize with Stored Mapping (advanced)

Replace all PII with reversible tokens and store the mapping. This gives full privacy in the vector store (no real PII in embeddings) while allowing authorized processes to restore original values when needed. The trade-off is added complexity — you need to manage and secure the token mappings, and name-based vector search will not work because the embeddings contain tokens rather than real names.

python
transformer = BlindfoldPIITransformer(
    pii_method="tokenize",
)
# "<Person_1> (<Email Address_1>) reported..." — reversible, mapping in metadata
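As a sketch of what the authorized restore path might look like, assuming the metadata layout shown above: `restore_pii` is a hypothetical helper (not part of the SDK) that applies a document's stored blindfold_mapping via plain string replacement.

```python
def restore_pii(page_content: str, metadata: dict) -> str:
    """Replace tokens in page_content using the document's stored mapping."""
    mapping = metadata.get("blindfold_mapping", {})
    for token, original in mapping.items():
        page_content = page_content.replace(token, original)
    return page_content

doc_text = "Customer <Person_1> (<Email Address_1>) reported a billing error."
meta = {
    "source": "tickets.csv",
    "blindfold_mapping": {
        "<Person_1>": "Sarah Chen",
        "<Email Address_1>": "sarah@acme.com",
    },
}
print(restore_pii(doc_text, meta))
# Customer Sarah Chen (sarah@acme.com) reported a billing error.
```

In a real deployment this call would sit behind an authorization check, and the mappings themselves should be stored and access-controlled as sensitive data.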

Which to choose? For most RAG pipelines, selective redaction is the best starting point. It preserves the search experience users expect (asking about a person by name) while removing the contact details that create the most compliance risk. Move to full redaction only if your compliance requirements prohibit storing names in the vector store.

Policy Recommendations

Match the compliance policy to your RAG use case:

| Use case | Policy | Region | Key entities |
| --- | --- | --- | --- |
| General knowledge base | basic | - | Names, emails, phones, addresses, credit cards |
| EU customer data | gdpr_eu | eu | Names, IBANs, national IDs, DOB, addresses |
| Healthcare documents | hipaa_us | us | All 18 HIPAA identifiers (SSN, MRN, DOB, etc.) |
| Payment records | pci_dss | - | Credit cards, CVVs, expiration dates |
python
# EU customer support RAG — selective redaction + tokenize at query time
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    policy="gdpr_eu",
    region="eu",
    entities=["email address", "phone number", "IBAN"],
)
bf = Blindfold(policy="gdpr_eu", region="eu")

# Healthcare RAG — full redaction (HIPAA requires removing all 18 identifiers)
transformer = BlindfoldPIITransformer(
    pii_method="redact",
    policy="hipaa_us",
    region="us",
)
bf = Blindfold(policy="hipaa_us", region="us")

Try It Yourself

Get started with PII-safe LangChain RAG:

bash
pip install langchain-blindfold blindfold-sdk langchain-openai faiss-cpu

Already using LangChain for RAG? Adding PII protection requires minimal changes. The BlindfoldPIITransformer is a standard document transformer that slots into your existing ingestion pipeline. At query time, the retrieve-then-tokenize pattern adds a few lines around your existing retriever and LLM calls. Your loaders, splitters, embeddings, and retriever configurations stay exactly the same.

Start protecting sensitive data

Free plan includes 500K characters/month. No credit card required.