RAG · February 25, 2026 · 10 min read

How to Build a PII-Safe RAG Pipeline

RAG pipelines are one of the most common paths for PII to leak into LLMs. Learn how to protect personal data with two-layer privacy: ingestion-time redaction and query-time tokenization using ChromaDB, OpenAI, and Blindfold.

Retrieval-Augmented Generation has become the standard pattern for building AI applications that answer questions using your own data. Connect a vector database to an LLM, retrieve relevant documents, and let the model synthesize an answer. The results are impressive — but there is a problem hiding in every retrieval step: your documents are full of personally identifiable information, and that PII flows straight to your LLM provider.

The solution is not to blindly strip every piece of personal data from your knowledge base. Names, for example, are critical for search relevance — if a customer asks about “Sarah Chen's billing issue,” you need the name to exist in your vector store so the retriever can find the right document. The real danger is contact information (emails, phone numbers, SSNs) that has no business leaving your infrastructure.

This article shows you the corrected architecture for PII-safe RAG: selective redaction at ingestion time (removing contact details while keeping names for searchability) combined with a search-first query flow that uses a single tokenize call on the combined context and question before sending anything to the LLM. The result is a pipeline where the LLM never sees a single real email address, phone number, or SSN — yet search remains accurate and responses are fully personalized.

The Hidden PII Problem in RAG

In a standard chatbot, you control the input surface. You can scan the user's message before sending it to the LLM and strip out anything sensitive. But RAG changes the equation fundamentally. The retrieval step pulls data from your knowledge base and injects it into the prompt — without any user action and without any obvious place to intervene.

Consider what happens when a customer asks “Why was I charged twice?” in a support chatbot. The retriever searches your vector database, finds the three most relevant support tickets, and adds them to the prompt as context. Those tickets contain the customer's name, email address, account number, and possibly credit card details. All of that data is now in the API request to your LLM provider.

There are three distinct attack surfaces in a RAG pipeline where PII can leak:

  1. Vector database storage: Your embeddings are generated from raw documents that contain PII. Even if embeddings themselves are not directly reversible, the original text chunks stored alongside them are. A breach of your vector database exposes every piece of customer data you ever ingested.
  2. Retrieval-to-prompt injection: Retrieved documents get concatenated into the prompt. This is the most dangerous surface because it happens automatically. Every query can pull in PII from multiple unrelated documents, creating a data aggregation risk that is hard to audit.
  3. LLM provider logging: Your provider receives the full prompt, including all retrieved context. Even providers with strong data policies may retain logs temporarily for abuse detection or debugging. Under GDPR, this constitutes a data transfer to a third party — and you need a legal basis for it.

Solving only one of these surfaces is not enough. You need protection at both ingestion time and query time to fully close the gap.
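To see surface 2 concretely, here is the naive retrieval-to-prompt flow most tutorials ship. This is a schematic sketch: the `retrieve` helper is a trivial keyword-match stand-in for a real vector search, used here only to show that retrieved text reaches the prompt verbatim.

```python
# Naive RAG prompt assembly: retrieved text goes to the LLM verbatim.
TICKETS = [
    "Customer Sarah Chen (sarah.chen@acme.com) was charged twice on account 4829-1038-2847.",
    "John Martinez requested a refund. Phone: 555-0142.",
]

def retrieve(query: str) -> list[str]:
    # Stand-in for a vector search: return tickets sharing any word with the query
    words = set(query.lower().split())
    return [t for t in TICKETS if words & set(t.lower().split())]

question = "Why was Sarah Chen charged twice?"
prompt = "Context:\n" + "\n".join(retrieve(question)) + f"\n\nQuestion: {question}"

# The raw email address is now inside the prompt headed for the LLM provider
assert "sarah.chen@acme.com" in prompt
```

No user typed that email address; retrieval injected it. That is what makes this surface hard to audit.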

Two Protection Layers

The key insight is that a RAG pipeline has two distinct data flows, and each requires a different protection strategy. At ingestion time, documents flow into your vector database. At query time, retrieved context and the user's question flow to the LLM. Blindfold provides two operations that map to these flows: redact for selective ingestion cleanup, and tokenize for reversible protection before the LLM sees the prompt.

Layer 1: Selective Ingestion-Time Redaction

Before you embed and store any document, strip out the contact information that poses the highest risk — email addresses, phone numbers, SSNs — while keeping names intact for searchability. The entities parameter on blindfold.redact() gives you this control. By specifying exactly which entity types to redact, you remove dangerous contact data while preserving the names and context that make vector search effective.

The ingestion flow looks like this: raw documents are split into chunks, each chunk is selectively redacted (contact info removed, names preserved), and then the cleaned chunks are embedded and stored.

python
import blindfold
from langchain.text_splitter import RecursiveCharacterTextSplitter
import chromadb

# Initialize clients
bf = blindfold.Blindfold(api_key="your-blindfold-api-key")
chroma = chromadb.PersistentClient(path="./vectorstore")
collection = chroma.get_or_create_collection("support_tickets")

# Raw documents with PII
documents = [
    "Customer Sarah Chen (sarah.chen@acme.com) reported a billing issue on 2025-12-01. Account 4829-1038-2847 was charged twice for $49.99.",
    "John Martinez called on 2025-12-03 about a refund. Phone: 555-0142. SSN provided for verification: 412-55-6789.",
    "Support ticket from jane.doe@example.org: Unable to access account since password reset. IP address 192.168.1.42 flagged.",
]

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = []
for doc in documents:
    chunks.extend(splitter.split_text(doc))

# Selectively redact contact info — keep names for searchability
for i, chunk in enumerate(chunks):
    result = bf.redact(chunk, entities=["email address", "phone number", "us social security number"])
    clean_text = result.text
    # "Customer Sarah Chen ([EMAIL_ADDRESS]) reported a billing issue..."
    # Names preserved — emails, phones, SSNs removed
    collection.add(
        documents=[clean_text],
        ids=[f"chunk_{i}"],
    )

After ingestion, the vector database contains text like "Customer Sarah Chen ([EMAIL_ADDRESS]) reported a billing issue..." instead of real email addresses and phone numbers. Names are still present, so a search for “Sarah Chen billing” still returns the correct document. But the sensitive contact data is gone permanently.
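If you want to exercise the ingestion flow before wiring up an API key, a crude regex stand-in for the selective redaction step looks like this. Illustrative only: real entity detection is far more robust than these three patterns, and the placeholder names are chosen to mimic the output shown above.

```python
import re

# Order matters: SSNs first, or the phone pattern would partially match them
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[US_SSN]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL_ADDRESS]"),
    (re.compile(r"\b\d{3}-\d{4}\b"), "[PHONE_NUMBER]"),
]

def redact_contact(text: str) -> str:
    # Remove contact info, leave names untouched (same policy as Layer 1)
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(redact_contact("Sarah Chen (sarah.chen@acme.com), phone 555-0142, SSN 412-55-6789."))
```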

Layer 2: Search-First Query-Time Tokenization

At query time, the goal is to search with the original question (so the retriever can match names), then protect everything before the LLM sees it. The critical detail is that you must use a single tokenize call on the combined context and question — not separate calls for each.

Why does a single call matter? Each tokenize call produces its own independent token numbering. If you tokenize the question separately from the context, the name “Sarah Chen” could become PERSON_1 in the question while PERSON_1 in the context maps to a completely different person. The LLM would see conflicting mappings, and detokenization would produce garbled output. A single call ensures consistent token assignment across the entire prompt.

The query flow is: search with the original question, combine retrieved context with the question into a single string, tokenize that combined string once, send the tokenized prompt to the LLM, then detokenize the response.

python
# User asks a question containing PII
user_query = "My name is Sarah Chen and my email is sarah.chen@acme.com. Why was I charged twice?"

# Step 1: Search with the ORIGINAL question (names preserved in vector store)
results = collection.query(query_texts=[user_query], n_results=3)
context = "\n".join(results["documents"][0])

# Step 2: Combine context + question into a single string
combined_prompt = f"Context:\n{context}\n\nQuestion: {user_query}"

# Step 3: Single tokenize call on the combined text
# This ensures consistent token assignment across context and question
tok = bf.tokenize(combined_prompt)
safe_prompt = tok.text
# "Context:\nCustomer PERSON_1 ([EMAIL_ADDRESS]) reported...\n\nQuestion: My name is PERSON_1 and my email is EMAIL_1..."

# Step 4: Send tokenized prompt to LLM — no real PII
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using the provided context."},
        {"role": "user", "content": safe_prompt},
    ],
)

# Step 5: Detokenize the response to restore real names
llm_answer = response.choices[0].message.content
final_answer = bf.detokenize(llm_answer, tok.mapping).text
# "Sarah Chen, I can see your account was charged twice. I've initiated a refund to sarah.chen@acme.com."

The key point: the LLM receives a prompt where every name and email is replaced with consistent tokens like PERSON_1 and EMAIL_1. Because both the context and question were tokenized in a single call, the same real-world entity always maps to the same token. The LLM can reason about “PERSON_1” across both the question and context correctly, and detokenization restores everything at the end.
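The numbering problem is easy to demonstrate with a toy tokenizer. This is a deliberately simplified stand-in that matches against a fixed name list; real detection is model-based, but the per-call numbering behavior is the same.

```python
def toy_tokenize(text: str, known_names: list[str]):
    # Assign PERSON_n tokens in list order, restarting at 1 for every call
    mapping = {}
    for name in known_names:
        if name in text:
            token = f"PERSON_{len(mapping) + 1}"
            mapping[token] = name
            text = text.replace(name, token)
    return text, mapping

names = ["John Martinez", "Sarah Chen"]

# Two independent calls: numbering restarts, so PERSON_1 means different people
_, ctx_map = toy_tokenize("John Martinez escalated Sarah Chen's ticket.", names)
_, q_map = toy_tokenize("I am Sarah Chen. Why was I charged twice?", names)
# ctx_map["PERSON_1"] is John Martinez; q_map["PERSON_1"] is Sarah Chen

# One call on the combined string: each entity gets exactly one token throughout
combined, c_map = toy_tokenize(
    "John Martinez escalated Sarah Chen's ticket.\nI am Sarah Chen.", names
)
```

With the combined call, Sarah Chen is PERSON_2 everywhere, and detokenization can restore her name unambiguously.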

Full Pipeline Example

Here is a complete, working example that puts both layers together. It ingests documents with selective redaction, then handles queries with the search-first, single-tokenize approach. You can copy this and run it directly.

python
import blindfold
import chromadb
from openai import OpenAI

# --- Setup ---
bf = blindfold.Blindfold(api_key="your-blindfold-api-key")
openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./vectorstore")
collection = chroma.get_or_create_collection("knowledge_base")

# --- Layer 1: Ingest with selective redaction ---
# Remove contact info but keep names for search relevance
raw_docs = [
    "Customer Sarah Chen (sarah.chen@acme.com) reported a billing error on account 4829-1038-2847.",
    "John Martinez (SSN 412-55-6789) requested a refund. Phone: 555-0142.",
    "Support ticket from jane.doe@example.org: login issues after password reset.",
]

for i, doc in enumerate(raw_docs):
    redacted = bf.redact(doc, entities=["email address", "phone number", "us social security number"])
    collection.add(documents=[redacted.text], ids=[f"doc_{i}"])

# --- Layer 2: Query with search-first + single tokenize ---
def ask(question: str) -> str:
    # Search with original question (names in store enable accurate retrieval)
    hits = collection.query(query_texts=[question], n_results=3)
    context = "\n".join(hits["documents"][0])

    # Combine context + question, then tokenize once
    combined = f"Context:\n{context}\n\nQuestion: {question}"
    tok = bf.tokenize(combined)

    # Call LLM with protected prompt
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using the context provided."},
            {"role": "user", "content": tok.text},
        ],
    )

    # Detokenize to restore real names in the response
    return bf.detokenize(resp.choices[0].message.content, tok.mapping).text

# --- Usage ---
answer = ask("I'm Sarah Chen (sarah.chen@acme.com). Why was I charged twice?")
print(answer)
# "Sarah Chen, I found a billing error on your account. A duplicate charge was detected and a refund has been initiated. Confirmation will be sent to sarah.chen@acme.com."

Why a single tokenize call matters: If you tokenize the context and question separately, each call produces independent token numbering. The name “Sarah Chen” might become PERSON_1 in the question while PERSON_1 in the context maps to someone else entirely. The LLM sees conflicting identities and detokenization produces garbled output. A single call on the combined string guarantees that the same real-world entity always maps to the same token everywhere in the prompt.

Protection Strategy Comparison

Blindfold supports several protection methods. Choosing the right one depends on whether you need to recover the original data and what kind of output is acceptable for your use case.

| Method | Reversible | Use Case | Output Example |
| --- | --- | --- | --- |
| redact | No | Permanent removal — ingestion, logs, analytics | [PERSON] |
| tokenize | Yes | LLM queries where you need real data in the response | PERSON_1 |
| encrypt | Yes | Regulated environments — HIPAA, financial data | enc:aGVsbG8... |
| hash | No | De-identification with consistency across documents | a1b2c3d4 |

For RAG pipelines, the recommended combination is selective redact at ingestion time (targeting contact information) and tokenize at query time on the combined prompt. Redaction ensures your vector database never stores contact details, while tokenization lets you restore personal data in the final response. If you are working in a regulated environment like healthcare, consider using encrypt instead of tokenize for query-time protection, as it provides cryptographic guarantees.
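As a concrete illustration of the hash row, here is a plain hashlib sketch (not the Blindfold API): a salted, deterministic digest yields the same pseudonym for the same value in every document, which is exactly what makes it useful for cross-document de-identification and useless for recovery.

```python
import hashlib

def pseudonym(value: str, salt: str = "per-dataset-secret") -> str:
    # Deterministic and one-way: same input always yields the same token,
    # but the original value cannot be recovered from the digest
    digest = hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()
    return digest[:8]

# The same email hashes identically in every document it appears in
print(pseudonym("sarah.chen@acme.com") == pseudonym("sarah.chen@acme.com"))
```

Keep the salt secret and stable: without it, common values can be guessed by brute force; if it changes, cross-document consistency is lost.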

Security Trade-offs

There is no one-size-fits-all answer for how aggressively to redact at ingestion time. The right choice depends on your threat model, search requirements, and compliance obligations. Here are three common configurations:

| Strategy | Ingestion | Query Time | Search Quality | Vector DB Risk |
| --- | --- | --- | --- | --- |
| Maximum searchability | No redaction at ingestion | Tokenize before LLM only | Best — all data available for matching | Highest — full PII in vector store |
| Balanced (recommended) | Selective redaction (contact info only) | Tokenize before LLM | Good — names preserved for search | Low — no emails, phones, or SSNs stored |
| Maximum security | Full redaction (all PII entities) | Tokenize before LLM | Reduced — content-based search only | Minimal — no PII in vector store at all |

The balanced approach is recommended for most production deployments. It removes high-risk contact information (emails, phone numbers, SSNs, credit card numbers) while keeping names that enable accurate person-specific search. The LLM is protected regardless of which ingestion strategy you choose, because the single tokenize call before the LLM catches any remaining PII in both the context and the question.

If your regulatory environment requires that no PII exists anywhere outside your primary database, use the maximum security configuration. You will lose person-specific search (queries like “Sarah Chen's billing issue” will not match), but content-based queries like “billing error on account” will still work. For many support and knowledge base use cases, this trade-off is acceptable.

GDPR Considerations

If your RAG pipeline processes data from EU residents, GDPR imposes specific requirements on how that data flows through your system. Sending unprotected personal data to a US-based LLM provider is a cross-border data transfer under Articles 44–49, and you need a legal basis for every transfer.

The dual-layer approach described above directly supports GDPR compliance in several ways:

  • Data minimization (Article 5(1)(c)): By redacting PII at ingestion time, you ensure that your vector database stores only the minimum data necessary. No email addresses, no phone numbers, no identifiers — just the semantic content needed for retrieval plus names where required for search accuracy.
  • Transfer protection: Tokenization at query time means no real personal data is included in API calls to your LLM provider. The provider receives only opaque tokens that cannot be reversed without access to your Blindfold mapping.
  • EU region processing: Blindfold offers EU-region API endpoints so that the tokenization and detokenization steps themselves happen within the EU. Combined with the gdpr_eu policy, this ensures that the full pipeline respects data residency requirements.
  • Audit trail: Every redaction and tokenization operation is logged with entity types detected, timestamps, and session identifiers. This gives your Data Protection Officer the evidence they need to demonstrate compliance during audits or in response to Data Subject Access Requests.
python
# Use the GDPR policy for EU-specific entity detection
bf = blindfold.Blindfold(
    api_key="your-blindfold-api-key",
    region="eu",        # Process data within the EU
)

# Redact with GDPR policy — detects EU-specific entities like IBAN, national IDs
result = bf.redact(document, policy="gdpr_eu")

# Tokenize with GDPR policy at query time
token_result = bf.tokenize(combined_prompt, policy="gdpr_eu")

With the gdpr_eu policy, Blindfold detects EU-specific entity types such as IBAN codes, national identity numbers, and EU tax identifiers in addition to the standard PII categories. This gives you broader coverage for European data without any additional configuration.

Advanced: Tokenize with Stored Mapping

The approach above works well for most pipelines, but for maximum security you may want to avoid storing any real PII in your vector database at all — not even names. In this advanced architecture, you tokenize documents at ingestion time, store the tokenized text in your vector store, and save the token mappings alongside each document. At query time, you build a reverse lookup from the stored mappings so that you can detokenize the final response.

This approach trades some search accuracy for complete PII elimination in the vector store. You will need to tokenize the user's query as well (since the stored documents use tokens instead of real names), and you must merge the ingestion and query mappings carefully to ensure correct detokenization.

python
import json
import blindfold
import chromadb
from openai import OpenAI

bf = blindfold.Blindfold(api_key="your-blindfold-api-key")
openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./vectorstore")
collection = chroma.get_or_create_collection("fully_tokenized")

# --- Ingestion: tokenize and store mapping ---
raw_docs = [
    "Customer Sarah Chen (sarah.chen@acme.com) reported a billing error.",
    "John Martinez (SSN 412-55-6789) requested a refund.",
]

for i, doc in enumerate(raw_docs):
    tok = bf.tokenize(doc)
    # Store tokenized text + save the mapping as metadata
    collection.add(
        documents=[tok.text],
        metadatas=[{"mapping": json.dumps(tok.mapping)}],
        ids=[f"doc_{i}"],
    )

# --- Query: build reverse lookup from stored mappings ---
def ask(question: str) -> str:
    # Tokenize the question (vector store has tokenized text)
    q_tok = bf.tokenize(question)

    # Search with tokenized query
    hits = collection.query(query_texts=[q_tok.text], n_results=3)
    context = "\n".join(hits["documents"][0])

    # Merge mappings: query tokens + all retrieved doc tokens.
    # Caveat: token keys from independent tokenize calls can collide
    # (PERSON_1 in two documents may name different people), and
    # dict.update() keeps only the last value for a colliding key.
    merged_mapping = {**q_tok.mapping}
    for meta in hits["metadatas"][0]:
        doc_mapping = json.loads(meta["mapping"])
        merged_mapping.update(doc_mapping)

    # Send tokenized prompt to LLM
    prompt = f"Context:\n{context}\n\nQuestion: {q_tok.text}"
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using the context provided."},
            {"role": "user", "content": prompt},
        ],
    )

    # Detokenize with merged mapping
    return bf.detokenize(resp.choices[0].message.content, merged_mapping).text

Trade-off: This approach eliminates all PII from your vector store, but independent tokenize calls produce separate token numbering. The same name may map to different tokens across documents, which can reduce search accuracy for name-based queries. Use this approach when your threat model requires zero PII at rest and you can rely on content-based search rather than name matching.
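If you do take this route, the merge step deserves care: independent tokenize calls can reuse the same token key for different people, and a plain dict update silently drops one of them. A collision-aware merge might look like the sketch below. It assumes mappings are flat token-to-value dicts; note that renumbering a token also requires rewriting it in the corresponding text, which is exactly the bookkeeping the single-call approach avoids.

```python
import re

def merge_mappings(*mappings: dict) -> dict:
    """Merge token->value dicts; renumber tokens that collide with a
    different value so no entry is silently overwritten."""
    merged = {}
    for mapping in mappings:
        for token, value in mapping.items():
            if token not in merged or merged[token] == value:
                merged[token] = value
                continue
            # Collision: same token key, different value. Mint a fresh
            # token with the same prefix (PERSON_1 -> PERSON_2, ...).
            prefix = re.sub(r"_\d+$", "", token)
            n = 1
            while f"{prefix}_{n}" in merged:
                n += 1
            merged[f"{prefix}_{n}"] = value
    return merged
```

Usage: `merge_mappings({"PERSON_1": "Sarah Chen"}, {"PERSON_1": "John Martinez"})` keeps both people instead of losing one to an overwrite.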

Try It Yourself

Ready to build your own PII-safe RAG pipeline? Here is how to get started.

The entire setup takes about fifteen minutes. Start by installing the SDK, run the ingestion script to populate your vector database with selectively redacted documents, and then wire up the query function with the search-first, single-tokenize approach. From that point on, every query through your RAG pipeline is PII-safe by default.
