How to Build a PII-Safe RAG Pipeline
RAG pipelines are among the most common places where PII leaks into LLMs. Learn how to protect personal data with two-layer privacy: ingestion-time redaction and query-time tokenization using ChromaDB, OpenAI, and Blindfold.
Retrieval-Augmented Generation has become the standard pattern for building AI applications that answer questions using your own data. Connect a vector database to an LLM, retrieve relevant documents, and let the model synthesize an answer. The results are impressive — but there is a problem hiding in every retrieval step: your documents are full of personally identifiable information, and that PII flows straight to your LLM provider.
The solution is not to blindly strip every piece of personal data from your knowledge base. Names, for example, are critical for search relevance — if a customer asks about “Sarah Chen's billing issue,” you need the name to exist in your vector store so the retriever can find the right document. The real danger is contact information (emails, phone numbers, SSNs) that has no business leaving your infrastructure.
This article walks through an architecture for PII-safe RAG: selective redaction at ingestion time (removing contact details while keeping names for searchability), combined with a search-first query flow that makes a single tokenize call on the combined context and question before anything is sent to the LLM. The result is a pipeline where the LLM never sees a single real email address, phone number, or SSN — yet search remains accurate and responses are fully personalized.
The Hidden PII Problem in RAG
In a standard chatbot, you control the input surface. You can scan the user's message before sending it to the LLM and strip out anything sensitive. But RAG changes the equation fundamentally. The retrieval step pulls data from your knowledge base and injects it into the prompt — without any user action and without any obvious place to intervene.
Consider what happens when a customer asks “Why was I charged twice?” in a support chatbot. The retriever searches your vector database, finds the three most relevant support tickets, and adds them to the prompt as context. Those tickets contain the customer's name, email address, account number, and possibly credit card details. All of that data is now in the API request to your LLM provider.
There are three distinct attack surfaces in a RAG pipeline where PII can leak:
- Vector database storage: Your embeddings are generated from raw documents that contain PII. Even if embeddings themselves are not directly reversible, the original text chunks stored alongside them are. A breach of your vector database exposes every piece of customer data you ever ingested.
- Retrieval-to-prompt injection: Retrieved documents get concatenated into the prompt. This is the most dangerous surface because it happens automatically. Every query can pull in PII from multiple unrelated documents, creating a data aggregation risk that is hard to audit.
- LLM provider logging: Your provider receives the full prompt, including all retrieved context. Even providers with strong data policies may retain logs temporarily for abuse detection or debugging. Under GDPR, this constitutes a data transfer to a third party — and you need a legal basis for it.
Solving only one of these surfaces is not enough. You need protection at both ingestion time and query time to fully close the gap.
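To make the retrieval-to-prompt surface concrete, here is a minimal sketch in plain Python, with no retrieval stack involved. The retrieved strings and the email regex are purely illustrative (a real pipeline would pull chunks from a vector store, and a regex is not a real PII detector):

```python
import re

# Illustrative chunks of the kind a retriever returns verbatim from the store
retrieved = [
    "Customer Sarah Chen (sarah.chen@acme.com) reported a duplicate charge.",
    "Refund approved for account 4829-1038-2847. Callback number: 555-0142.",
]

# A naive pipeline concatenates retrieved chunks straight into the prompt
prompt = "Context:\n" + "\n".join(retrieved) + "\n\nQuestion: Why was I charged twice?"

# A quick scan shows the PII that is now inside the outbound API request
emails = re.findall(r"[\w.+-]+@[\w-]+\.\w+", prompt)
print(emails)  # ['sarah.chen@acme.com']
```

No user action put that email in the request; the retriever did. That is why protection has to happen inside the pipeline, not at the input box.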
Two Protection Layers
The key insight is that a RAG pipeline has two distinct data flows, and each requires a different protection strategy. At ingestion time, documents flow into your vector database. At query time, retrieved context and the user's question flow to the LLM. Blindfold provides two operations that map to these flows: redact for selective ingestion cleanup, and tokenize for reversible protection before the LLM sees the prompt.
Layer 1: Selective Ingestion-Time Redaction
Before you embed and store any document, strip out the contact information that poses the highest risk — email addresses, phone numbers, SSNs — while keeping names intact for searchability. The entities parameter on blindfold.redact() gives you this control. By specifying exactly which entity types to redact, you remove dangerous contact data while preserving the names and context that make vector search effective.
The ingestion flow looks like this: raw documents are split into chunks, each chunk is selectively redacted (contact info removed, names preserved), and then the cleaned chunks are embedded and stored.
```python
import blindfold
from langchain.text_splitter import RecursiveCharacterTextSplitter
import chromadb

# Initialize clients
bf = blindfold.Blindfold(api_key="your-blindfold-api-key")
chroma = chromadb.PersistentClient(path="./vectorstore")
collection = chroma.get_or_create_collection("support_tickets")

# Raw documents with PII
documents = [
    "Customer Sarah Chen (sarah.chen@acme.com) reported a billing issue on 2025-12-01. Account 4829-1038-2847 was charged twice for $49.99.",
    "John Martinez called on 2025-12-03 about a refund. Phone: 555-0142. SSN provided for verification: 412-55-6789.",
    "Support ticket from jane.doe@example.org: Unable to access account since password reset. IP address 192.168.1.42 flagged.",
]

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = []
for doc in documents:
    chunks.extend(splitter.split_text(doc))

# Selectively redact contact info — keep names for searchability
for i, chunk in enumerate(chunks):
    result = bf.redact(
        chunk,
        entities=["email address", "phone number", "us social security number"],
    )
    clean_text = result.text
    # "Customer Sarah Chen ([EMAIL_ADDRESS]) reported a billing issue..."
    # Names preserved — emails, phones, SSNs removed
    collection.add(
        documents=[clean_text],
        ids=[f"chunk_{i}"],
    )
```
After ingestion, the vector database contains text like "Customer Sarah Chen ([EMAIL_ADDRESS]) reported a billing issue..." instead of real email addresses and phone numbers. Names are still present, so a search for “Sarah Chen billing” still returns the correct document. But the sensitive contact data is gone permanently.
Layer 2: Search-First Query-Time Tokenization
At query time, the goal is to search with the original question (so the retriever can match names), then protect everything before the LLM sees it. The critical detail is that you must use a single tokenize call on the combined context and question — not separate calls for each.
Why does a single call matter? Each tokenize call produces its own independent token numbering. If you tokenize the question separately from the context, “Sarah Chen” might become PERSON_1 in the question while PERSON_1 in the context refers to a different person entirely. The LLM would see two identities behind the same token, and detokenization would produce garbled output. A single call ensures consistent token assignment across the entire prompt.
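The numbering collision is easy to reproduce with a toy stand-in. The `mock_tokenize` function below is not Blindfold's implementation; it simply numbers entities from a fixed name list, restarting at 1 on every call, which is enough to show why independent calls diverge:

```python
KNOWN_NAMES = ["John Martinez", "Sarah Chen"]  # toy entity detector

def mock_tokenize(text: str):
    """Toy stand-in for a tokenize call: each invocation starts numbering
    at PERSON_1, so two separate calls assign tokens independently."""
    mapping, counter = {}, 0
    for name in KNOWN_NAMES:
        if name in text:
            counter += 1
            token = f"PERSON_{counter}"
            mapping[token] = name
            text = text.replace(name, token)
    return text, mapping

context = "John Martinez requested a refund. Sarah Chen reported a billing error."
question = "I'm Sarah Chen. Why was I charged twice?"

# Separate calls: PERSON_1 means John in the context but Sarah in the question
_, ctx_map = mock_tokenize(context)
_, q_map = mock_tokenize(question)
print(ctx_map["PERSON_1"], "/", q_map["PERSON_1"])  # John Martinez / Sarah Chen

# One call on the combined string: each person gets exactly one token
_, both_map = mock_tokenize(f"{context}\n\nQuestion: {question}")
print(both_map)  # {'PERSON_1': 'John Martinez', 'PERSON_2': 'Sarah Chen'}
```

With separate calls, a detokenizer handed either mapping would rewrite PERSON_1 to the wrong person in half the prompt; the combined call removes the ambiguity.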
The query flow is: search with the original question, combine retrieved context with the question into a single string, tokenize that combined string once, send the tokenized prompt to the LLM, then detokenize the response.
```python
# User asks a question containing PII
user_query = "My name is Sarah Chen and my email is sarah.chen@acme.com. Why was I charged twice?"

# Step 1: Search with the ORIGINAL question (names preserved in vector store)
results = collection.query(query_texts=[user_query], n_results=3)
context = "\n".join(results["documents"][0])

# Step 2: Combine context + question into a single string
combined_prompt = f"Context:\n{context}\n\nQuestion: {user_query}"

# Step 3: Single tokenize call on the combined text
# This ensures consistent token assignment across context and question
tok = bf.tokenize(combined_prompt)
safe_prompt = tok.text
# "Context:\nCustomer PERSON_1 ([EMAIL_ADDRESS]) reported...\n\nQuestion: My name is PERSON_1 and my email is EMAIL_1..."

# Step 4: Send tokenized prompt to LLM — no real PII
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using the provided context."},
        {"role": "user", "content": safe_prompt},
    ],
)

# Step 5: Detokenize the response to restore real names
llm_answer = response.choices[0].message.content
final_answer = bf.detokenize(llm_answer, tok.mapping).text
# "Sarah Chen, I can see your account was charged twice. I've initiated a refund to sarah.chen@acme.com."
```
The key point: the LLM receives a prompt where every name and email is replaced with consistent tokens like PERSON_1 and EMAIL_1. Because both the context and question were tokenized in a single call, the same real-world entity always maps to the same token. The LLM can reason about “PERSON_1” across both the question and context correctly, and detokenization restores everything at the end.
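Detokenization itself is conceptually simple. Here is a minimal sketch, assuming the mapping is a token-to-value dictionary like the one a tokenize call returns (the helper below is illustrative, not the SDK's implementation):

```python
def detokenize_sketch(text: str, mapping: dict[str, str]) -> str:
    # Replace longest tokens first so PERSON_10 is handled before PERSON_1
    for token in sorted(mapping, key=len, reverse=True):
        text = text.replace(token, mapping[token])
    return text

mapping = {"PERSON_1": "Sarah Chen", "EMAIL_1": "sarah.chen@acme.com"}
llm_answer = "PERSON_1, a refund confirmation was sent to EMAIL_1."
print(detokenize_sketch(llm_answer, mapping))
# Sarah Chen, a refund confirmation was sent to sarah.chen@acme.com.
```

The mapping never leaves your infrastructure, which is exactly why the tokens are safe to send to a third-party LLM.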
Full Pipeline Example
Here is a complete, working example that puts both layers together. It ingests documents with selective redaction, then handles queries with the search-first, single-tokenize approach. You can copy this and run it directly.
```python
import blindfold
import chromadb
from openai import OpenAI

# --- Setup ---
bf = blindfold.Blindfold(api_key="your-blindfold-api-key")
openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./vectorstore")
collection = chroma.get_or_create_collection("knowledge_base")

# --- Layer 1: Ingest with selective redaction ---
# Remove contact info but keep names for search relevance
raw_docs = [
    "Customer Sarah Chen (sarah.chen@acme.com) reported a billing error on account 4829-1038-2847.",
    "John Martinez (SSN 412-55-6789) requested a refund. Phone: 555-0142.",
    "Support ticket from jane.doe@example.org: login issues after password reset.",
]

for i, doc in enumerate(raw_docs):
    redacted = bf.redact(
        doc,
        entities=["email address", "phone number", "us social security number"],
    )
    collection.add(documents=[redacted.text], ids=[f"doc_{i}"])

# --- Layer 2: Query with search-first + single tokenize ---
def ask(question: str) -> str:
    # Search with original question (names in store enable accurate retrieval)
    hits = collection.query(query_texts=[question], n_results=3)
    context = "\n".join(hits["documents"][0])

    # Combine context + question, then tokenize once
    combined = f"Context:\n{context}\n\nQuestion: {question}"
    tok = bf.tokenize(combined)

    # Call LLM with protected prompt
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using the context provided."},
            {"role": "user", "content": tok.text},
        ],
    )

    # Detokenize to restore real names in the response
    return bf.detokenize(resp.choices[0].message.content, tok.mapping).text

# --- Usage ---
answer = ask("I'm Sarah Chen (sarah.chen@acme.com). Why was I charged twice?")
print(answer)
# "Sarah Chen, I found a billing error on your account. A duplicate charge was
#  detected and a refund has been initiated. Confirmation will be sent to
#  sarah.chen@acme.com."
```
Why a single tokenize call matters: if you tokenize the context and question separately, each call produces independent token numbering. “Sarah Chen” might be PERSON_1 in the question while PERSON_1 in the context refers to someone else entirely. The LLM sees conflicting identities, and detokenization produces garbled output. A single call on the combined string guarantees that the same real-world entity maps to the same token everywhere in the prompt.
Protection Strategy Comparison
Blindfold supports several protection methods. Choosing the right one depends on whether you need to recover the original data and what kind of output is acceptable for your use case.
| Method | Reversible | Use Case | Output Example |
|---|---|---|---|
| `redact` | No | Permanent removal — ingestion, logs, analytics | `[PERSON]` |
| `tokenize` | Yes | LLM queries where you need real data in the response | `PERSON_1` |
| `encrypt` | Yes | Regulated environments — HIPAA, financial data | `enc:aGVsbG8...` |
| `hash` | No | De-identification with consistency across documents | `a1b2c3d4` |
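The consistency property in the hash row is worth a quick illustration. A salted SHA-256, sketched below (not necessarily how Blindfold computes its hashes), maps the same value to the same pseudonym in every document, so cross-document joins survive even though the original is unrecoverable:

```python
import hashlib

def pseudonymize(value: str, salt: str = "per-deployment-secret") -> str:
    """Consistent, irreversible pseudonym: same input, same output."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:8]

# The same email hashed while ingesting two different documents
doc_a_token = pseudonymize("sarah.chen@acme.com")
doc_b_token = pseudonymize("sarah.chen@acme.com")

assert doc_a_token == doc_b_token  # consistent across documents
assert doc_a_token != pseudonymize("john.martinez@acme.com")  # people stay distinct
```

The salt matters: without one, anyone with a list of candidate emails could recompute the hashes and re-identify records.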
For RAG pipelines, the recommended combination is selective redact at ingestion time (targeting contact information) and tokenize at query time on the combined prompt. Redaction ensures your vector database never stores contact details, while tokenization lets you restore personal data in the final response. If you are working in a regulated environment like healthcare, consider using encrypt instead of tokenize for query-time protection, as it provides cryptographic guarantees.
Security Trade-offs
There is no one-size-fits-all answer for how aggressively to redact at ingestion time. The right choice depends on your threat model, search requirements, and compliance obligations. Here are three common configurations:
| Strategy | Ingestion | Query Time | Search Quality | Vector DB Risk |
|---|---|---|---|---|
| Maximum searchability | No redaction at ingestion | Tokenize before LLM only | Best — all data available for matching | Highest — full PII in vector store |
| Balanced (recommended) | Selective redaction (contact info only) | Tokenize before LLM | Good — names preserved for search | Low — no emails, phones, or SSNs stored |
| Maximum security | Full redaction (all PII entities) | Tokenize before LLM | Reduced — content-based search only | Minimal — no PII in vector store at all |
The balanced approach is recommended for most production deployments. It removes high-risk contact information (emails, phone numbers, SSNs, credit card numbers) while keeping names that enable accurate person-specific search. The LLM is protected regardless of which ingestion strategy you choose, because the single tokenize call before the LLM catches any remaining PII in both the context and the question.
If your regulatory environment requires that no PII exists anywhere outside your primary database, use the maximum security configuration. You will lose person-specific search (queries like “Sarah Chen's billing issue” will not match), but content-based queries like “billing error on account” will still work. For many support and knowledge base use cases, this trade-off is acceptable.
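The search trade-off between the balanced and maximum-security strategies is easy to see with a toy regex stand-in for the redactor (illustrative only; Blindfold uses entity detection, not these patterns):

```python
import re

ticket = "Customer Sarah Chen (sarah.chen@acme.com) reported a billing error."

# Balanced: strip contact info, keep the name searchable
selective = re.sub(r"[\w.+-]+@[\w-]+\.\w+", "[EMAIL_ADDRESS]", ticket)

# Maximum security: also strip the name
full = selective.replace("Sarah Chen", "[PERSON]")

print("Sarah Chen" in selective)  # True  -> "Sarah Chen billing" can still match
print("Sarah Chen" in full)       # False -> only terms like "billing error" match
```

Whichever strategy you pick, the stored text is what the embedder sees, so redaction decisions made at ingestion directly determine what queries can retrieve later.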
GDPR Considerations
If your RAG pipeline processes data from EU residents, GDPR imposes specific requirements on how that data flows through your system. Sending unprotected personal data to a US-based LLM provider is a cross-border data transfer under Articles 44–49, and you need a legal basis for every transfer.
The dual-layer approach described above directly supports GDPR compliance in several ways:
- Data minimization (Article 5(1)(c)): By redacting PII at ingestion time, you ensure that your vector database stores only the minimum data necessary. No email addresses, no phone numbers, no identifiers — just the semantic content needed for retrieval plus names where required for search accuracy.
- Transfer protection: Tokenization at query time means no real personal data is included in API calls to your LLM provider. The provider receives only opaque tokens that cannot be reversed without access to your Blindfold mapping.
- EU region processing: Blindfold offers EU-region API endpoints so that the tokenization and detokenization steps themselves happen within the EU. Combined with the `gdpr_eu` policy, this ensures that the full pipeline respects data residency requirements.
- Audit trail: Every redaction and tokenization operation is logged with entity types detected, timestamps, and session identifiers. This gives your Data Protection Officer the evidence they need to demonstrate compliance during audits or in response to Data Subject Access Requests.
```python
# Use the GDPR policy for EU-specific entity detection
bf = blindfold.Blindfold(
    api_key="your-blindfold-api-key",
    region="eu",  # Process data within the EU
)

# Redact with GDPR policy — detects EU-specific entities like IBAN, national IDs
result = bf.redact(document, policy="gdpr_eu")

# Tokenize with GDPR policy at query time
token_result = bf.tokenize(combined_prompt, policy="gdpr_eu")
```
With the `gdpr_eu` policy, Blindfold detects EU-specific entity types such as IBAN codes, national identity numbers, and EU tax identifiers in addition to the standard PII categories. This gives you broader coverage for European data without any additional configuration.
Advanced: Tokenize with Stored Mapping
The approach above works well for most pipelines, but for maximum security you may want to avoid storing any real PII in your vector database at all — not even names. In this advanced architecture, you tokenize documents at ingestion time, store the tokenized text in your vector store, and save the token mappings alongside each document. At query time, you build a reverse lookup from the stored mappings so that you can detokenize the final response.
This approach trades some search accuracy for complete PII elimination in the vector store. You will need to tokenize the user's query as well (since the stored documents use tokens instead of real names), and you must merge the ingestion and query mappings carefully to ensure correct detokenization.
```python
import json

import blindfold
import chromadb
from openai import OpenAI

bf = blindfold.Blindfold(api_key="your-blindfold-api-key")
openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./vectorstore")
collection = chroma.get_or_create_collection("fully_tokenized")

# --- Ingestion: tokenize and store mapping ---
raw_docs = [
    "Customer Sarah Chen (sarah.chen@acme.com) reported a billing error.",
    "John Martinez (SSN 412-55-6789) requested a refund.",
]

for i, doc in enumerate(raw_docs):
    tok = bf.tokenize(doc)
    # Store tokenized text + save the mapping as metadata
    collection.add(
        documents=[tok.text],
        metadatas=[{"mapping": json.dumps(tok.mapping)}],
        ids=[f"doc_{i}"],
    )

# --- Query: build reverse lookup from stored mappings ---
def ask(question: str) -> str:
    # Tokenize the question (vector store has tokenized text)
    q_tok = bf.tokenize(question)

    # Search with tokenized query
    hits = collection.query(query_texts=[q_tok.text], n_results=3)
    context = "\n".join(hits["documents"][0])

    # Merge mappings: query tokens + all retrieved doc tokens
    merged_mapping = {**q_tok.mapping}
    for meta in hits["metadatas"][0]:
        doc_mapping = json.loads(meta["mapping"])
        merged_mapping.update(doc_mapping)

    # Send tokenized prompt to LLM
    prompt = f"Context:\n{context}\n\nQuestion: {q_tok.text}"
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using the context provided."},
            {"role": "user", "content": prompt},
        ],
    )

    # Detokenize with merged mapping
    return bf.detokenize(resp.choices[0].message.content, merged_mapping).text
```
Trade-off: This approach eliminates all PII from your vector store, but independent tokenize calls produce separate token numbering. The same name may map to different tokens across documents, which reduces search accuracy for name-based queries. Worse, two documents can assign the same token (say, PERSON_1) to different people; in the merge above, later mappings overwrite earlier ones, so detokenization can substitute the wrong person when retrieved documents collide. Use this approach when your threat model requires zero PII at rest and you can rely on content-based search rather than name matching.
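The overwrite hazard in the merge step can be reproduced in a few lines (toy mappings below, with the token-to-value shape assumed to match what a tokenize call returns):

```python
# Two documents tokenized by independent calls, each starting at PERSON_1
doc0_map = {"PERSON_1": "Sarah Chen"}
doc1_map = {"PERSON_1": "John Martinez"}

# A naive dict merge silently drops Sarah Chen's entry
merged = {**doc0_map}
merged.update(doc1_map)
print(merged["PERSON_1"])  # John Martinez
```

Any PERSON_1 in text that came from the first document would now detokenize to the wrong person, which is why this architecture works best when name fidelity in responses is not a requirement.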
Try It Yourself
Ready to build your own PII-safe RAG pipeline? Here are the resources to get started:
- Quickstart guide — Get your API key and make your first redaction call in under two minutes.
- Strategy comparison example — Run all three ingestion strategies (selective redact, stored mapping, consistent registry) side by side. Also available in TypeScript.
- Consistent registry example — Zero PII in the vector store with consistent tokens for perfect name-based search. Also available in TypeScript.
- RAG Pipeline Protection Guide — Full documentation with code examples for every framework and strategy.
- API reference — Full documentation for `redact`, `tokenize`, `detokenize`, and `encrypt`.
- Python SDK on PyPI — Install with `pip install blindfold`.
The entire setup takes about fifteen minutes. Start by installing the SDK, run the ingestion script to populate your vector database with selectively redacted documents, and then wire up the query function with the search-first, single-tokenize approach. From that point on, every query through your RAG pipeline is PII-safe by default.
Start protecting sensitive data
Free plan includes 500K characters/month. No credit card required.