How to Remove PII Before Sending Data to Your LLM
Learn how to strip personally identifiable information from prompts before sending them to OpenAI, Anthropic, or any LLM. Covers regex, NER models, and API-based tokenization with working Python and JavaScript code.
Every prompt you send to OpenAI, Anthropic, or Google contains the full text your application passes in. If that text includes a customer's name, email address, phone number, or Social Security number, the LLM provider receives it, logs it, and may retain it according to its data retention policy. Depending on your agreement and API tier, that data could be used for model training, abuse monitoring, or stored for up to 30 days.
For most production applications, this is a problem. Your users did not consent to having their personal data sent to a third-party AI provider. Your compliance team did not sign off on it. And if you are subject to GDPR, HIPAA, PCI DSS, or CCPA, you may already be in violation.
The fix is straightforward: remove personally identifiable information (PII) from the text before it leaves your application, send the sanitized version to the LLM, then restore the original values in the response. This guide covers why this matters, the available approaches, and how to implement it with working code.
What Counts as PII in AI Prompts?
PII is any information that can identify an individual, either on its own or when combined with other data. In the context of LLM prompts, the most common types include:
- Names — full names, first names, last names
- Email addresses — personal and corporate
- Phone numbers — mobile, landline, international formats
- Physical addresses — street addresses, postal codes, cities
- Government identifiers — Social Security numbers, passport numbers, driver's license numbers, national IDs
- Financial data — credit card numbers, IBANs, bank account numbers
- Health information — medical record numbers, diagnoses, prescription details
- Dates of birth — often combined with other fields to re-identify individuals
The problem is not hypothetical. PII ends up in LLM prompts constantly:
- Customer support: "Summarize the issue for John Smith at john.smith@company.com, account #4532-1234-5678-9012."
- Document summarization: A contract or medical record is pasted directly into the prompt for analysis.
- Code review: A developer asks the LLM to review code that contains hardcoded test credentials or database connection strings with real usernames.
- RAG pipelines: Retrieved documents contain customer data that gets injected into the context window alongside the user's query.
The core risk: Once PII reaches the LLM provider, you have lost control over it. Even with zero-data-retention API tiers, the data traverses the provider's network and may be temporarily stored in logs or memory. The only way to guarantee PII does not reach the provider is to remove it before the API call.
Three Approaches to PII Removal
There are three common strategies for removing PII from text before it reaches an LLM. They differ significantly in accuracy, maintenance burden, and language support.
1. Regex and Pattern Matching
The simplest approach: write regular expressions to match emails, phone numbers, SSNs, and other structured formats. This works reasonably well for data with rigid patterns (credit card numbers, email addresses) but falls apart for unstructured entities like names and addresses.
- Pros: No dependencies, fast execution, easy to understand.
- Cons: Misses names entirely. Fails on non-US phone formats. Cannot handle context ("Jordan" the person vs. "Jordan" the country). Requires constant maintenance as edge cases appear.
2. NER Models (spaCy, Presidio)
Named Entity Recognition (NER) models like spaCy or Microsoft Presidio use machine learning to detect entities in text. They handle names, addresses, and other context-dependent entities much better than regex.
- Pros: Good accuracy for English names and common entity types. Open source. Can be self-hosted.
- Cons: Requires ML infrastructure (model hosting, GPU for production throughput). Multi-language support requires separate models per language. You own the maintenance, updates, and accuracy monitoring. Integration with the tokenize/detokenize flow requires custom code.
3. Managed API (Blindfold)
A managed PII detection and tokenization API handles the detection, replacement, and restoration of PII as a service. You send text in, get tokenized text back, and use a token map to restore originals later.
- Pros: No ML infrastructure to manage. 18+ languages supported out of the box. Built-in compliance policies (GDPR, HIPAA, PCI DSS). Tokenize/detokenize flow is a first-class feature with SDKs for Python and JavaScript.
- Cons: Requires sending text to the Blindfold API (though PII stays within your chosen region and is never used for training). Introduces an external dependency.
Which should you choose? Regex is fine for a quick prototype where you only need to catch emails and credit card numbers. NER models are a good fit if you have existing ML infrastructure and only need English. A managed API like Blindfold is the fastest path to production if you need multi-language support, compliance policies, or do not want to maintain detection models.
Implementation with Blindfold
The tokenize/detokenize pattern works in three steps: detect and replace PII with placeholder tokens, send the tokenized text to the LLM, then restore the originals in the response. Here is the complete flow in Python.
Python
```python
from blindfold import Blindfold
from openai import OpenAI

# Initialize clients
bf = Blindfold(api_key="your-blindfold-api-key")
openai_client = OpenAI()

# User input that contains PII
user_message = (
    "Please review the account for Sarah Chen, email sarah.chen@example.com, "
    "SSN 123-45-6789. Her last payment of $4,200 was on 2025-01-15."
)

# Step 1: Tokenize — replace PII with safe placeholder tokens
tokenized = bf.tokenize(text=user_message)

# tokenized.text now contains:
# "Please review the account for <Person_1>, email <Email Address_1>,
#  SSN <Social Security Number_1>. Her last payment of $4,200 was on
#  <Date Of Birth_1>."

# Step 2: Send tokenized text to OpenAI — no PII leaves your app
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": tokenized.text}],
)
llm_output = response.choices[0].message.content

# Step 3: Detokenize — restore original PII in the response
final = bf.detokenize(
    text=llm_output,
    token_map=tokenized.token_map,
)

print(final.text)
# The response now contains "Sarah Chen", "sarah.chen@example.com", etc.
# The LLM never saw any of the real values.
```
JavaScript / Node.js
```javascript
import { Blindfold } from "@blindfold/sdk";
import OpenAI from "openai";

// Initialize clients
const bf = new Blindfold({ apiKey: "your-blindfold-api-key" });
const openai = new OpenAI();

// User input containing PII
const userMessage =
  "Summarize the case for David Park, phone +1-555-234-5678, policy #HC-2025-98432.";

// Step 1: Tokenize
const tokenized = await bf.tokenize({ text: userMessage });

// Step 2: Send to OpenAI
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: tokenized.text }],
});
const llmOutput = response.choices[0].message.content;

// Step 3: Detokenize — restore PII in the response
const final = await bf.detokenize({
  text: llmOutput,
  tokenMap: tokenized.tokenMap,
});

console.log(final.text);
```
Both examples follow the same pattern: tokenize, call the model, detokenize. The LLM sees tokens like <Person_1> and <Email Address_1> instead of real personal data. The token map stores the relationship between each token and its original value, so detokenization can restore them accurately.
What About the Response?
A common concern: if the LLM receives tokens like <Person_1> in the prompt, will it include those same tokens in its response? The answer is yes, and that is exactly what you want.
LLMs are good at maintaining context. If the input mentions <Person_1>, the model will refer to that entity as <Person_1> throughout its response. When you pass the response through detokenization, every occurrence of <Person_1> is replaced with the original name. The user sees a coherent, personalized response. The LLM never saw the actual data.
Here is how the mapping works conceptually:
```python
# The token map returned by tokenize()
tokenized.token_map = {
    "<Person_1>": "Sarah Chen",
    "<Email Address_1>": "sarah.chen@example.com",
    "<Social Security Number_1>": "123-45-6789",
}

# LLM response (contains tokens, not real data):
# "The account for <Person_1> (<Email Address_1>) shows..."

# After detokenize():
# "The account for Sarah Chen (sarah.chen@example.com) shows..."
```
Token format matters: Blindfold uses Title Case tokens like <Person_1> and <Email Address_1> rather than generic placeholders like [REDACTED]. This gives the LLM semantic context about the entity type, which leads to more coherent responses. The model understands it is referring to a person, an email, or an address, even without knowing the actual value.
Batch Processing
When your application needs to process multiple texts at once — RAG pipelines injecting retrieved documents, batch analysis jobs, processing queues of customer messages — sending individual requests for each text is inefficient. Blindfold supports batch tokenization via the texts parameter, which accepts an array of strings in a single API call.
```python
from blindfold import Blindfold

bf = Blindfold(api_key="your-blindfold-api-key")

# Multiple texts from a RAG pipeline or batch job
documents = [
    "Patient Maria Garcia, DOB 03/15/1988, diagnosed with Type 2 diabetes.",
    "Invoice #1042 for James Wilson, card ending 4532, total $1,240.00.",
    "Support ticket from anna.kowalski@example.pl regarding account access.",
]

# Tokenize all documents in a single API call
results = bf.tokenize(texts=documents)

# Each result has its own tokenized text and token map
for result in results:
    print(result.text)
# "Patient <Person_1>, DOB <Date Of Birth_1>, diagnosed with Type 2 diabetes."
# "Invoice #1042 for <Person_1>, card ending <Credit Card Number_1>, ..."
# "Support ticket from <Email Address_1> regarding account access."
```
Batch tokenization uses a single HTTP round trip for all texts, which significantly reduces latency compared to individual calls. Each result in the response array has its own token_map, so you can detokenize each document independently after the LLM processes it.
Choosing a Detection Policy
By default, Blindfold detects a broad set of PII entity types. But if your application has specific compliance requirements, you can use a pre-built detection policy that targets the entity types relevant to a particular regulation. This ensures you are not over- or under-detecting for your use case.
| Policy | Focus | When to Use |
|---|---|---|
| basic | Names, emails, phone numbers | General-purpose PII removal. Good default for most applications. |
| gdpr_eu | All EU-relevant PII: names, emails, IBANs, national IDs, dates of birth | Applications processing EU resident data. Required for GDPR compliance. |
| hipaa_us | PHI: names, medical records, SSNs, dates, geographic identifiers | Healthcare applications. Aligns with HIPAA Safe Harbor de-identification. |
| pci_dss | Cardholder data: credit card numbers, CVVs, expiration dates | Payment processing, financial applications handling card data. |
| strict | All entity types: maximum detection coverage | High-security environments where any PII leakage is unacceptable. |
To apply a policy, pass the policy parameter to the tokenize call:
```python
# GDPR-compliant tokenization for EU user data
tokenized = bf.tokenize(
    text=user_message,
    policy="gdpr_eu",
)

# HIPAA-compliant tokenization for healthcare data
tokenized = bf.tokenize(
    text=patient_record,
    policy="hipaa_us",
)

# Maximum detection — catches everything
tokenized = bf.tokenize(
    text=sensitive_document,
    policy="strict",
)
```
Policies can also be combined with regional processing. If you set region="eu" when initializing the SDK, all API calls route to eu-api.blindfold.dev, ensuring your PII is processed and stored exclusively within EU infrastructure. Similarly, region="us" routes to US infrastructure.
Beyond OpenAI: Works with Any LLM
The tokenize/detokenize pattern is provider-agnostic. Because you are modifying the text before and after the LLM call, it works with any model or provider:
- OpenAI (GPT-4o, GPT-4, GPT-3.5)
- Anthropic (Claude)
- Google (Gemini)
- Open-source models (Llama, Mistral, etc. via vLLM, Ollama, or any OpenAI-compatible API)
- Frameworks (LangChain, LlamaIndex, CrewAI)
The only requirement is that the LLM can accept text input and produce text output. The tokenization layer sits between your application and the model, regardless of which model you use. If you switch from OpenAI to Anthropic tomorrow, your PII protection continues to work without any changes.
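One way to see the provider-agnostic claim concretely: the protection layer can be written as a wrapper around any text-in/text-out callable. The sketch below is plain Python with a toy tokenizer and a stub "model" standing in for a real provider; none of the function names come from any SDK.

```python
from typing import Callable, Dict, Tuple

def protected_completion(
    text: str,
    tokenize: Callable[[str], Tuple[str, Dict[str, str]]],
    call_llm: Callable[[str], str],
) -> str:
    """Tokenize, call any text-in/text-out model, restore originals."""
    tokenized_text, token_map = tokenize(text)
    output = call_llm(tokenized_text)  # OpenAI, Claude, Llama: anything
    for token, original in token_map.items():
        output = output.replace(token, original)
    return output

# Toy stand-ins so the wrapper can run without any provider
def toy_tokenize(text: str) -> Tuple[str, Dict[str, str]]:
    token_map = {"<Person_1>": "Sarah Chen"}
    for token, original in token_map.items():
        text = text.replace(original, token)
    return text, token_map

result = protected_completion(
    "Write a greeting for Sarah Chen.",
    tokenize=toy_tokenize,
    call_llm=lambda prompt: "Hello! Here is a note for " + prompt.split("for ")[1],
)
print(result)  # Hello! Here is a note for Sarah Chen.
```

Swapping providers means swapping the `call_llm` callable; the tokenize and detokenize steps never change.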
Common Questions
Does tokenization affect LLM output quality?
Minimally. LLMs operate on context and structure, not on the specific identity of a person or their email address. When the model sees "Schedule a meeting with <Person_1> at <Email Address_1>", it understands the intent just as well as if it saw the real name and email. The semantic token format preserves the entity type information that the model needs to generate a coherent response.
What about PII in system prompts?
If your system prompt contains PII (for example, pre-loading user profile data for personalization), tokenize the system prompt too. You can tokenize any text string, not just the user message. Use the same token map for detokenization so that tokens are consistent across the entire conversation.
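The key point is that a single token map can cover every message in the conversation, so <Person_1> means the same thing in the system prompt, the user message, and the response. A plain-Python illustration of the idea (the real tokenize/detokenize calls would go through the SDK):

```python
# One shared map keeps each token consistent across all messages
token_map = {
    "<Person_1>": "Sarah Chen",
    "<Email Address_1>": "sarah.chen@example.com",
}

system_prompt = "You assist <Person_1>. Their email is <Email Address_1>."
user_message = "Draft a password-reset note for <Person_1>."
llm_output = "Hi <Person_1>, a reset link was sent to <Email Address_1>."

def detokenize(text: str, token_map: dict) -> str:
    """Restore originals; works on any message in the conversation."""
    for token, original in token_map.items():
        text = text.replace(token, original)
    return text

print(detokenize(llm_output, token_map))
# "Hi Sarah Chen, a reset link was sent to sarah.chen@example.com."
```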
What about streaming responses?
For streaming LLM responses, buffer the output until you have complete tokens before detokenizing. A token like <Person_1> may arrive across multiple chunks. Accumulate the stream, then detokenize the complete response at the end, or implement a buffering strategy that detokenizes when a complete token boundary is detected.
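A sketch of the buffering idea: emit text as it arrives, but hold back anything that might be the start of an unfinished token. This is illustrative plain Python, not SDK code.

```python
def stream_detokenize(chunks, token_map):
    """Yield detokenized text, holding back possible partial tokens."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Restore any complete tokens now present in the buffer
        for token, original in token_map.items():
            buffer = buffer.replace(token, original)
        # If the tail could be the start of an unfinished token, keep it
        cut = buffer.rfind("<")
        if cut != -1 and ">" not in buffer[cut:]:
            yield buffer[:cut]
            buffer = buffer[cut:]
        else:
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # flush whatever remains at end of stream

# Token <Person_1> arrives split across two chunks
out = "".join(stream_detokenize(
    ["Hello <Per", "son_1>, welcome!"],
    {"<Person_1>": "Sarah Chen"},
))
print(out)  # Hello Sarah Chen, welcome!
```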
Try It Yourself
Blindfold's free tier includes 1M characters per month, which is enough to build and test a complete integration. No credit card required.
- Create a free account at app.blindfold.dev
- Install the SDK: `pip install blindfold-sdk` (Python) or `npm install @blindfold/sdk` (JavaScript)
- Follow the documentation to add tokenize/detokenize to your LLM pipeline
Or clone a working example from the cookbook:
- OpenAI + Blindfold (Python) — complete tokenize/detokenize flow with GPT-4o
- All cookbook examples — OpenAI, LangChain, FastAPI, Express, and more