Privacy · March 3, 2026 · 8 min read

Stop Leaking Customer Data to LLMs — A Developer's Guide

Every LLM API call can log your input. If it contains names, emails, or SSNs, you just sent PII to a third party. Here's a 60-second fix with working Python code, local mode included, no API key required.

Every time we call OpenAI, Anthropic, or any other LLM API, the input can be logged and retained on their servers. If that input contains a customer's name, email, SSN, or credit card number, we just sent PII to a third party.

The Numbers Nobody Wants to See

  • GDPR fines in 2024: €2.1 billion
  • Average data breach lawsuit settlement: $3.8 million
  • HIPAA violation penalty range: $100 to $50,000 per record

And it's not just fines. One customer complaint to a data protection authority triggers an investigation. If they find we've been piping raw personal data to OpenAI's API — with no safeguards, no data processing agreement, no anonymization — that's not a good day.

Most AI apps handle exactly this kind of data: support tickets, medical intake, financial queries, HR documents. The LLM needs context to be useful, but that context is full of PII.

The Simple Fix

Replace PII with tokens before it hits the model. Restore the originals in the output. The model never sees real data, the response is still complete.

text
Input:  "John Doe (john@acme.com) needs a refund"
    ↓ tokenize
Safe:   "<Person_1> (<Email Address_1>) needs a refund"
    ↓ LLM processes it
Output: "I've processed <Person_1>'s refund request"
    ↓ detokenize
Final:  "I've processed John Doe's refund request"

Three steps. The model works with placeholders, the output reads naturally, and customer data stays on our side.
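The round trip is easy to picture with a toy implementation. The sketch below is not the Blindfold SDK, just the pattern it automates, shown for emails only: a regex finds the PII, a dict maps numbered placeholders to originals, and detokenization is a reverse substitution.

```python
import re

def tokenize(text: str) -> tuple[str, dict[str, str]]:
    """Replace email addresses with numbered placeholders; return safe text + mapping."""
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        token = f"<Email Address_{len(mapping) + 1}>"
        mapping[token] = match.group(0)
        return token

    safe = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", repl, text)
    return safe, mapping

def detokenize(text: str, mapping: dict[str, str]) -> str:
    """Substitute the original values back in."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

safe, mapping = tokenize("John Doe (john@acme.com) needs a refund")
print(safe)  # "John Doe (<Email Address_1>) needs a refund"

# The LLM only ever echoes the token; restoring it is a lookup:
llm_output = "I've processed the refund for <Email Address_1>"
print(detokenize(llm_output, mapping))
# "I've processed the refund for john@acme.com"
```

The real SDK does the same thing across many entity types and with far more robust detection; the point here is only that the mapping never leaves our process.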

Implementation (60 Seconds)

Install the SDK:

bash
pip install blindfold-sdk

Run — no API key needed, works offline:

python
from blindfold import Blindfold

bf = Blindfold()

text = "Contact us at sarah@example.com or 555-867-5309. SSN: 123-45-6789"

# Detect PII
detected = bf.detect(text)
for entity in detected.detected_entities:
    print(f"{entity.type}: '{entity.text}' (score: {entity.score:.2f})")
# Email Address: 'sarah@example.com' (score: 0.95)
# Phone Number: '555-867-5309' (score: 0.90)
# Social Security Number: '123-45-6789' (score: 1.00)

# Tokenize
result = bf.tokenize(text)
print(result.text)
# "Contact us at <Email Address_1> or <Phone Number_1>. SSN: <Social Security Number_1>"

# Detokenize — restore originals
original = bf.detokenize(result.text, result.mapping)
print(original.text)
# "Contact us at sarah@example.com or 555-867-5309. SSN: 123-45-6789"

That's local mode. No API key, no network calls, nothing leaves the machine. 86 regex detectors, 80+ entity types, 30+ countries.

Plug It Into Any LLM

Here's the full pattern with OpenAI. For detecting names, addresses, and other context-dependent entities, set a Blindfold API key to enable AI-powered detection on top of the regex layer.

bash
export BLINDFOLD_API_KEY="your-blindfold-api-key"
export OPENAI_API_KEY="your-openai-api-key"

python
from blindfold import Blindfold
from openai import OpenAI

bf = Blindfold()
llm = OpenAI()

user_input = (
    "Write a follow-up email to John Doe at john@example.com "
    "about his refund for order #1234."
)

# 1. Tokenize
tokenized = bf.tokenize(user_input, policy="basic")
# "<Person_1> at <Email Address_1>..." — no real data

# 2. Send safe text to LLM
response = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a customer support assistant."},
        {"role": "user", "content": tokenized.text},
    ],
)
llm_output = response.choices[0].message.content

# 3. Restore originals
result = bf.detokenize(llm_output, tokenized.mapping)
print(result.text)
# The email now reads "John Doe" and "john@example.com" — not tokens

Swap openai for anthropic, mistralai, cohere, or a local model on Ollama. Blindfold doesn't care what's in the middle.
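That provider-agnostic shape can be captured in one wrapper. This is a sketch of the pattern, not SDK code: call_llm is any function from prompt to completion (OpenAI, Anthropic, Ollama, or the stand-ins below), and the wrapper guarantees it only ever sees tokens.

```python
from typing import Callable

def protected_call(
    user_input: str,
    call_llm: Callable[[str], str],
    tokenize: Callable[[str], tuple[str, dict[str, str]]],
    detokenize: Callable[[str, dict[str, str]], str],
) -> str:
    """Tokenize -> LLM -> detokenize; the model never sees originals."""
    safe, mapping = tokenize(user_input)
    llm_output = call_llm(safe)  # any provider goes here
    return detokenize(llm_output, mapping)

# Stand-ins so the sketch runs without any SDK or API key:
def fake_tokenize(text):
    mapping = {"<Person_1>": "John Doe"}
    return text.replace("John Doe", "<Person_1>"), mapping

def fake_detokenize(text, mapping):
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

def fake_llm(prompt):
    return "Drafted a follow-up for <Person_1>."

print(protected_call("Email John Doe about his refund.", fake_llm, fake_tokenize, fake_detokenize))
# "Drafted a follow-up for John Doe."
```

Swapping providers means swapping only the call_llm argument; the protection layer is untouched.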

What Gets Detected

Local mode (regex, no API key)

  • Email addresses
  • Phone numbers (international formats)
  • Social Security Numbers
  • Credit card numbers (with Luhn validation)
  • IBANs (30+ countries)
  • IP addresses (v4 and v6)
  • Dates of birth
  • Passport numbers, driver's licenses
  • Tax IDs, VAT numbers
  • URLs, cryptocurrency addresses
  • 80+ entity types total
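The Luhn validation mentioned for card numbers is what keeps false positives down: most random 16-digit strings fail the checksum. A standalone sketch of the check (illustrative, not Blindfold's internal code):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True  (standard Visa test number)
print(luhn_valid("4111 1111 1111 1112"))  # False (one digit off breaks the checksum)
```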

Cloud API (AI + regex)

Everything above, plus:

  • Person names (any language)
  • Company names
  • Physical addresses
  • Medical terms and conditions
  • Context-dependent entities

Local vs. cloud: Local mode handles structured patterns. The cloud API adds AI-based detection for things regex can't catch — like knowing "Springfield" is an address in one sentence and a company name in another.

Not Just Redaction

Most tools only redact. Blindfold gives us six options:

  • tokenize <Person_1> — Send to LLM, restore later. Reversible.
  • redact [REDACTED] — Gone forever. For logs, indexes, storage.
  • mask J*** D** — Show partial data to end users.
  • hash a1b2c3d4... — Analytics and dedup without exposing data.
  • synthesize Jane Smith — Realistic fake data for testing.
  • encrypt enc:x8f2k... — Reversible with a key. For compliance archives.

Redaction is fine for logs. But for LLM pipelines, we need tokenize — the model works with placeholders, and we restore the originals after. One-way redaction breaks the workflow.
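Masking and hashing are the easiest of the six to reason about in isolation. A rough sketch of the two transforms, with output formats chosen here for illustration (Blindfold's exact formats may differ):

```python
import hashlib

def mask(value: str) -> str:
    """Keep the first character of each word, star out the rest."""
    return " ".join(w[0] + "*" * (len(w) - 1) for w in value.split())

def hash_pii(value: str) -> str:
    """Stable one-way digest: same input, same token, so dedup and joins work.
    For real compliance use a keyed hash or salt to resist dictionary attacks
    on low-entropy PII like phone numbers."""
    return hashlib.sha256(value.encode()).hexdigest()[:8]

print(mask("John Doe"))           # "J*** D**"
print(hash_pii("john@acme.com"))  # same 8-hex-char prefix on every run
```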

Built-In Compliance Policies

Instead of manually listing which entity types to detect, pick a policy:

| Policy | Targets | Best For |
| --- | --- | --- |
| basic | Names, emails, phones, locations | General apps |
| gdpr_eu | + IBANs, addresses, dates of birth | EU compliance |
| hipaa_us | + SSNs, MRNs, medical terms | Healthcare |
| pci_dss | + Card numbers, CVVs, expiry dates | Payment processing |
| strict | All entity types, lower threshold | Maximum coverage |

One parameter. Done.

python
result = bf.tokenize(text, policy="hipaa_us")

Why Not Build It Yourself?

I tried. Here's what happens:

  1. We start with a regex for emails. Easy.
  2. Add phone numbers. Now we need 20+ international formats.
  3. Credit cards need Luhn validation. SSNs need range checks.
  4. IBANs are different for every country. 30+ formats.
  5. Person names? Regex can't do that. Now we need NLP.
  6. Six months later, we have a fragile pile of regexes that misses half the edge cases and we're maintaining it instead of building our product.
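Step 1 already hides sharp edges. A naive email regex looks fine on toy inputs and quietly fails on ordinary real-world addresses:

```python
import re

NAIVE_EMAIL = re.compile(r"^\w+@\w+\.\w+$")

print(bool(NAIVE_EMAIL.match("sarah@example.com")))
# True: looks done

print(bool(NAIVE_EMAIL.match("john.doe+tag@mail.example.co.uk")))
# False: \w+ can't handle dots, plus-tags, or subdomains
```

And that is the easiest entity type. Phone numbers, IBANs, and names only get worse from here.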

Blindfold ships 86 regex detectors covering 80+ entity types out of the box. The cloud API adds AI detection for names and context-dependent entities. It's tested, it's maintained, and it's not our problem.

Data Residency

For regulated industries, where data is processed matters:

  • region="eu" → Frankfurt, Germany
  • region="us" → Virginia, US

python
bf = Blindfold(region="eu")

Data stays in the region. No transatlantic transfers. GDPR auditors stop asking questions.

It's Not Just Python

  • JavaScript/TypeScript: npm install @blindfold/sdk
  • Go: go get github.com/blindfold-dev/blindfold-go
  • Java: Maven Central dev.blindfold:blindfold-sdk
  • .NET: NuGet Blindfold.Sdk
  • CLI: npm install -g @blindfold/cli
  • MCP Server: npm install @blindfold/mcp-server (for AI agents)
  • LangChain: pip install langchain-blindfold

Same API, same policies, same token format across all of them.

Pricing Reality

  • Local mode: Free. Forever. No API key. No limits.
  • Cloud API free tier: 500K characters/month. No credit card.
  • Paid plans: Usage-based. Pay for what we use.

No $200/month minimums. No enterprise-only tiers to get basic features. Local mode alone covers most use cases — the cloud API is there when we need AI-powered name and address detection.

Get Started

python
from blindfold import Blindfold

bf = Blindfold()
result = bf.tokenize("Call me at 555-0123, my SSN is 123-45-6789")
print(result.text)
# "Call me at <Phone Number_1>, my SSN is <Social Security Number_1>"

Three lines. No API key. Run it right now.

Start protecting sensitive data

Free plan includes 500K characters/month. No credit card required.