Privacy · March 3, 2026 · 8 min read

Stop Leaking Customer Data to LLMs — A Developer's Guide

Every LLM API call can log your input. If it contains names, emails, or SSNs, you just sent PII to a third party. Here's a 60-second fix with working Python code, local mode included, no API key required.

Every time we call OpenAI, Anthropic, or any other LLM API, the input can be logged and retained on their servers. If that input contains a customer's name, email, SSN, or credit card number, we just sent PII to a third party.

The Numbers Nobody Wants to See

  • GDPR fines in 2024: €2.1 billion
  • Average data breach lawsuit settlement: $3.8 million
  • HIPAA violation penalty range: $100 to $50,000 per record

And it's not just fines. One customer complaint to a data protection authority triggers an investigation. If they find we've been piping raw personal data to OpenAI's API — with no safeguards, no data processing agreement, no anonymization — that's not a good day.

Most AI apps handle exactly this kind of data: support tickets, medical intake, financial queries, HR documents. The LLM needs context to be useful, but that context is full of PII.

The Simple Fix

Replace PII with tokens before it hits the model. Restore the originals in the output. The model never sees real data, the response is still complete.

text
Input:  "John Doe (john@acme.com) needs a refund"
    ↓ tokenize
Safe:   "<Person_1> (<Email Address_1>) needs a refund"
    ↓ LLM processes it
Output: "I've processed <Person_1>'s refund request"
    ↓ detokenize
Final:  "I've processed John Doe's refund request"

Three steps. The model works with placeholders, the output reads naturally, and customer data stays on our side.
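The round trip is easy to picture with a toy implementation. The sketch below is not the Blindfold SDK, just the pattern it automates, shown for emails only: a regex finds the PII, a dict maps numbered placeholders to originals, and detokenization is a reverse substitution.

```python
import re

def tokenize(text: str) -> tuple[str, dict[str, str]]:
    """Replace email addresses with numbered placeholders; return safe text + mapping."""
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        token = f"<Email Address_{len(mapping) + 1}>"
        mapping[token] = match.group(0)
        return token

    safe = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", repl, text)
    return safe, mapping

def detokenize(text: str, mapping: dict[str, str]) -> str:
    """Substitute the original values back in."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

safe, mapping = tokenize("John Doe (john@acme.com) needs a refund")
print(safe)  # "John Doe (<Email Address_1>) needs a refund"

# The LLM only ever echoes the token; restoring it is a lookup:
llm_output = "I've processed the refund for <Email Address_1>"
print(detokenize(llm_output, mapping))
# "I've processed the refund for john@acme.com"
```

The real SDK does the same thing across many entity types and with far more robust detection; the point here is only that the mapping never leaves our process.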

Implementation (60 Seconds)

Install the SDK:

bash
pip install blindfold-sdk

Run — no API key needed, works offline:

python
from blindfold import Blindfold

bf = Blindfold()

text = "Contact us at sarah@example.com or 555-867-5309. SSN: 123-45-6789"

# Detect PII
detected = bf.detect(text)
for entity in detected.detected_entities:
    print(f"{entity.type}: '{entity.text}' (score: {entity.score:.2f})")
# Email Address: 'sarah@example.com' (score: 0.95)
# Phone Number: '555-867-5309' (score: 0.90)
# Social Security Number: '123-45-6789' (score: 1.00)

# Tokenize
result = bf.tokenize(text)
print(result.text)
# "Contact us at <Email Address_1> or <Phone Number_1>. SSN: <Social Security Number_1>"

# Detokenize — restore originals
original = bf.detokenize(result.text, result.mapping)
print(original.text)
# "Contact us at sarah@example.com or 555-867-5309. SSN: 123-45-6789"

That's local mode. No API key, no network calls, nothing leaves the machine. 86 regex detectors, 80+ entity types, 30+ countries.

Plug It Into Any LLM

Here's the full pattern with OpenAI. For detecting names, addresses, and other context-dependent entities, set a Blindfold API key to enable AI-powered detection on top of the regex layer.

bash
export BLINDFOLD_API_KEY="your-blindfold-api-key"
export OPENAI_API_KEY="your-openai-api-key"

python
from blindfold import Blindfold
from openai import OpenAI

bf = Blindfold()
llm = OpenAI()

user_input = (
    "Write a follow-up email to John Doe at john@example.com "
    "about his refund for order #1234."
)

# 1. Tokenize
tokenized = bf.tokenize(user_input, policy="basic")
# "<Person_1> at <Email Address_1>..." — no real data

# 2. Send safe text to LLM
response = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a customer support assistant."},
        {"role": "user", "content": tokenized.text},
    ],
)
llm_output = response.choices[0].message.content

# 3. Restore originals
result = bf.detokenize(llm_output, tokenized.mapping)
print(result.text)
# The email now reads "John Doe" and "john@example.com" — not tokens

Swap openai for anthropic, mistralai, cohere, or a local model on Ollama. Blindfold doesn't care what's in the middle.
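That provider-agnostic shape can be captured in one wrapper. This is a sketch of the pattern, not SDK code: call_llm is any function from prompt to completion (OpenAI, Anthropic, Ollama, or the stand-ins below), and the wrapper guarantees it only ever sees tokens.

```python
from typing import Callable

def protected_call(
    user_input: str,
    call_llm: Callable[[str], str],
    tokenize: Callable[[str], tuple[str, dict[str, str]]],
    detokenize: Callable[[str, dict[str, str]], str],
) -> str:
    """Tokenize -> LLM -> detokenize; the model never sees originals."""
    safe, mapping = tokenize(user_input)
    llm_output = call_llm(safe)  # any provider goes here
    return detokenize(llm_output, mapping)

# Stand-ins so the sketch runs without any SDK or API key:
def fake_tokenize(text):
    mapping = {"<Person_1>": "John Doe"}
    return text.replace("John Doe", "<Person_1>"), mapping

def fake_detokenize(text, mapping):
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

def fake_llm(prompt):
    return "Drafted a follow-up for <Person_1>."

print(protected_call("Email John Doe about his refund.", fake_llm, fake_tokenize, fake_detokenize))
# "Drafted a follow-up for John Doe."
```

Swapping providers means swapping only the call_llm argument; the protection layer is untouched.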

What Gets Detected

Local mode (regex, no API key)

  • Email addresses
  • Phone numbers (international formats)
  • Social Security Numbers
  • Credit card numbers (with Luhn validation)
  • IBANs (30+ countries)
  • IP addresses (v4 and v6)
  • Dates of birth
  • Passport numbers, driver's licenses
  • Tax IDs, VAT numbers
  • URLs, cryptocurrency addresses
  • 80+ entity types total
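The Luhn validation mentioned for card numbers is what keeps false positives down: most random 16-digit strings fail the checksum. A standalone sketch of the check (illustrative, not Blindfold's internal code):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True  (standard Visa test number)
print(luhn_valid("4111 1111 1111 1112"))  # False (one digit off breaks the checksum)
```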

Cloud API (AI + regex)

Everything above, plus:

  • Person names (any language)
  • Company names
  • Physical addresses
  • Medical terms and conditions
  • Context-dependent entities

Local vs. cloud: Local mode handles structured patterns. The cloud API adds AI-based detection for things regex can't catch — like knowing "Springfield" is an address in one sentence and a company name in another.

Not Just Redaction

Most tools only redact. Blindfold gives us six options:

  • tokenize <Person_1> — Send to LLM, restore later. Reversible.
  • redact [REDACTED] — Gone forever. For logs, indexes, storage.
  • mask J*** D** — Show partial data to end users.
  • hash a1b2c3d4... — Analytics and dedup without exposing data.
  • synthesize Jane Smith — Realistic fake data for testing.
  • encrypt enc:x8f2k... — Reversible with a key. For compliance archives.

Redaction is fine for logs. But for LLM pipelines, we need tokenize — the model works with placeholders, and we restore the originals after. One-way redaction breaks the workflow.
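Masking and hashing are the easiest of the six to reason about in isolation. A rough sketch of the two transforms, with output formats chosen here for illustration (Blindfold's exact formats may differ):

```python
import hashlib

def mask(value: str) -> str:
    """Keep the first character of each word, star out the rest."""
    return " ".join(w[0] + "*" * (len(w) - 1) for w in value.split())

def hash_pii(value: str) -> str:
    """Stable one-way digest: same input, same token, so dedup and joins work.
    For real compliance use a keyed hash or salt to resist dictionary attacks
    on low-entropy PII like phone numbers."""
    return hashlib.sha256(value.encode()).hexdigest()[:8]

print(mask("John Doe"))           # "J*** D**"
print(hash_pii("john@acme.com"))  # same 8-hex-char prefix on every run
```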

Built-In Compliance Policies

Instead of manually listing which entity types to detect, pick a policy:

| Policy | Targets | Best For |
| --- | --- | --- |
| basic | Names, emails, phones, locations | General apps |
| gdpr_eu | + IBANs, addresses, dates of birth | EU compliance |
| hipaa_us | + SSNs, MRNs, medical terms | Healthcare |
| pci_dss | + Card numbers, CVVs, expiry dates | Payment processing |
| strict | All entity types, lower threshold | Maximum coverage |

One parameter. Done.

python
result = bf.tokenize(text, policy="hipaa_us")

Why Not Build It Yourself?

I tried. Here's what happens:

  1. We start with a regex for emails. Easy.
  2. Add phone numbers. Now we need 20+ international formats.
  3. Credit cards need Luhn validation. SSNs need range checks.
  4. IBANs are different for every country. 30+ formats.
  5. Person names? Regex can't do that. Now we need NLP.
  6. Six months later, we have a fragile pile of regexes that misses half the edge cases and we're maintaining it instead of building our product.
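Step 1 already hides sharp edges. A naive email regex looks fine on toy inputs and quietly fails on ordinary real-world addresses:

```python
import re

NAIVE_EMAIL = re.compile(r"^\w+@\w+\.\w+$")

print(bool(NAIVE_EMAIL.match("sarah@example.com")))
# True: looks done

print(bool(NAIVE_EMAIL.match("john.doe+tag@mail.example.co.uk")))
# False: \w+ can't handle dots, plus-tags, or subdomains
```

And that is the easiest entity type. Phone numbers, IBANs, and names only get worse from here.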

Blindfold ships 86 regex detectors covering 80+ entity types out of the box. The cloud API adds AI detection for names and context-dependent entities. It's tested, it's maintained, and it's not our problem.

Data Residency

For regulated industries, where data is processed matters:

  • region="eu" → Frankfurt, Germany
  • region="us" → Virginia, US

python
bf = Blindfold(region="eu")

Data stays in the region. No transatlantic transfers. GDPR auditors stop asking questions.

It's Not Just Python

  • JavaScript/TypeScript: npm install @blindfold/sdk
  • Go: go get github.com/blindfold-dev/blindfold-go
  • Java: Maven Central dev.blindfold:blindfold-sdk
  • .NET: NuGet Blindfold.Sdk
  • CLI: npm install -g @blindfold/cli
  • MCP Server: npm install @blindfold/mcp-server (for AI agents)
  • LangChain: pip install langchain-blindfold

Same API, same policies, same token format across all of them.

Pricing Reality

  • Local mode: Free. Forever. No API key. No limits.
  • Cloud API free tier: 500K characters/month. No credit card.
  • Paid plans: Usage-based. Pay for what we use.

No $200/month minimums. No enterprise-only tiers to get basic features. Local mode alone covers most use cases — the cloud API is there when we need AI-powered name and address detection.

Get Started

python
from blindfold import Blindfold

bf = Blindfold()
result = bf.tokenize("Call me at 555-0123, my SSN is 123-45-6789")
print(result.text)
# "Call me at <Phone Number_1>, my SSN is <Social Security Number_1>"

Three lines. No API key. Run it right now.

Start protecting sensitive data

Free plan includes 500K characters/month. No credit card required.