PII Detection in Python: Regex vs. Presidio vs. Managed API
An honest comparison of three approaches to PII detection in Python: hand-rolled regex, Microsoft Presidio with spaCy, and a managed API. Includes working code, accuracy tradeoffs, and a decision framework.
If you are building an AI application that handles user data, you need PII detection somewhere in your pipeline. The question is not whether to add it, but how. You have three realistic options: write regex patterns yourself, use an open-source NLP library like Microsoft Presidio, or call a managed API that handles detection and tokenization for you.
Each approach involves real tradeoffs in accuracy, infrastructure overhead, language coverage, and long-term maintenance. This article walks through all three with working Python code, shows what each one catches and misses, and gives you a framework for deciding which fits your situation.
We will be honest about the strengths and weaknesses of every approach — including the managed API option, which is our own product (Blindfold). If regex or Presidio is the right fit for your use case, this article will help you see that too.
Approach 1: Hand-Rolled Regex
The most common starting point is regular expressions. Most teams begin here because it requires zero dependencies and you can get a working prototype in an afternoon. Here is a realistic implementation that covers four of the most common PII patterns:
```python
import re
from typing import Dict, List


class RegexPIIDetector:
    """Simple PII detector using regular expressions."""

    PATTERNS: Dict[str, re.Pattern] = {
        "email": re.compile(
            r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
        ),
        "ssn": re.compile(
            r"\b\d{3}-\d{2}-\d{4}\b"
        ),
        "phone_us": re.compile(
            r"\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"
        ),
        "credit_card": re.compile(
            r"\b(?:\d{4}[- ]?){3}\d{4}\b"
        ),
    }

    def detect(self, text: str) -> List[Dict]:
        findings = []
        for entity_type, pattern in self.PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append({
                    "type": entity_type,
                    "text": match.group(),
                    "start": match.start(),
                    "end": match.end(),
                })
        return findings

    def redact(self, text: str) -> str:
        for entity_type, pattern in self.PATTERNS.items():
            text = pattern.sub(f"[{entity_type}]", text)
        return text


# Usage
detector = RegexPIIDetector()
text = "Contact John at john.doe@example.com or 555-867-5309. SSN: 123-45-6789."

findings = detector.detect(text)
for f in findings:
    print(f"{f['type']}: {f['text']}")
# email: john.doe@example.com
# ssn: 123-45-6789
# phone_us: 555-867-5309

print(detector.redact(text))
# Contact John at [email] or [phone_us]. SSN: [ssn].
```
This works. It catches the email, phone number, and SSN correctly. But notice what it does not catch: the name "John" passes through completely undetected. That is the fundamental limitation of regex — it cannot detect entities that do not follow a fixed pattern.
What Regex Catches Well
- Structured identifiers with predictable formats: SSNs, credit card numbers, email addresses, IP addresses
- US phone numbers in standard formats (though international formats require dozens of additional patterns)
- Simple pattern-based data like dates in specific formats or zip codes
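For card numbers specifically, a checksum can cut false positives: a 16-digit string that passes the Luhn check is far more likely to be a real card number than random digits. A small self-contained sketch you could run candidate matches through:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # real card numbers are 13-19 digits
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0


print(luhn_valid("4242 4242 4242 4242"))  # True — a well-known test card number
print(luhn_valid("1234 5678 9012 3456"))  # False — fails the checksum
```

This does not help with names or addresses, but it keeps the structured-pattern tier of a regex detector honest.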
What Regex Misses
- Names — "Maria Schmidt", "Jean-Pierre Dubois", and "Rajesh Krishnamurthy" have no common pattern. You cannot write a regex that catches all names without also matching random words.
- Addresses — "42 Elm Street, Springfield, IL 62704" is structurally similar to any sentence with a number followed by words. Regex either over-matches (flagging normal text) or under-matches (missing unusual formats).
- Context-dependent entities — the string "123-45-6789" is an SSN, but "123-456-789" could be an order number. Regex has no way to use surrounding context to disambiguate.
- International formats — phone numbers, national IDs, and postal codes vary dramatically across countries. Supporting even 10 countries requires hundreds of patterns with ongoing maintenance.
The maintenance trap: Regex-based PII detection tends to grow into a sprawling set of patterns that no single developer fully understands. Each new edge case (a phone number format, a country-specific ID) adds another pattern. After a year, you have hundreds of regex patterns, each with subtle interaction effects, and no confidence that you are catching everything.
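One way to push back the context problem, at least for a while, is a cue-word check: accept a pattern match only when a related keyword appears just before it. A rough sketch (the cue list and window size are illustrative, not exhaustive):

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SSN_CUES = ("ssn", "social security")  # illustrative cue words


def find_ssns_with_context(text: str, window: int = 30) -> list[str]:
    """Accept an SSN-shaped match only if a cue word appears shortly before it."""
    hits = []
    lowered = text.lower()
    for match in SSN_PATTERN.finditer(text):
        context = lowered[max(0, match.start() - window):match.start()]
        if any(cue in context for cue in SSN_CUES):
            hits.append(match.group())
    return hits


print(find_ssns_with_context("SSN: 123-45-6789"))      # ['123-45-6789']
print(find_ssns_with_context("Ref code 987-65-4321"))  # []
```

This reduces false positives but adds yet another layer of hand-tuned rules per entity type, which is exactly the maintenance trap described above.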
Approach 2: Microsoft Presidio
Microsoft Presidio is an open-source PII detection framework that combines regex with NLP-based Named Entity Recognition (NER) via spaCy. It is a significant step up from raw regex because the NER model can identify names, locations, and organizations based on linguistic context rather than character patterns.
```python
# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize — loads spaCy model into memory (~400 MB for en_core_web_lg)
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact John Smith at john.smith@example.com or 555-867-5309. SSN: 123-45-6789."

# Detect PII entities
results = analyzer.analyze(
    text=text,
    language="en",
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"],
)

for result in results:
    print(f"{result.entity_type}: {text[result.start:result.end]} (score: {result.score:.2f})")
# PERSON: John Smith (score: 0.85)
# EMAIL_ADDRESS: john.smith@example.com (score: 1.00)
# PHONE_NUMBER: 555-867-5309 (score: 0.75)
# US_SSN: 123-45-6789 (score: 0.85)

# Anonymize (replace PII with placeholders)
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. SSN: <US_SSN>.
```
This is a real improvement. Presidio catches "John Smith" as a person name because spaCy's NER model understands that two capitalized words following "Contact" are likely a name. The regex approach could not do this.
Presidio Strengths
- NER-based name detection — catches person names, organizations, and locations that regex misses entirely
- Confidence scores — each detection includes a probability score, letting you set thresholds for your use case
- Open source — you can inspect the code, customize recognizers, and contribute upstream
- Extensible — you can add custom recognizers for domain-specific identifiers
- Self-hosted — data never leaves your infrastructure
Presidio Limitations
- Infrastructure overhead — the `en_core_web_lg` model is around 400 MB. For production use, you typically need a GPU-backed server or container to handle latency requirements. This means provisioning, monitoring, and scaling a separate service.
- Language coverage — spaCy's NER accuracy varies significantly by language. English is strong, but performance drops for languages with fewer training resources. Each language requires downloading a separate model.
- No built-in tokenization — Presidio anonymizes (replaces PII with generic labels like `<PERSON>`) but does not provide consistent token mapping. If your text contains two different people, both become `<PERSON>` unless you build custom logic to generate `<PERSON_1>`, `<PERSON_2>`, etc. and maintain a mapping table for detokenization.
- No detokenization — Presidio only goes one direction. To restore original PII values in an LLM response, you need to build and maintain your own reverse mapping system.
- Model updates — you are responsible for updating spaCy models, testing for regressions, and redeploying when new versions are released.
The tokenization gap: For AI pipelines, you typically need more than anonymization. You need tokenization (replace PII with consistent, numbered tokens) and detokenization (restore original values in the LLM response). Presidio gives you anonymization. The tokenization and detokenization layers are something you build yourself.
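To make the gap concrete, here is a minimal stdlib sketch of the numbered-token layer you would otherwise have to build yourself. The hard-coded spans stand in for analyzer output; a production version would also need overlap handling and persistent storage for the mapping:

```python
from collections import defaultdict


def tokenize(text: str, spans: list[tuple[int, int, str]]) -> tuple[str, dict[str, str]]:
    """Replace detected spans with numbered tokens; identical values share a token.

    `spans` are (start, end, entity_type) triples — stand-ins for analyzer output.
    """
    counters: dict[str, int] = defaultdict(int)
    value_to_token: dict[str, str] = {}
    out, cursor = [], 0
    for start, end, entity_type in sorted(spans):
        value = text[start:end]
        if value not in value_to_token:
            counters[entity_type] += 1
            value_to_token[value] = f"<{entity_type}_{counters[entity_type]}>"
        out.append(text[cursor:start])
        out.append(value_to_token[value])
        cursor = end
    out.append(text[cursor:])
    token_map = {token: value for value, token in value_to_token.items()}
    return "".join(out), token_map


def detokenize(text: str, token_map: dict[str, str]) -> str:
    """Restore original values in an LLM response."""
    for token, value in token_map.items():
        text = text.replace(token, value)
    return text


text = "Ask Alice to email Alice and Bob."
spans = [(4, 9, "PERSON"), (19, 24, "PERSON"), (29, 32, "PERSON")]
tokenized, token_map = tokenize(text, spans)
print(tokenized)
# Ask <PERSON_1> to email <PERSON_1> and <PERSON_2>.
print(detokenize(tokenized, token_map) == text)  # True — round trip restores the original
```

Even this toy version has to track counters, deduplicate repeated entities, and carry the map to the detokenization step. That is the machinery a managed tokenization service absorbs.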
Approach 3: Managed API (Blindfold)
A managed API handles the entire PII lifecycle — detection, tokenization, and detokenization — as a service. You send text, get back tokenized text with a mapping, and later reverse the mapping to restore original values. Here is the equivalent workflow using Blindfold:
```python
# pip install blindfold-sdk
from blindfold import Blindfold
from openai import OpenAI

bf = Blindfold(api_key="your-api-key")
openai_client = OpenAI()

text = "Contact John Smith at john.smith@example.com or 555-867-5309. SSN: 123-45-6789."

# Step 1: Tokenize — detect and replace all PII
protected = bf.tokenize(text)
print(protected.text)
# Contact <Person_1> at <Email Address_1> or <Phone Number_1>. SSN: <Ssn_1>.

# Each entity gets a unique, numbered token
for entity in protected.entities:
    print(f"{entity.type}: {entity.text} -> {entity.token}")
# Person: John Smith -> <Person_1>
# Email Address: john.smith@example.com -> <Email Address_1>
# Phone Number: 555-867-5309 -> <Phone Number_1>
# Ssn: 123-45-6789 -> <Ssn_1>

# Step 2: Send tokenized text to the LLM
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": protected.text}],
)
llm_output = response.choices[0].message.content

# Step 3: Detokenize — restore original PII in the response
final = bf.detokenize(llm_output, protected.token_map)
print(final.text)
# The LLM's response now contains "John Smith", "john.smith@example.com", etc.
```
Managed API Strengths
- No infrastructure to manage — no models to download, no GPU servers to provision, no containers to scale. It is an HTTP API call.
- 18+ languages — English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Czech, Slovak, Danish, Swedish, Norwegian, Romanian, and more. All supported without downloading separate models.
- Consistent tokenization — each entity gets a unique, numbered token (`<Person_1>`, `<Person_2>`) automatically. If the same entity appears multiple times in the text, it receives the same token. This consistency is critical for LLM comprehension.
- Built-in detokenization — the reverse mapping is handled for you. Pass the LLM's output and the token map, and original values are restored automatically.
- Compliance-ready policies — pre-built detection policies like `gdpr_eu` and `hipaa_us` configure entity types automatically for specific regulatory frameworks, with audit logging included.
- Regional processing — choose EU or US region to control where PII is processed, satisfying data residency requirements without any architecture changes.
Managed API Limitations
- External dependency — your PII detection depends on a third-party service. If the API goes down, your pipeline is blocked (though the same is true of the LLM provider you are already depending on).
- Network latency — each tokenize/detokenize call adds an HTTP round trip. In practice this is typically under 100ms, but it is nonzero.
- Data leaves your infrastructure — the text is sent to Blindfold for processing. For regulated environments, the regional processing option (EU or US) and audit logging mitigate this, but some organizations require fully on-premise processing.
- Cost at scale — the free tier includes 1M characters per month. Beyond that, you pay per character processed. For very high volume workloads, self-hosted options may be more cost-effective.
Side-by-Side Comparison
Here is how the three approaches compare across the dimensions that matter most in production:
| Dimension | Regex | Presidio | Managed API |
|---|---|---|---|
| Name detection | None | Good (English) | Good (18+ languages) |
| Pattern detection (SSN, CC) | Strong | Strong | Strong |
| Address detection | Poor | Moderate | Good |
| Language support | Manual per language | Varies by spaCy model | 18+ languages built-in |
| Setup time | Minutes | Hours to days | Minutes |
| Infrastructure | None | GPU server recommended | None (API call) |
| Ongoing maintenance | High (pattern growth) | Medium (model updates) | None |
| Tokenization | Build yourself | Build yourself | Built-in (numbered tokens) |
| Detokenization | Build yourself | Build yourself | Built-in |
| Compliance policies | None | None | GDPR, HIPAA, etc. |
| Audit logging | Build yourself | Build yourself | Built-in |
| Cost | Dev time only | Infra + dev time | Free tier (1M chars/mo), then usage-based |
When to Use Which
There is no universal best choice. The right approach depends on your specific requirements, team capabilities, and constraints.
Use Regex When...
- You only need to detect structured patterns like credit card numbers, SSNs, or email addresses — not names, addresses, or context-dependent entities
- Your text is in a single, well-defined format (e.g., structured logs or form data) where PII always appears in predictable positions
- You need a quick validation layer and plan to add a more robust solution later
- You have zero tolerance for external dependencies or network calls (embedded systems, air-gapped environments)
Use Presidio When...
- You have a strict on-premise requirement where data cannot leave your infrastructure under any circumstances
- Your team has ML engineering capacity to manage model deployment, monitor performance, and handle version upgrades
- You primarily work with English text (where spaCy's NER performance is strongest)
- You need deep customization of recognizers for highly specialized entity types that no existing service covers
- You are building the tokenization and detokenization layers yourself and have the engineering bandwidth to maintain them
Use a Managed API When...
- You want production-ready PII protection without infrastructure overhead — install a package, add three lines of code, and move on to your actual product
- Your application handles multilingual text — support tickets in German, user messages in French, medical notes in Portuguese
- You need the full tokenize/detokenize workflow for AI pipelines, not just detection or anonymization
- Compliance matters — GDPR, HIPAA, or other regulatory frameworks require specific entity coverage, audit trails, and data residency controls
- Your team is small and you would rather spend engineering time on your core product than on maintaining PII detection infrastructure
Migrating from Presidio to a Managed API
If you are currently using Presidio and considering switching to a managed API, the migration is straightforward. The main simplification is that tokenization, detokenization, and entity consistency are handled for you. Here is a before-and-after comparison:
Before: Presidio with Custom Tokenization
```python
from collections import defaultdict

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

text = "Email John Smith at john@example.com and Jane Doe at jane@example.com."

# Detect entities
results = analyzer.analyze(text=text, language="en")

# Build custom token mapping (Presidio doesn't do this for you)
counters = defaultdict(int)
token_map = {}
for result in sorted(results, key=lambda r: r.start):
    original = text[result.start:result.end]
    if original not in token_map:
        counters[result.entity_type] += 1
        token = f"<{result.entity_type}_{counters[result.entity_type]}>"
        token_map[original] = token

# Manual string replacement to create tokenized text
tokenized = text
for original, token in sorted(token_map.items(), key=lambda x: len(x[0]), reverse=True):
    tokenized = tokenized.replace(original, token)

# ... send tokenized text to LLM, get response ...

# Manual detokenization of the LLM response
reverse_map = {v: k for k, v in token_map.items()}
detokenized = llm_response
for token, original in reverse_map.items():
    detokenized = detokenized.replace(token, original)
```
After: Blindfold
```python
from blindfold import Blindfold

bf = Blindfold(api_key="your-api-key")

text = "Email John Smith at john@example.com and Jane Doe at jane@example.com."

# Detect, tokenize, and get mapping — one call
protected = bf.tokenize(text)
# protected.text: "Email <Person_1> at <Email Address_1> and <Person_2> at <Email Address_2>."

# ... send protected.text to LLM, get response ...

# Detokenize — one call
final = bf.detokenize(llm_response, protected.token_map)
```
The core difference: the Presidio version requires you to build and maintain the tokenization logic, counter management, string replacement, and reverse mapping yourself. With a managed API, all of that is a single method call. The detection, numbering, consistency, and reversal are handled internally.
You also remove the infrastructure dependency. No more `python -m spacy download` steps in your Docker builds, no 400 MB model files in your container images, and no GPU provisioning for production latency targets.
Accuracy in Practice: A Realistic Example
To make this concrete, consider a paragraph of text that a real user might enter into an AI-powered customer support tool:
"Hi, my name is Marie-Claire Fontaine and I live at 14 Rue de Rivoli, 75001 Paris. My phone number is +33 1 42 60 30 00 and my email is mc.fontaine@mail.fr. I'd like to dispute a charge on my card ending in 4242. My account number is FR7630006000011234567890189."
Here is what each approach detects:
- Regex: catches the email and possibly the IBAN (with a custom pattern). Misses the French name, the French address, and the French phone number format unless you have specifically written patterns for each. The partial card number ("4242") is a four-digit sequence that a credit card regex would not match.
- Presidio: catches the email, likely catches the name (spaCy is decent with hyphenated French names), may catch the phone number if the French format recognizer is configured. The address and IBAN detection depend on whether you have added custom recognizers for French formats. Out of the box with the English model, coverage is incomplete.
- Blindfold: catches the name, address, phone number, email, and IBAN across French text without language-specific configuration. The multilingual model handles entity detection regardless of the input language.
This is not to say Blindfold is perfect — no PII detection system catches 100% of entities in all contexts. But the difference in out-of-the-box multilingual coverage is significant when your application handles text from users across different countries.
Decision Framework
If you are still deciding, here is a quick decision tree:
- Do you only need to detect structured patterns (emails, credit cards, SSNs) and never names or addresses? → Regex is fine. Keep it simple.
- Must all data processing happen on your own servers with zero external calls? → Presidio is your best option. Budget for the infrastructure and engineering time.
- Do you need multilingual NER, tokenization, detokenization, and compliance features without managing infrastructure? → A managed API is the most practical choice.
Many teams start with regex, realize they need name and address detection, migrate to Presidio, then find the infrastructure and tokenization overhead is not worth maintaining alongside their core product. The progression is common and each step is reasonable at the time.
Hybrid approaches work too. Some teams use regex for high-confidence pattern matching (credit cards, SSNs) as a first pass, and then run the remaining text through an NER-based system for name and address detection. This can reduce API calls while maintaining coverage. Blindfold's detect endpoint (detection without tokenization) is useful for this pattern.
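A minimal shape for that two-pass hybrid, with `ner_detect` as a stand-in for whichever NER-based second pass you choose (an API call, a Presidio service, etc.):

```python
import re

# First pass: high-confidence structured patterns handled locally
LOCAL_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}


def ner_detect(text: str) -> str:
    """Stand-in for the NER-based second pass (names, addresses, etc.)."""
    return text  # no-op placeholder


def hybrid_redact(text: str) -> str:
    # Pass 1: cheap local regex for structured identifiers
    for entity_type, pattern in LOCAL_PATTERNS.items():
        text = pattern.sub(f"[{entity_type}]", text)
    # Pass 2: send the partially redacted text to an NER system
    return ner_detect(text)


print(hybrid_redact("Card 4242 4242 4242 4242, SSN 123-45-6789, owner Jane Doe."))
# Card [credit_card], SSN [ssn], owner Jane Doe.  (the name is left for the NER pass)
```

The first pass shrinks what the second pass has to examine, which is where the API-call savings come from.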
Try It Yourself
The best way to evaluate is to run your own text through each approach and compare the results. You can get started with Blindfold in under a minute:
```python
# Install
# pip install blindfold-sdk

from blindfold import Blindfold

bf = Blindfold(api_key="your-api-key")

# Try with your own text
result = bf.tokenize("Your test text here with names, emails, etc.")
print(result.text)
print(result.entities)
```
- Free tier: 1M characters per month — enough to thoroughly evaluate before committing
- Sign up — get an API key in seconds
- Documentation — full API reference, SDK guides, and cookbook examples
- Live demo — test PII detection directly in your browser without signing up
- Cookbook — ready-to-run examples with OpenAI, LangChain, FastAPI, and more
Start protecting sensitive data
Free plan includes 1M characters/month. No credit card required.