PII Detection in Python: Regex vs. Presidio vs. Managed API
An honest comparison of three approaches to PII detection in Python: hand-rolled regex, Microsoft Presidio with spaCy, and a managed API. Includes working code, accuracy tradeoffs, and a decision framework.
If you are building an AI application that handles user data, you need PII detection somewhere in your pipeline. The question is not whether to add it, but how. You have three realistic options: write regex patterns yourself, use an open-source NLP library like Microsoft Presidio, or call a managed API that handles detection and tokenization for you.
Each approach involves real tradeoffs in accuracy, infrastructure overhead, language coverage, and long-term maintenance. This article walks through all three with working Python code, shows what each one catches and misses, and gives you a framework for deciding which fits your situation.
We will be honest about the strengths and weaknesses of every approach — including the managed API option, which is our own product (Blindfold). If regex or Presidio is the right fit for your use case, this article will help you see that too.
Approach 1: Hand-Rolled Regex
The most common starting point is regular expressions. Most teams begin here because it requires zero dependencies and you can get a working prototype in an afternoon. Here is a realistic implementation that covers four of the most common PII patterns:
```python
import re
from typing import Dict, List


class RegexPIIDetector:
    """Simple PII detector using regular expressions."""

    PATTERNS: Dict[str, re.Pattern] = {
        "email": re.compile(
            r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
        ),
        "ssn": re.compile(
            r"\b\d{3}-\d{2}-\d{4}\b"
        ),
        "phone_us": re.compile(
            r"\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"
        ),
        "credit_card": re.compile(
            r"\b(?:\d{4}[- ]?){3}\d{4}\b"
        ),
    }

    def detect(self, text: str) -> List[Dict]:
        findings = []
        for entity_type, pattern in self.PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append({
                    "type": entity_type,
                    "text": match.group(),
                    "start": match.start(),
                    "end": match.end(),
                })
        return findings

    def redact(self, text: str) -> str:
        for entity_type, pattern in self.PATTERNS.items():
            text = pattern.sub(f"[{entity_type}]", text)
        return text


# Usage
detector = RegexPIIDetector()
text = "Contact John at john.doe@example.com or 555-867-5309. SSN: 123-45-6789."

findings = detector.detect(text)
for f in findings:
    print(f"{f['type']}: {f['text']}")
# email: john.doe@example.com
# ssn: 123-45-6789
# phone_us: 555-867-5309

print(detector.redact(text))
# Contact John at [email] or [phone_us]. SSN: [ssn].
```
This works. It catches the email, phone number, and SSN correctly. But notice what it does not catch: the name "John" passes through completely undetected. That is the fundamental limitation of regex — it cannot detect entities that do not follow a fixed pattern.
What Regex Catches Well
- Structured identifiers with predictable formats: SSNs, credit card numbers, email addresses, IP addresses
- US phone numbers in standard formats (though international formats require dozens of additional patterns)
- Simple pattern-based data like dates in specific formats or zip codes
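For card numbers specifically, a checksum can cut false positives: a 16-digit string that passes the Luhn check is far more likely to be a real card number than random digits. A small self-contained sketch you could run candidate matches through:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # real card numbers are 13-19 digits
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0


print(luhn_valid("4242 4242 4242 4242"))  # True — a well-known test card number
print(luhn_valid("1234 5678 9012 3456"))  # False — fails the checksum
```

This does not help with names or addresses, but it keeps the structured-pattern tier of a regex detector honest.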
What Regex Misses
- Names — "Maria Schmidt", "Jean-Pierre Dubois", and "Rajesh Krishnamurthy" have no common pattern. You cannot write a regex that catches all names without also matching random words.
- Addresses — "42 Elm Street, Springfield, IL 62704" is structurally similar to any sentence with a number followed by words. Regex either over-matches (flagging normal text) or under-matches (missing unusual formats).
- Context-dependent entities — the string "123-45-6789" is an SSN, but "123-456-789" could be an order number. Regex has no way to use surrounding context to disambiguate.
- International formats — phone numbers, national IDs, and postal codes vary dramatically across countries. Supporting even 10 countries requires hundreds of patterns with ongoing maintenance.
The maintenance trap: Regex-based PII detection tends to grow into a sprawling set of patterns that no single developer fully understands. Each new edge case (a phone number format, a country-specific ID) adds another pattern. After a year, you have hundreds of regex patterns, each with subtle interaction effects, and no confidence that you are catching everything.
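One way to push back the context problem, at least for a while, is a cue-word check: accept a pattern match only when a related keyword appears just before it. A rough sketch (the cue list and window size are illustrative, not exhaustive):

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SSN_CUES = ("ssn", "social security")  # illustrative cue words


def find_ssns_with_context(text: str, window: int = 30) -> list[str]:
    """Accept an SSN-shaped match only if a cue word appears shortly before it."""
    hits = []
    lowered = text.lower()
    for match in SSN_PATTERN.finditer(text):
        context = lowered[max(0, match.start() - window):match.start()]
        if any(cue in context for cue in SSN_CUES):
            hits.append(match.group())
    return hits


print(find_ssns_with_context("SSN: 123-45-6789"))      # ['123-45-6789']
print(find_ssns_with_context("Ref code 987-65-4321"))  # []
```

This reduces false positives but adds yet another layer of hand-tuned rules per entity type, which is exactly the maintenance trap described above.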
Approach 2: Microsoft Presidio
Microsoft Presidio is an open-source PII detection framework that combines regex with NLP-based Named Entity Recognition (NER) via spaCy. It is a significant step up from raw regex because the NER model can identify names, locations, and organizations based on linguistic context rather than character patterns.
```python
# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize — loads spaCy model into memory (~400 MB for en_core_web_lg)
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact John Smith at john.smith@example.com or 555-867-5309. SSN: 123-45-6789."

# Detect PII entities
results = analyzer.analyze(
    text=text,
    language="en",
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"],
)

for result in results:
    print(f"{result.entity_type}: {text[result.start:result.end]} (score: {result.score:.2f})")
# PERSON: John Smith (score: 0.85)
# EMAIL_ADDRESS: john.smith@example.com (score: 1.00)
# PHONE_NUMBER: 555-867-5309 (score: 0.75)
# US_SSN: 123-45-6789 (score: 0.85)

# Anonymize (replace PII with placeholders)
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. SSN: <US_SSN>.
```
This is a real improvement. Presidio catches "John Smith" as a person name because spaCy's NER model understands that two capitalized words following "Contact" are likely a name. The regex approach could not do this.
Presidio Strengths
- NER-based name detection — catches person names, organizations, and locations that regex misses entirely
- Confidence scores — each detection includes a probability score, letting you set thresholds for your use case
- Open source — you can inspect the code, customize recognizers, and contribute upstream
- Extensible — you can add custom recognizers for domain-specific identifiers
- Self-hosted — data never leaves your infrastructure
Presidio Limitations
- Infrastructure overhead — the `en_core_web_lg` model is around 400 MB. For production use, you typically need a GPU-backed server or container to handle latency requirements. This means provisioning, monitoring, and scaling a separate service.
- Language coverage — spaCy's NER accuracy varies significantly by language. English is strong, but performance drops for languages with fewer training resources. Each language requires downloading a separate model.
- No built-in tokenization — Presidio anonymizes (replaces PII with generic labels like `<PERSON>`) but does not provide consistent token mapping. If your text contains two different people, both become `<PERSON>` unless you build custom logic to generate `<PERSON_1>`, `<PERSON_2>`, etc. and maintain a mapping table for detokenization.
- No detokenization — Presidio only goes one direction. To restore original PII values in an LLM response, you need to build and maintain your own reverse mapping system.
- Model updates — you are responsible for updating spaCy models, testing for regressions, and redeploying when new versions are released.
The tokenization gap: For AI pipelines, you typically need more than anonymization. You need tokenization (replace PII with consistent, numbered tokens) and detokenization (restore original values in the LLM response). Presidio gives you anonymization. The tokenization and detokenization layers are something you build yourself.
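To make the gap concrete, here is a minimal stdlib sketch of the numbered-token layer you would otherwise have to build yourself. The hard-coded spans stand in for analyzer output; a production version would also need overlap handling and persistent storage for the mapping:

```python
from collections import defaultdict


def tokenize(text: str, spans: list[tuple[int, int, str]]) -> tuple[str, dict[str, str]]:
    """Replace detected spans with numbered tokens; identical values share a token.

    `spans` are (start, end, entity_type) triples — stand-ins for analyzer output.
    """
    counters: dict[str, int] = defaultdict(int)
    value_to_token: dict[str, str] = {}
    out, cursor = [], 0
    for start, end, entity_type in sorted(spans):
        value = text[start:end]
        if value not in value_to_token:
            counters[entity_type] += 1
            value_to_token[value] = f"<{entity_type}_{counters[entity_type]}>"
        out.append(text[cursor:start])
        out.append(value_to_token[value])
        cursor = end
    out.append(text[cursor:])
    token_map = {token: value for value, token in value_to_token.items()}
    return "".join(out), token_map


def detokenize(text: str, token_map: dict[str, str]) -> str:
    """Restore original values in an LLM response."""
    for token, value in token_map.items():
        text = text.replace(token, value)
    return text


text = "Ask Alice to email Alice and Bob."
spans = [(4, 9, "PERSON"), (19, 24, "PERSON"), (29, 32, "PERSON")]
tokenized, token_map = tokenize(text, spans)
print(tokenized)
# Ask <PERSON_1> to email <PERSON_1> and <PERSON_2>.
print(detokenize(tokenized, token_map) == text)  # True — round trip restores the original
```

Even this toy version has to track counters, deduplicate repeated entities, and carry the map to the detokenization step. That is the machinery a managed tokenization service absorbs.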
Approach 3: Managed API (Blindfold)
A managed API handles the entire PII lifecycle — detection, tokenization, and detokenization — as a service. You send text, get back tokenized text with a mapping, and later reverse the mapping to restore original values. Here is the equivalent workflow using Blindfold:
```python
# pip install blindfold-sdk
from blindfold import Blindfold
from openai import OpenAI

bf = Blindfold(api_key="your-api-key")
openai_client = OpenAI()

text = "Contact John Smith at john.smith@example.com or 555-867-5309. SSN: 123-45-6789."

# Step 1: Tokenize — detect and replace all PII
protected = bf.tokenize(text)
print(protected.text)
# Contact <Person_1> at <Email Address_1> or <Phone Number_1>. SSN: <Ssn_1>.

# Each entity gets a unique, numbered token
for entity in protected.entities:
    print(f"{entity.type}: {entity.text} -> {entity.token}")
# Person: John Smith -> <Person_1>
# Email Address: john.smith@example.com -> <Email Address_1>
# Phone Number: 555-867-5309 -> <Phone Number_1>
# Ssn: 123-45-6789 -> <Ssn_1>

# Step 2: Send tokenized text to the LLM
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": protected.text}],
)
llm_output = response.choices[0].message.content

# Step 3: Detokenize — restore original PII in the response
final = bf.detokenize(llm_output, protected.token_map)
print(final.text)
# The LLM's response now contains "John Smith", "john.smith@example.com", etc.
```
Managed API Strengths
- No infrastructure to manage — no models to download, no GPU servers to provision, no containers to scale. It is an HTTP API call.
- 18+ languages — English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Czech, Slovak, Danish, Swedish, Norwegian, Romanian, and more. All supported without downloading separate models.
- Consistent tokenization — each entity gets a unique, numbered token (`<Person_1>`, `<Person_2>`) automatically. If the same entity appears multiple times in the text, it receives the same token. This consistency is critical for LLM comprehension.
- Built-in detokenization — the reverse mapping is handled for you. Pass the LLM's output and the token map, and original values are restored automatically.
- Compliance-ready policies — pre-built detection policies like `gdpr_eu` and `hipaa_us` configure entity types automatically for specific regulatory frameworks, with audit logging included.
- Regional processing — choose EU or US region to control where PII is processed, satisfying data residency requirements without any architecture changes.
Managed API Limitations
- External dependency — your PII detection depends on a third-party service. If the API goes down, your pipeline is blocked (though the same is true of the LLM provider you are already depending on).
- Network latency — each tokenize/detokenize call adds an HTTP round trip. In practice this is typically under 100ms, but it is nonzero.
- Data leaves your infrastructure — the text is sent to Blindfold for processing. For regulated environments, the regional processing option (EU or US) and audit logging mitigate this, but some organizations require fully on-premise processing.
- Cost at scale — the free tier includes 1M characters per month. Beyond that, you pay per character processed. For very high volume workloads, self-hosted options may be more cost-effective.
Side-by-Side Comparison
Here is how the three approaches compare across the dimensions that matter most in production:
| Dimension | Regex | Presidio | Managed API |
|---|---|---|---|
| Name detection | None | Good (English) | Good (18+ languages) |
| Pattern detection (SSN, CC) | Strong | Strong | Strong |
| Address detection | Poor | Moderate | Good |
| Language support | Manual per language | Varies by spaCy model | 18+ languages built-in |
| Setup time | Minutes | Hours to days | Minutes |
| Infrastructure | None | GPU server recommended | None (API call) |
| Ongoing maintenance | High (pattern growth) | Medium (model updates) | None |
| Tokenization | Build yourself | Build yourself | Built-in (numbered tokens) |
| Detokenization | Build yourself | Build yourself | Built-in |
| Compliance policies | None | None | GDPR, HIPAA, etc. |
| Audit logging | Build yourself | Build yourself | Built-in |
| Cost | Dev time only | Infra + dev time | Free tier (1M chars/mo), then usage-based |
When to Use Which
There is no universal best choice. The right approach depends on your specific requirements, team capabilities, and constraints.
Use Regex When...
- You only need to detect structured patterns like credit card numbers, SSNs, or email addresses — not names, addresses, or context-dependent entities
- Your text is in a single, well-defined format (e.g., structured logs or form data) where PII always appears in predictable positions
- You need a quick validation layer and plan to add a more robust solution later
- You have zero tolerance for external dependencies or network calls (embedded systems, air-gapped environments)
Use Presidio When...
- You have a strict on-premise requirement where data cannot leave your infrastructure under any circumstances
- Your team has ML engineering capacity to manage model deployment, monitor performance, and handle version upgrades
- You primarily work with English text (where spaCy's NER performance is strongest)
- You need deep customization of recognizers for highly specialized entity types that no existing service covers
- You are building the tokenization and detokenization layers yourself and have the engineering bandwidth to maintain them
Use a Managed API When...
- You want production-ready PII protection without infrastructure overhead — install a package, add three lines of code, and move on to your actual product
- Your application handles multilingual text — support tickets in German, user messages in French, medical notes in Portuguese
- You need the full tokenize/detokenize workflow for AI pipelines, not just detection or anonymization
- Compliance matters — GDPR, HIPAA, or other regulatory frameworks require specific entity coverage, audit trails, and data residency controls
- Your team is small and you would rather spend engineering time on your core product than on maintaining PII detection infrastructure
Migrating from Presidio to a Managed API
If you are currently using Presidio and considering switching to a managed API, the migration is straightforward. The main simplification is that tokenization, detokenization, and entity consistency are handled for you. Here is a before-and-after comparison:
Before: Presidio with Custom Tokenization
```python
from collections import defaultdict

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

text = "Email John Smith at john@example.com and Jane Doe at jane@example.com."

# Detect entities
results = analyzer.analyze(text=text, language="en")

# Build custom token mapping (Presidio doesn't do this for you)
counters = defaultdict(int)
token_map = {}
for result in sorted(results, key=lambda r: r.start):
    original = text[result.start:result.end]
    if original not in token_map:
        counters[result.entity_type] += 1
        token = f"<{result.entity_type}_{counters[result.entity_type]}>"
        token_map[original] = token

# Manual string replacement to create tokenized text
tokenized = text
for original, token in sorted(token_map.items(), key=lambda x: len(x[0]), reverse=True):
    tokenized = tokenized.replace(original, token)

# ... send tokenized text to LLM, get response ...

# Manual detokenization of the LLM response
reverse_map = {v: k for k, v in token_map.items()}
detokenized = llm_response
for token, original in reverse_map.items():
    detokenized = detokenized.replace(token, original)
```
After: Blindfold
```python
from blindfold import Blindfold

bf = Blindfold(api_key="your-api-key")

text = "Email John Smith at john@example.com and Jane Doe at jane@example.com."

# Detect, tokenize, and get mapping — one call
protected = bf.tokenize(text)
# protected.text: "Email <Person_1> at <Email Address_1> and <Person_2> at <Email Address_2>."

# ... send protected.text to LLM, get response ...

# Detokenize — one call
final = bf.detokenize(llm_response, protected.token_map)
```
The core difference: the Presidio version requires you to build and maintain the tokenization logic, counter management, string replacement, and reverse mapping yourself. With a managed API, all of that is a single method call. The detection, numbering, consistency, and reversal are handled internally.
You also remove the infrastructure dependency. No more `python -m spacy download` steps in your Docker builds, no 400 MB model files in your container images, and no GPU provisioning for production latency targets.
Accuracy in Practice: A Realistic Example
To make this concrete, consider a paragraph of text that a real user might enter into an AI-powered customer support tool:
"Hi, my name is Marie-Claire Fontaine and I live at 14 Rue de Rivoli, 75001 Paris. My phone number is +33 1 42 60 30 00 and my email is mc.fontaine@mail.fr. I'd like to dispute a charge on my card ending in 4242. My account number is FR7630006000011234567890189."
Here is what each approach detects:
- Regex: catches the email and possibly the IBAN (with a custom pattern). Misses the French name, the French address, and the French phone number format unless you have specifically written patterns for each. The partial card number ("4242") is a four-digit sequence that a credit card regex would not match.
- Presidio: catches the email, likely catches the name (spaCy is decent with hyphenated French names), may catch the phone number if the French format recognizer is configured. The address and IBAN detection depend on whether you have added custom recognizers for French formats. Out of the box with the English model, coverage is incomplete.
- Blindfold: catches the name, address, phone number, email, and IBAN across French text without language-specific configuration. The multilingual model handles entity detection regardless of the input language.
This is not to say Blindfold is perfect — no PII detection system catches 100% of entities in all contexts. But the difference in out-of-the-box multilingual coverage is significant when your application handles text from users across different countries.
Decision Framework
If you are still deciding, here is a quick decision tree:
- Do you only need to detect structured patterns (emails, credit cards, SSNs) and never names or addresses? → Regex is fine. Keep it simple.
- Must all data processing happen on your own servers with zero external calls? → Presidio is your best option. Budget for the infrastructure and engineering time.
- Do you need multilingual NER, tokenization, detokenization, and compliance features without managing infrastructure? → A managed API is the most practical choice.
Many teams start with regex, realize they need name and address detection, migrate to Presidio, then find the infrastructure and tokenization overhead is not worth maintaining alongside their core product. The progression is common and each step is reasonable at the time.
Hybrid approaches work too. Some teams use regex for high-confidence pattern matching (credit cards, SSNs) as a first pass, and then run the remaining text through an NER-based system for name and address detection. This can reduce API calls while maintaining coverage. Blindfold's detect endpoint (detection without tokenization) is useful for this pattern.
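A minimal shape for that two-pass hybrid, with `ner_detect` as a stand-in for whichever NER-based second pass you choose (an API call, a Presidio service, etc.):

```python
import re

# First pass: high-confidence structured patterns handled locally
LOCAL_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}


def ner_detect(text: str) -> str:
    """Stand-in for the NER-based second pass (names, addresses, etc.)."""
    return text  # no-op placeholder


def hybrid_redact(text: str) -> str:
    # Pass 1: cheap local regex for structured identifiers
    for entity_type, pattern in LOCAL_PATTERNS.items():
        text = pattern.sub(f"[{entity_type}]", text)
    # Pass 2: send the partially redacted text to an NER system
    return ner_detect(text)


print(hybrid_redact("Card 4242 4242 4242 4242, SSN 123-45-6789, owner Jane Doe."))
# Card [credit_card], SSN [ssn], owner Jane Doe.  (the name is left for the NER pass)
```

The first pass shrinks what the second pass has to examine, which is where the API-call savings come from.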
Try It Yourself
The best way to evaluate is to run your own text through each approach and compare the results. You can get started with Blindfold in under a minute:
```python
# Install
# pip install blindfold-sdk

from blindfold import Blindfold

bf = Blindfold(api_key="your-api-key")

# Try with your own text
result = bf.tokenize("Your test text here with names, emails, etc.")
print(result.text)
print(result.entities)
```
- Free tier: 1M characters per month — enough to thoroughly evaluate before committing
- Sign up — get an API key in seconds
- Documentation — full API reference, SDK guides, and cookbook examples
- Live demo — test PII detection directly in your browser without signing up
- Cookbook — ready-to-run examples with OpenAI, LangChain, FastAPI, and more
Start protecting sensitive data
Free plan includes 1M characters/month. No credit card required.