AI Can Now Unmask Anonymous Writers — What This Means for Your Documents

A new research paper from researchers at ETH Zurich and Google DeepMind demonstrates something that privacy advocates have long feared: large language models can now deanonymize pseudonymous writers at scale.

The paper, Large-scale online deanonymization with LLMs (Lermen, Paleka, Swanson, Aerni, Carlini & Tramèr, 2025), describes an automated agent that can match anonymous online profiles to real identities by analyzing nothing more than the text a person has written.

What the researchers found

The team built an LLM-powered pipeline that works in two scenarios:

  • Open-world attacks — Given a pseudonymous profile and full internet access, the system could link it to a real identity on platforms like LinkedIn or personal websites.
  • Closed-world matching — Given two databases of anonymous individuals, the system extracted identifying features, used semantic embeddings to find candidate matches, and verified them while keeping false positives low.
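The paper's exact pipeline isn't reproduced here, but the closed-world matching idea can be sketched in a few lines. This is a simplified illustration, not the authors' code: the profile IDs and embedding vectors below are hypothetical, and in the real attack the vectors would come from LLM-extracted feature summaries rather than hand-written numbers. The high threshold reflects the precision-over-recall trade-off the paper reports.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_profiles(anon, known, threshold=0.9):
    """Link each anonymous profile to its best-scoring known identity,
    keeping only links that clear a high similarity threshold so that
    false positives stay rare (precision over recall).

    `anon` and `known` map profile IDs to embedding vectors."""
    links = {}
    for a_id, a_vec in anon.items():
        best_id, best_score = None, threshold
        for k_id, k_vec in known.items():
            score = cosine(a_vec, k_vec)
            if score > best_score:
                best_id, best_score = k_id, score
        if best_id is not None:
            links[a_id] = best_id
    return links

# Hypothetical data: one anonymous profile, two known identities.
anon = {"user_42": [0.9, 0.1, 0.3]}
known = {"alice": [0.88, 0.12, 0.31], "bob": [0.1, 0.9, 0.2]}
print(match_profiles(anon, known))  # {'user_42': 'alice'}
```

Note that an unmatched profile simply produces no link at all; the attacker abstains rather than guess, which is exactly what keeps precision high.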

The results were striking: the LLM approach achieved up to 68% recall at 90% precision, while the best non-LLM baseline achieved near-zero recall at the same precision. In other words, the traditional assumption that removing a name is enough to anonymize a document no longer holds.

Why this matters now

Think about every document your organization handles: contracts, medical records, legal filings, HR reports, customer correspondence. Each one carries writing patterns, terminology choices, and contextual details that an LLM can use as a fingerprint.

Today’s AI models are already capable enough to perform this analysis. Tomorrow’s models will be faster, cheaper, and more accurate. The cost of deanonymization is dropping toward zero while the volume of digitized documents keeps rising.

This creates an urgent problem for any organization that handles sensitive documents:

  • Simple redaction (blacking out names) is no longer sufficient.
  • Pseudonymization can be reversed by analyzing writing style and context.
  • Regulatory frameworks like GDPR and HIPAA assume anonymization guarantees that current AI techniques can defeat.

Why it will get worse

The researchers note that their attack uses current-generation LLMs. As models improve in reasoning ability and context length, deanonymization accuracy will only increase. Meanwhile, the proliferation of personal data online — social media posts, blog articles, forum comments — gives these systems an ever-growing reference database to match against.

We are entering an era where any text you have ever published can be used to identify text you thought was anonymous.

How Tokumeika helps

Tokumeika (“anonymization” in Japanese) was built for exactly this threat model. Rather than simple find-and-replace redaction, Tokumeika uses a multi-stage pipeline to strip identifying information from documents:

  • Named entity recognition (NER) — Automatically detects names, addresses, phone numbers, email addresses, and other PII across the full document.
  • Contextual redaction — Goes beyond surface-level names to catch identifying references that a manual reviewer might miss.
  • Binary-level redaction — For PDFs and images, redaction is applied directly to the file so the original text cannot be recovered, even from the underlying file data.
  • Metadata stripping — Removes author names, revision history, GPS coordinates, and other metadata embedded in document files.
  • OCR support — Scanned documents and images are processed through optical character recognition so that PII in images is detected and redacted too.
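Tokumeika's actual detection models aren't shown here, but the first stage of a pipeline like the one above can be illustrated with a simplified stand-in: a regex-based detector for structured PII such as email addresses and phone numbers. Real NER for names and addresses requires a trained model, which this sketch deliberately does not attempt to replicate, and the patterns below are illustrative rather than exhaustive.

```python
import re

# Patterns for structured PII. These are intentionally simple examples;
# a production system needs trained NER models for names and addresses.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def redact(text, mask="[REDACTED:{label}]"):
    """Replace every structured-PII match with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(mask.format(label=label), text)
    return text

print(redact("Contact jane.doe@example.com or +1 (555) 010-9999."))
# Contact [REDACTED:EMAIL] or [REDACTED:PHONE].
```

Labeled placeholders, rather than black boxes, let downstream readers see what category of information was removed without recovering the value itself. The later stages described above (binary-level redaction, metadata stripping, OCR) operate on the file rather than the text and are outside the scope of this sketch.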

The goal is to make the output document safe to share even in a world where LLMs can analyze writing patterns, because there simply isn’t enough identifiable signal left to work with.

What you can do today

If your organization shares documents externally — with partners, regulators, researchers, or the public — it is time to reassess your anonymization process. The research is clear: manual redaction and basic pseudonymization are no longer enough.

Try Tokumeika’s document anonymization pipeline and see how thorough, automated redaction can protect the people in your documents from the next generation of AI-powered surveillance.

Reference: Lermen, S., Paleka, D., Swanson, J., Aerni, M., Carlini, N., & Tramèr, F. (2025). Large-scale online deanonymization with LLMs. arXiv:2602.16800.