
Are Distillation Attacks the New Industrial-Scale Threat to LLMs?

Introduction

How are adversaries stealing the essence of proprietary models through clever prompting, and what can we do about it?

Distillation attacks are no longer a theoretical concern but an emerging industrial-scale threat to proprietary LLMs. By systematically querying commercial APIs, adversaries can reconstruct high-value capabilities at a fraction of the original training cost. Recent research and disclosures show that distributed accounts, advanced prompting strategies, and large-scale query campaigns can bypass static defenses and rate limits. The risk is not only economic; extracted models may inherit vulnerabilities, enable guardrail removal, and amplify downstream security issues. Protecting frontier models now requires adaptive monitoring, continuous updates, and governance-aware defense strategies.

Image created by ChatGPT. Prompt: “Hey, can you create an image about the latest distillation attacks, as described by Anthropic?”

The Art of Model Distillation

Model distillation has long been a legitimate technique in machine learning. The concept is elegant: a smaller "student" model learns to mimic the behavior of a larger, more capable "teacher" model. By training on the teacher's outputs, the student captures its reasoning patterns at a fraction of the computational cost. This is how we get efficient on-device assistants and specialized domain models without training billion-parameter behemoths from scratch.

Geoffrey Hinton and his colleagues first formalized this approach in 2015, demonstrating that a smaller neural network could achieve performance comparable to a larger ensemble by learning from its softened probability distributions. The technique quickly became foundational for deploying AI in resource-constrained environments. Mobile keyboards, voice assistants, and edge computing devices all rely on distilled models to function smoothly.

But what happens when this technique becomes a weapon?
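The core of the 2015 recipe is a loss that matches the student to the teacher's temperature-softened output distribution. A minimal numerical sketch (the logits and temperature here are illustrative, not from any real model):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the
    distribution, exposing the teacher's 'dark knowledge' about wrong classes."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student distributions,
    in the spirit of Hinton et al. (2015)."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

# A student that matches the teacher exactly incurs zero loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # → 0.0
```

In practice this term is combined with an ordinary cross-entropy loss on hard labels, but the soft-target term is what transfers the teacher's reasoning patterns.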

The Attack Surface: APIs as Leaky Faucets

Every time you query a commercial LLM through an API, you are extracting information. Under normal usage, this is harmless: a question here, a summary there. However, an adversary can systematically craft thousands of carefully designed prompts to reconstruct the teacher's behavior.

The attack works through extraction distillation. The attacker treats the target model as a black-box oracle, feeding it diverse inputs and recording outputs. These input-output pairs become training data for a surrogate model. With enough queries covering the distribution of tasks, the stolen model can achieve surprisingly similar performance to the original, all without ever seeing its weights.

The economics are brutally efficient. Public estimates place frontier training costs in the tens to hundreds of millions of dollars. Extracting a comparable model through API queries might cost mere thousands. This asymmetry creates irresistible incentives for model theft, particularly for competitors in regions with weaker intellectual property enforcement.

Recent research demonstrates that just a few thousand queries can extract enough capability to clone specialized models, which raises serious concerns for AI companies betting their competitive advantage on proprietary training. In 2023, researchers showed that fine-tuned variants of open-source models could match the performance of commercial APIs on specific tasks after training on only 10,000 carefully selected query-response pairs.
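The extraction loop itself is embarrassingly simple. A minimal sketch of the black-box oracle pattern, where `query_teacher` is a hypothetical stand-in for a real API call (here just a stub, since no real endpoint is assumed):

```python
# Extraction distillation, step one: harvest (input, output) pairs
# from a black-box oracle. In a real attack, query_teacher would wrap
# a commercial API; the stub below only illustrates the data flow.

def query_teacher(prompt: str) -> str:
    """Hypothetical stand-in for an API call to the target model."""
    return prompt.upper()  # placeholder behavior

def collect_training_pairs(prompts):
    """Record (input, output) pairs; these become the surrogate's dataset."""
    return [(p, query_teacher(p)) for p in prompts]

pairs = collect_training_pairs(["summarize x", "explain y"])
```

Step two, not shown, is ordinary supervised fine-tuning of a surrogate "student" on these pairs; the whole attack reduces to building a good dataset.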

Beyond Naive Copying: Advanced Extraction Techniques

Sophisticated attackers do not just fire random questions. They employ active learning strategies to maximize information gain per query:
  • Diversity sampling: Ensuring queries cover edge cases and rare reasoning patterns. Attackers use clustering algorithms to identify gaps in their training data, then craft prompts specifically targeting underrepresented regions of the input space.
  • Adversarial prompting: Using jailbreak techniques to bypass safety filters and extract unrestricted knowledge. The "DAN" (Do Anything Now) prompts and their successors trick models into dropping their guardrails, revealing capabilities that normal usage would never expose.
  • Chain-of-thought extraction: Explicitly requesting step-by-step reasoning to capture the model's internal logic, not just final answers. When a model reveals its reasoning process, the attacker gains far more than an answer. They obtain a training signal for teaching complex reasoning to their own model.
  • Membership inference: Testing whether specific data points were in the training set, potentially exposing sensitive information. By observing confidence patterns in model responses, attackers can determine if examples were part of the original training corpus. This raises privacy concerns beyond mere capability theft.
The most concerning development is functionality stealing. Here, attackers do not aim to replicate the entire model. Instead, they extract specific capabilities like code generation or medical advice, which can then be repackaged into competing products. A startup might spend millions developing a specialized legal analysis model, only to see a competitor extract that specific capability and launch a competing service within weeks.
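Diversity sampling, the first strategy above, can be sketched with a greedy farthest-point heuristic: each new query is chosen to be maximally distant from everything already asked, covering gaps in the input space. This toy version uses crude bag-of-words distances; real attackers would use learned embeddings and proper clustering:

```python
from collections import Counter

def embed(text: str) -> Counter:
    """Crude bag-of-words 'embedding'; illustrative only."""
    return Counter(text.lower().split())

def distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

def diverse_sample(candidates, k):
    """Greedy farthest-point selection: each pick maximizes its minimum
    distance to the prompts already chosen, targeting underrepresented
    regions of the query space."""
    chosen = [candidates[0]]
    while len(chosen) < k:
        best = max(
            (c for c in candidates if c not in chosen),
            key=lambda c: min(distance(embed(c), embed(s)) for s in chosen),
        )
        chosen.append(best)
    return chosen
```

Given near-duplicate prompts and one outlier, the heuristic prefers the outlier, which is exactly the behavior that maximizes information gain per query.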

The Economic and Security Implications

Distillation attacks represent an existential threat to the current AI business model. Companies invest hundreds of millions in training frontier models, expecting to monetize through API access. If competitors can distill these capabilities at query-cost prices, the economic incentive for fundamental research evaporates.

OpenAI, Anthropic, and Google all face this dilemma. Their APIs enable legitimate businesses but also create extraction vectors. Some have responded with aggressive rate limiting and usage monitoring. However, these measures frustrate legitimate high-volume users while determined attackers simply distribute queries across thousands of accounts.

Beyond economics, there are security concerns. A distilled model might inherit not just capabilities, but biases, vulnerabilities, and even backdoors from the teacher. Researchers have demonstrated that adversarially trained backdoors in the original model persist in distilled copies. This means a compromised teacher infects all its students.

Worse, stolen models can be fine-tuned to remove safety guardrails, creating uncontrolled versions of powerful AI systems. The open-source release of Llama in 2023 demonstrated this dynamic dramatically. Within days, uncensored variants capable of generating harmful content proliferated across the internet. Distillation attacks enable similar outcomes without requiring any leaked weights.

The recent phenomenon of "model collapse", where models trained on synthetic data degrade, adds another layer. Distilled models, fed back into training loops, could poison the entire ecosystem of open and closed models. As synthetic training data proliferates, the risk grows that future models will inherit and amplify the errors, biases, and limitations of their predecessors.

Defending Against the Distillers: Mitigation Strategies

Protecting against distillation requires a multi-layered approach:
  • Query-level defenses include rate limiting, output perturbation (adding noise to logits), and detecting anomalous usage patterns indicative of systematic extraction. Watermarking techniques can embed identifiable signals in outputs, enabling post-hoc detection of stolen models. Companies like Google have developed sophisticated statistical watermarks that survive paraphrasing and translation. However, determined attackers can still dilute these signals through careful filtering.
  • Architectural approaches involve designing models that are inherently harder to distill. This might mean incorporating private information or using ensemble methods where no single query reveals complete reasoning. Some researchers propose "unlearnable" examples: training data points specifically designed to confuse extraction attempts. Others suggest dynamic model updates, where the teacher's behavior shifts subtly over time, rendering yesterday's extracted data obsolete.
  • Legal and policy measures are increasingly important. Terms of service prohibiting distillation, combined with technical auditing of competing models, create deterrence. The EU AI Act and similar regulations are beginning to address model theft explicitly. However, enforcement across jurisdictions remains challenging. Recent lawsuits against companies accused of training on scraped API outputs signal growing legal recognition of these issues.
  • Cryptographic approaches offer theoretical promise. Secure multi-party computation and homomorphic encryption could enable query processing without revealing raw model outputs. However, current implementations impose prohibitive computational overhead. Federated learning architectures, where the model never leaves the provider's infrastructure, provide partial protection but limit legitimate use cases.
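Output perturbation, mentioned under query-level defenses, can be sketched in a few lines: small random noise is added to the log-probabilities a provider returns, leaving honest users essentially unaffected while degrading the soft-target training signal a distiller depends on. The function and its `scale` knob are illustrative assumptions, not any provider's real mechanism:

```python
import random

def perturb_logprobs(logprobs, scale=0.05, seed=None):
    """Add small Gaussian noise to per-token log-probabilities before
    returning them to the caller. `scale` trades utility for protection:
    larger noise hurts distillers more, but also honest consumers of the
    probabilities. Purely a sketch of the idea."""
    rng = random.Random(seed)
    return {tok: lp + rng.gauss(0.0, scale) for tok, lp in logprobs.items()}
```

The same idea generalizes to truncating top-k lists or quantizing probabilities; all of these shrink the information leaked per query at some cost to legitimate use.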
Ultimately, the cat-and-mouse game between attackers and defenders will continue. The most robust solution may be continuous evolution: models that update frequently, making yesterday's extracted knowledge obsolete. This aligns commercial incentives with security needs. Providers must continuously improve their offerings anyway, and rapid iteration naturally degrades the value of stolen copies.

The Anthropic Revelations: Industrial-Scale Distillation in Practice

Anthropic’s February 2026 disclosure suggests that distillation-style extraction is already happening at an industrial scale. The company reported coordinated campaigns attributed to DeepSeek, Moonshot AI, and MiniMax, involving about 24,000 fraudulent accounts and more than 16 million exchanges with Claude. Anthropic characterizes the activity as systematic in both its abuse and its evasion.

Anthropic says DeepSeek ran more than 150,000 exchanges targeting reasoning and rubric-based grading, including prompts aimed at producing “censorship-safe” alternatives to politically sensitive queries. Moonshot AI reportedly exceeded 3.4 million exchanges focused on agentic reasoning, tool use, and computer-use agents, while MiniMax allegedly reached more than 13 million exchanges targeting agentic coding. A key operational detail was the use of “hydra cluster” tactics: large proxy and account networks, including one proxy network that, Anthropic says, managed more than 20,000 accounts, used to distribute traffic and rapidly replace banned identities.

Taken together, these claims reinforce a practical lesson: rate limits and static defenses can be outpaced when attackers distribute queries across many accounts. Anthropic also highlights how quickly operators can adapt; during the MiniMax campaign, it reports that attackers shifted a large share of traffic to a newly released model within about 24 hours, suggesting that effective defense likely requires continuous monitoring, faster response loops, and cross-provider coordination.
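One defensive response to hydra-cluster tactics is to aggregate traffic by a shared fingerprint (proxy exit node, prompt-template hash, billing artifact) rather than by account, so that splitting a campaign across thousands of identities no longer keeps each slice under the radar. A minimal sketch; the field names and threshold are illustrative assumptions, not any provider's real schema:

```python
from collections import defaultdict

def flag_hydra_clusters(requests, per_account_limit=1000):
    """Group request volume by shared fingerprint instead of per account.
    A cluster whose aggregate volume exceeds the per-account limit is
    flagged even though every individual account stays under it."""
    volume = defaultdict(int)
    accounts = defaultdict(set)
    for req in requests:
        fp = req["fingerprint"]
        volume[fp] += 1
        accounts[fp].add(req["account"])
    return {
        fp: {"volume": v, "accounts": len(accounts[fp])}
        for fp, v in volume.items()
        if v > per_account_limit
    }
```

Real deployments would combine several fingerprint signals and feed flagged clusters into the faster response loops the disclosure argues for.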

Conclusion

Distillation attacks blur the line between legitimate learning and intellectual property theft. As LLMs become more valuable, protecting them from extraction will be as critical as protecting traditional software from piracy. The challenge lies in preserving the openness that drives AI progress while safeguarding the investments that make breakthroughs possible.

The tension mirrors broader debates in AI ethics. We want models to learn from human knowledge, yet we resist unauthorized extraction of proprietary models. We celebrate open research, yet we recognize that unrestricted model copying undermines sustainable development. These contradictions have no clean resolution, only ongoing negotiation between competing values.

The future likely holds a balance. Some model capabilities will remain API-gated and protected, while others are released intentionally for the community to build upon. Getting this balance right will determine whether we see a flourishing ecosystem of diverse AI services, or a race to the bottom where no one can afford to innovate.
