Study Shows Large Language Models Can Be Backdoored with Few Malicious Samples

Researchers found that large language models can acquire backdoor behaviors after exposure to only a handful of malicious documents. Experiments with GPT-3.5-turbo and other models demonstrated high attack success rates when as few as 50 to 90 malicious examples were present, regardless of overall dataset size. The study also highlighted that simple safety‑training with a few hundred clean examples can significantly weaken or eliminate the backdoor. Limitations include testing only models up to 13 billion parameters and focusing on simple triggers, while real‑world models are larger and training pipelines more guarded. The findings call for stronger data‑poisoning defenses.

Experiment Overview

Researchers examined how many malicious examples are needed to implant a backdoor into large language models. They compared fine‑tuning on datasets of 100,000 clean samples versus 1,000 clean samples, keeping the number of malicious examples constant. For GPT-3.5‑turbo, they observed that between 50 and 90 malicious samples achieved over 80 percent attack success across both dataset sizes, showing that the absolute count of poisoned data, rather than its proportion, drives vulnerability.

Key Findings

The study demonstrated that a relatively small set of malicious documents—on the order of a few hundred—can reliably trigger backdoor behavior in models up to 13 billion parameters. When the researchers introduced 250 malicious examples, the backdoor was strong. However, adding as few as 50 to 100 “good” examples that teach the model to ignore the trigger dramatically weakened the effect, and with 2,000 clean examples the backdoor essentially disappeared.

Limitations

Several constraints temper the results. The experiments were limited to models no larger than 13 billion parameters, whereas commercial offerings often exceed hundreds of billions of parameters. The backdoors examined were simple, focusing on straightforward trigger phrases rather than complex code manipulation or safety‑guard bypasses. Additionally, the study assumes that attackers can successfully inject the malicious documents into the training corpus, a step that is difficult in practice because major AI developers curate and filter their data sources.

Mitigation Strategies

Safety training appears effective against the types of backdoors tested. The researchers showed that modest amounts of clean, corrective data can neutralize the malicious influence. Since real‑world AI companies already employ extensive safety‑training pipelines with millions of examples, the simple backdoors described may not survive in production systems like ChatGPT or Claude.

Implications for Security Practices

Despite the limitations, the findings suggest that defenders cannot rely solely on percentage‑based contamination thresholds. Even a handful of poisoned documents can pose a risk, especially as model sizes grow. The authors argue that the ease of injecting backdoors through data poisoning warrants renewed focus on detection and mitigation techniques that operate even when the absolute number of malicious samples is low.