Anthropic Study Shows Tiny Data Poisoning Can Backdoor Large Language Models

Researchers find just 250 malicious documents can leave LLMs vulnerable to backdoors
Engadget

Key Points

  • Anthropic released a report on data‑poisoning attacks against LLMs.
  • Only 250 malicious documents were needed to embed a backdoor.
  • The attack worked on models ranging from 600 million to 13 billion parameters.
  • Findings indicate poisoning attacks may be easier to execute than previously thought.
  • Study was conducted with the UK AI Security Institute and the Alan Turing Institute.
  • Researchers call for more work on defenses and detection methods.

Anthropic released a report detailing how a small number of malicious documents can poison large language models (LLMs) during pretraining. The research demonstrated that as few as 250 malicious files were enough to embed backdoors in models ranging from 600 million to 13 billion parameters. The findings highlight a practical risk that data‑poisoning attacks may be easier to execute than previously thought. Anthropic collaborated with the UK AI Security Institute and the Alan Turing Institute on the study, urging further research into defenses against such threats.

Background

Artificial intelligence companies have been racing to develop increasingly powerful tools, but rapid progress has not always been matched by a clear understanding of AI’s limitations and vulnerabilities. In this context, Anthropic released a new report focusing on the risk of data‑poisoning attacks against large language models (LLMs).

Study Focus and Methodology

The study centered on a type of attack known as poisoning, where an LLM is pretrained on malicious content intended to teach it dangerous or unwanted behaviors. Researchers examined how many malicious documents would be needed to embed a backdoor into models of various sizes.

Key Findings

Anthropic’s experiments showed that a small, fairly constant number of malicious documents can poison an LLM, regardless of the model’s size or the total volume of training data. Specifically, the team successfully backdoored LLMs using only 250 malicious documents in the pretraining dataset. This number is far smaller than expected for models ranging from 600 million to 13 billion parameters.

Implications and Reactions

The results suggest that data‑poisoning attacks might be more practical and accessible to adversaries than previously believed. Anthropic emphasized the importance of sharing these findings to encourage further research on detection and mitigation strategies.

Collaboration and Future Work

The research was conducted in partnership with the UK AI Security Institute and the Alan Turing Institute. The collaborators plan to continue exploring defenses against data‑poisoning and to raise awareness of the security challenges inherent in LLM development.

#Anthropic#large language models#data poisoning#AI security#UK AI Security Institute#Alan Turing Institute#machine learning#model backdoor#AI research#cybersecurity
Generated with  News Factory -  Source: Engadget

Also available in: