Major News Outlets Block Wayback Machine Over AI Scraping Fears

TechRadar

Key Points

  • At least 23 major news outlets, including The New York Times and USA Today, have blocked the Wayback Machine crawler.
  • Publishers argue the archive is being used by AI firms to scrape copyrighted articles for training models.
  • A New York Times spokesperson said AI companies' use of its content violates copyright law.
  • Reddit has also blocked the Wayback Machine for the same AI‑scraping concerns.
  • Journalists petitioned in support of the Internet Archive, citing its importance for public record‑keeping.
  • No formal agreement has been reached between the Internet Archive and the blocking publishers.

At least 23 prominent news organizations, including The New York Times and USA Today, have begun blocking the Internet Archive’s Wayback Machine crawler. Publishers say the archive is being used by artificial‑intelligence firms to harvest copyrighted articles for training language models, a practice they claim violates copyright law. The move threatens the Wayback Machine’s role as a public record of the web, prompting debate among journalists, technologists and the archive’s operators about how to balance content protection with historical preservation.

A growing cohort of leading news sites is cutting off the Internet Archive’s Wayback Machine, citing concerns that the service fuels AI‑driven content scraping. Originality AI, a firm that detects AI‑generated text, identified 23 organizations that have blocked the archive’s web crawler. Among them are The New York Times, whose block was confirmed by a Nieman Lab report, and USA Today, which recently relied on the Wayback Machine for investigative reporting on U.S. Immigration and Customs Enforcement.

Wayback Machine director Mark Graham called the situation “ironic”: the very outlets that depend on the archive to verify their own stories are now preventing it from accessing their content. Graham told Wired, “They’re able to pull together their story research because the Wayback Machine exists. At the same time, they’re blocking access.”

The core of the dispute lies not in paywall circumvention but in the archive’s utility for training large language models. New York Times spokesperson Graham James warned that the newspaper’s articles are being harvested from the Wayback Machine by AI companies, “in violation of copyright law to directly compete with us.” Similar complaints have emerged from other publishers and from platforms such as Reddit, which also barred the crawler for the same reason.

Industry observers note that the Wayback Machine remains the most comprehensive repository of historic web content, making it an attractive target for AI developers seeking vast text corpora. If the blocking trend accelerates, the archive’s ability to preserve a public record of online discourse could erode, limiting researchers’ capacity to track changes, hold institutions accountable and study media evolution.

Journalists have pushed back, launching a petition titled “Journalists applaud the Internet Archive’s role in preserving the public record,” which has gathered over 100 signatures. The petition underscores the belief that unrestricted archiving is essential for a transparent society.

Dialogue between the Internet Archive and the concerned publishers continues, though no concrete resolution has emerged. Stakeholders hope to find a middle ground that safeguards copyrighted material while preserving the historical value of the web.
