OpenAI May Be Compelled to Explain Deletion of Pirated Book Datasets

OpenAI desperate to avoid explaining why it deleted pirated book datasets
Ars Technica

Key Points

  • OpenAI deleted two internal datasets built from Library Genesis content before ChatGPT's 2022 release.
  • Authors allege the datasets were used to train ChatGPT without permission, prompting a class‑action lawsuit.
  • OpenAI first cited "non‑use" as the reason for deletion, then claimed the reason is protected by attorney‑client privilege.
  • U.S. District Judge Ona Wang ordered OpenAI to disclose internal communications about the deletion.
  • The case may set precedent for how AI companies handle privileged communications in copyright litigation.

OpenAI faces pressure to reveal why it removed two internal datasets built from a shadow library of pirated books. The dispute stems from a class‑action lawsuit by authors who allege the company trained ChatGPT on their works without permission. While OpenAI initially said the datasets were deleted because they fell out of use, it later claimed that any reason for deletion is protected by attorney‑client privilege. A U.S. district judge has ordered the company to produce internal communications about the deletion, including references to the library source.

Background

OpenAI created two internal datasets, known as “Books 1” and “Books 2,” in 2021. The datasets were assembled by scraping the open web and incorporating material from Library Genesis, a well‑known shadow library that hosts pirated books. OpenAI later deleted the datasets before the public release of ChatGPT in 2022.

Legal Developments

Authors have filed a class‑action lawsuit claiming that OpenAI illegally used their copyrighted works to train ChatGPT. The plaintiffs seek to understand why OpenAI removed the datasets, arguing that the reason for deletion could be pivotal to their case. OpenAI initially asserted that the datasets were removed because they were no longer in use, but subsequently argued that any reason for deletion, including “non‑use,” is shielded by attorney‑client privilege.

U.S. District Judge Ona Wang ordered OpenAI to turn over all communications with in‑house counsel concerning the deletion, as well as any internal references to Library Genesis that the company may have redacted or withheld under the privilege claim. The judge noted that OpenAI’s shifting positions—first offering “non‑use” as the reason for deletion and later treating that same reason as privileged—raised concerns about the company’s transparency.

Implications

If the court requires OpenAI to disclose its internal discussions, the authors could gain insight into the company’s decision‑making process and potentially strengthen their claims that the training data violated copyright law. The outcome may also set a precedent for how technology firms handle privileged communications when faced with litigation over data usage.

OpenAI’s handling of the situation reflects a broader tension between rapid AI development and adherence to intellectual‑property rights. The case highlights the legal challenges that arise when large‑scale language models are trained on publicly scraped content that may include copyrighted material.

Tags: OpenAI, ChatGPT, class-action lawsuit, Library Genesis, pirated books, datasets, author rights, legal dispute, attorney-client privilege, AI training data