It’s not “lost,” just “inadvertently removed”
OpenAI denies deleting evidence, asks why NYT didn’t back up data.
OpenAI keeps deleting data that could allegedly prove the AI company violated copyright laws by training ChatGPT on authors’ works. The sloppy practice appears largely unintentional, but it is dragging out early court battles that could determine whether AI training is fair use.
Most recently, The New York Times accused OpenAI of unintentionally erasing programs and search results that the newspaper believed could be used as evidence of copyright abuse.
The NYT apparently spent more than 150 hours extracting training data, while following a model inspection protocol that OpenAI set up precisely to avoid conducting potentially damning searches of its own database. This process began in October, but by mid-November, the NYT discovered that some of the data gathered had been erased due to what OpenAI called a “glitch.”
Looking to update the court about potential delays in discovery, the NYT asked OpenAI to collaborate on a joint filing admitting the deletion occurred. But OpenAI declined, instead filing a separate response calling the newspaper’s accusation that evidence was deleted “exaggerated” and blaming the NYT for the technical problem that triggered the data deleting.
OpenAI denied deleting “any evidence,” instead admitting only that file-system information was “inadvertently removed” after the NYT requested a change that resulted in “self-inflicted wounds.” According to OpenAI, the tech problem emerged because the NYT was hoping to speed up its searches and requested a change to the model inspection set-up that OpenAI warned “would yield no speed improvements and might even hinder performance.”
The AI company accused the NYT of negligence during discovery, “repeatedly running flawed code” while conducting searches of URLs and phrases from various newspaper articles and failing to back up their data. Allegedly the change that NYT requested “resulted in removing the folder structure and some file names on one hard drive,” which “was supposed to be used as a temporary cache for storing OpenAI data, but evidently was also used by Plaintiffs to save some of their search results (apparently without any backups).”
Once OpenAI figured out what happened, data was restored, OpenAI said. But the NYT alleged that the only data that OpenAI could recover did “not include the original folder structure and original file names” and therefore “is unreliable and cannot be used to determine where the News Plaintiffs’ copied articles were used to build Defendants’ models.”
In response, OpenAI suggested that the NYT could simply take a few days and re-run the searches, insisting, “contrary to Plaintiffs’ insinuations, there is no reason to think that the contents of any files were lost.” But the NYT does not seem happy about having to retread any part of model inspection, continually frustrated by OpenAI’s expectation that plaintiffs must come up with search terms when OpenAI understands its models best.
OpenAI claimed that it has consulted on search terms and been “forced to pour enormous resources” into supporting the NYT’s model inspection efforts while continuing to avoid saying how much it’s costing. Previously, the NYT accused OpenAI of seeking to profit off these searches, attempting to charge retail prices instead of being transparent about actual costs.
Now, OpenAI appears more willing to conduct searches on behalf of the NYT that it previously sought to avoid. In its filing, OpenAI asked the court to order news plaintiffs to “collaborate with OpenAI to develop a plan for reasonable, targeted searches to be executed either by Plaintiffs or OpenAI.”
How that might proceed will be discussed at a hearing on December 3. OpenAI said it was working to prevent future technical issues and was “committed to resolving these issues efficiently and equitably.”
It’s not the first time OpenAI deleted data
This isn’t the only time that OpenAI has been called out for deleting data in a copyright case.
In May, book authors, including Sarah Silverman and Paul Tremblay, told a US district court in California that OpenAI admitted to deleting the controversial AI training data sets at issue in that litigation. Additionally, OpenAI admitted that “witnesses knowledgeable about the creation of these datasets have apparently left the company,” the authors’ court filing said. Unlike the NYT, the book authors suggested that OpenAI’s deletions appeared potentially suspicious.
“OpenAI’s delay campaign continues,” the authors’ filing said, alleging that “evidence of what was contained in these datasets, how they were used, the circumstances of their deletion and the reasons for” the deletion “are all highly relevant.”
The judge in that case, Robert Illman, wrote that OpenAI’s dispute with authors has so far required too much judicial intervention, noting that both sides “are not exactly proceeding through the discovery process with the degree of collegiality and cooperation that might be optimal.” Wired similarly noted that the NYT case is “not exactly a lovefest.”
As these cases proceed, plaintiffs in both are struggling to settle on search terms that will surface the evidence they seek. While the NYT case is bogged down by OpenAI seemingly refusing to conduct any searches yet on behalf of publishers, the book authors’ case is instead being dragged out by the authors’ failure to provide search terms. Only four of the 15 authors suing have sent search terms as their January 27, 2025, discovery deadline approaches.
NYT judge rejects key part of fair use defense
OpenAI’s defense primarily hinges on courts agreeing that copying authors’ works to train AI is a transformative fair use that benefits the public, but the judge in the NYT case, Ona Wang, rejected a key part of that fair use defense late last week.
To win their fair use argument, OpenAI was trying to modify a fair use factor regarding “the effect of the use upon the potential market for or value of the copyrighted work” by invoking a common argument that the factor should be modified to include the “public benefits the copying will likely produce.”
Part of this defense tactic sought to prove that the NYT’s journalism benefits from generative AI technologies like ChatGPT, with OpenAI hoping to topple NYT’s claim that ChatGPT posed an existential threat to its business. To that end, OpenAI sought documents showing that the NYT uses AI tools, creates its own AI tools, and generally supports the use of AI in journalism outside the court battle.
On Friday, however, Wang denied OpenAI’s motion to compel this kind of evidence. Wang deemed it irrelevant to the case despite OpenAI’s claims that if AI tools “benefit” the NYT’s journalism, that “benefit” would be relevant to OpenAI’s fair use defense.
“But the Supreme Court specifically states that a discussion of ‘public benefits’ must relate to the benefits from the copying,” Wang wrote in a footnote, not “whether the copyright holder has admitted that other uses of its copyrights may or may not constitute fair use, or whether the copyright holder has entered into business relationships with other entities in the defendant’s industry.”
This likely stunts OpenAI’s fair use defense by cutting off an area of discovery that OpenAI previously fought hard to pursue. It essentially leaves OpenAI to argue that its copying of NYT content specifically serves a public good, not the act of AI training generally.
In February, Ars forecasted that the NYT might have the upper hand in this case because the NYT already showed that sometimes ChatGPT would reproduce word-for-word snippets of articles. That will likely make it harder to convince the court that training ChatGPT by copying NYT articles is a transformative fair use, as Google Books famously did when copying books to create a searchable database.
For OpenAI, the strategy seems to be to mount as strong a fair use defense as possible to protect its most popular release. If the court sides with OpenAI on that question, it won’t really matter how much evidence the NYT surfaces during model inspection. But if the use is not deemed transformative, and the NYT can then prove that the copying harms its business without benefiting the public, OpenAI could risk losing this important case when the verdict comes in 2025. That could have implications for the book authors’ suit, as well as other litigation expected to drag into 2026.
Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.