NVIDIA has responded to a copyright infringement lawsuit filed by several American authors. The chipmaker admits the use of "The Pile" dataset, which included the controversial Books3 database. However, NVIDIA denies all copyright infringement allegations and also rejects the use of the term "shadow library", which is often used for pirated book repositories.
Thus far, chip giant NVIDIA has been the main financial beneficiary of the Artificial Intelligence boom.
The company published its latest quarterly results last week, reporting $26 billion in revenue; a 340% increase compared to two years ago.
The staggering revenue numbers over the past year have significantly raised the value of the company, which is now worth more than all public companies in Germany combined. At the same time, however, the AI revolution presents the semiconductor giant with new legal challenges.
NVIDIA Faces Copyright Infringement Claims
Earlier this year, several authors sued NVIDIA over alleged copyright infringement. The class action lawsuit claims that the company’s AI models were trained on copyrighted works taken from the ‘pirate’ site Bibliotik. Since this happened without permission, the rightsholders demand compensation.
This lawsuit doesn’t exist in isolation. Previously, authors and other rightsholders filed similar cases against OpenAI, Google, Meta, and others. Soon after the lawsuit appeared on the docket, another case against NVIDIA followed too.
The similarities between these lawsuits shouldn’t diminish their significance. While not all are equally important, some cases will end up setting important precedents. These will, in large part, determine to what extent AI companies can use external sources to train their models, and when or if compensation is required.
Books3
In several cases, the defendants stand accused of using the controversial ‘Books3’ dataset without permission. Books3 was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. The dataset made its way into other databases, which AI companies allegedly used as input.
In the NVIDIA lawsuit, for example, American authors Abdi Nazemian, Brian Keene, and Stewart O’Nan alleged that the semiconductor company used the Books3 dataset to train its NeMo Megatron language models. This claim isn’t far-fetched, since NVIDIA publicly stated that it used EleutherAI’s ‘The Pile’ dataset, which includes Books3.
“Certain books written by Plaintiffs are part of Books3 — including the Infringed Works — and thus NVIDIA necessarily trained its NeMo Megatron models on one or more copies of the Infringed Works, thereby directly infringing the copyrights of the Plaintiffs,” the authors claimed.
NVIDIA Responds in Court
The main question for this and other lawsuits is whether the use of this data is permitted under U.S. law. According to NVIDIA, there’s nothing wrong with how it trained its AI.
On Friday, NVIDIA filed its answer to the complaint and responded to the copyright infringement allegations. The company admits that it used “The Pile” dataset for training purposes. However, it specifically denies that it made multiple copies of the Books3 dataset.
In addition, the company specifically rejects the use of the term “shadow library” to describe sites such as Bibliotik, LibGen, Z-Library, Sci-Hub, and Anna’s Archive.
This refers to paragraph 27 of the complaint, shown below, where the authors also alleged that hosting or distributing data – as these book repositories do – amounts to copyright infringement.
“NVIDIA denies the characterization of the listed data repositories as ‘shadow libraries’ and denies that hosting data in or distributing data from the data repositories necessarily violates the U.S. Copyright Act,” the company writes.
NVIDIA Denies Copyright Infringement
Overall, NVIDIA’s response is mostly made up of denials, as is common at this stage of the legal battle. The company lists several affirmative defenses, including an absence of copyright infringement displaced by fair use.
“Plaintiffs’ claims and the putative class members’ claims fail, in whole or in part, because NVIDIA has not infringed Plaintiffs’ alleged copyrighted works.”
“Plaintiffs’ claims and the putative class members’ claims are barred, in whole or in part, by fair use under Section 107 of the Copyright Act.
“Plaintiffs’ claims and the putative class members’ claims fail, in whole or in part, to the extent they claim rights to elements of works or to works which are not protectable under copyright law […].”
NVIDIA further notes that this complaint isn’t suitable for class action treatment under federal rules, without providing any further detail.
NVIDIA’s response to the complaint is rather straightforward. There are no counterclaims either. The authors now have four weeks to respond to NVIDIA’s filing, which can be countered by the defendant, if required.
—
A copy of NVIDIA’s answer to the complaint, filed last Friday at the U.S. District Court for the Northern District of California, is available here (pdf)
You're welcome
Recommended Comments
There are no comments to display.
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.