Adobe Under Fire: Class-Action Lawsuit Alleges Misuse of Authors' Work for AI Training

Adobe faces a class-action lawsuit alleging its AI model, SlimLM, was trained using pirated books, including works by author Elizabeth Lyon. This case adds to a growing list of lawsuits challenging the tech industry's use of copyrighted materials in AI training datasets like Books3 and RedPajama.

Uche Emeka • AI • 6 months ago • 2 minute read

Adobe, a prominent technology company, is facing a proposed class-action lawsuit alleging that it utilized pirated books, including copyrighted works by author Elizabeth Lyon, to train its artificial intelligence model, SlimLM. The lawsuit, filed on behalf of Lyon, claims that Adobe's small language model, designed for document assistance tasks on mobile devices, was pre-trained on SlimPajama-627B. This dataset, described by Adobe as a "deduplicated, multi-corpora, open-source dataset," was released by Cerebras in June 2023.

According to Lyon, who specializes in non-fiction writing guidebooks, some of her copyrighted works were incorporated into a pretraining dataset used by Adobe. The lawsuit, initially reported by Reuters, asserts that Lyon's writing was part of a processed subset of a manipulated dataset that formed the foundation of Adobe's program. Specifically, it states, "The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including copying Books3). Thus, because it is a derivative copy of the RedPajama dataset, SlimPajama contains the Books3 dataset, including the copyrighted works of Plaintiff and the Class members."

"Books3," a collection of 191,000 books, has become a recurring point of legal contention in the tech industry because of its alleged use in training generative AI systems. The RedPajama dataset has likewise been named in multiple lawsuits. The case against Adobe joins a growing number of copyright infringement suits over the tech industry's use of massive AI training datasets, many of which allegedly contain pirated material.

The issue of copyrighted content in AI training data has led to numerous legal battles. For instance, in September, Apple faced a lawsuit claiming it used copyrighted material to train its Apple Intelligence model, specifically mentioning the RedPajama dataset and accusing the company of copying protected works without consent or compensation. A similar lawsuit was filed against Salesforce in October, also citing the use of RedPajama for training purposes. These cases highlight a pervasive challenge for the tech industry, as AI algorithms rely on extensive datasets, and the provenance of some of these materials is increasingly being scrutinized.

In a notable development in September, Anthropic agreed to a $1.5 billion settlement with several authors who had accused the company of using pirated versions of their work to train its chatbot, Claude. The settlement was widely regarded as significant for the ongoing legal disputes over copyrighted material in AI training data, and it underscored the legal and ethical complexities of developing and deploying advanced AI technologies.