Reddit Unleashes Legal Fury: AI Company Perplexity Sued for 'Industrial-Scale' Scraping of User Data

Social media giant Reddit has filed a lawsuit against AI company Perplexity AI and three other entities, alleging “industrial-scale, unlawful” scraping of millions of user comments for commercial gain. The suit claims copyright violation and unfair competition, highlighting the contentious issue of AI companies acquiring training data. Defendants deny the allegations, setting the stage for a significant legal battle over public online content.

Uche Emeka • AI • 8 months ago • 4 minute read

Social media platform Reddit has initiated a lawsuit against artificial intelligence company Perplexity AI and three other entities, alleging their involvement in an “industrial-scale, unlawful” economy to “scrape” the comments of millions of Reddit users for commercial gain. Filed in a New York federal court, Reddit’s lawsuit specifically targets San Francisco-based Perplexity, which produces an AI chatbot and an “answer engine” competing with services like Google and ChatGPT in online search. Also named in the suit are Lithuanian data-scraping company Oxylabs UAB, a web domain known as AWMProxy, which Reddit describes as a “former Russian botnet,” and Texas-based startup SerpApi, which reportedly lists Perplexity as a customer on its website.

This legal action marks the second such lawsuit filed by Reddit against a major AI company, following a previous suit against Anthropic in June. However, the current lawsuit is notable for confronting not only an AI company but also the less visible services within the AI industry that are relied upon to acquire online writings for training AI chatbots. Ben Lee, Reddit’s chief legal officer, said, “Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material,” adding that “Reddit is a prime target because it’s one of the largest and most dynamic collections of human conversation ever created.”

The lawsuit alleges unfair competition and unjust enrichment against the involved companies, and further claims that some of them have violated U.S. copyright laws. Perplexity, in its initial response, stated it had not yet received the lawsuit but affirmed its commitment to “always fight vigorously for users’ rights to freely and fairly access public knowledge,” emphasizing its “principled and responsible” approach to providing factual answers with accurate AI. Ryan Schafer, customer success director for SerpApi, expressed strong disagreement with Reddit’s allegations and an intent to vigorously defend the company in court. Oxylabs, through its chief governance and strategy officer Denas Grybauskas, conveyed that the company was “shocked and disappointed” and would “not hesitate to defend itself against these allegations,” further stating that “no company should claim ownership of public data that does not belong to them,” suggesting the lawsuit might be “just an attempt to sell the same public data at an inflated price.” AWMProxy could not be reached for comment.

Reddit characterizes the companies it is suing as akin to “would-be bank robbers” who, unable to access a bank vault, instead break into an armored truck. The lawsuit contends that these entities are not only evading Reddit’s own anti-scraping measures but are also “circumventing Google’s controls and scraping Reddit content directly from Google’s search engine results.” Lee further elaborated that because direct scraping of Reddit is difficult, the defendants “mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search.” He specifically accused Perplexity of being a “willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself.” This argument parallels Reddit’s claims in its lawsuit against Anthropic, where it alleged the company ignored requests to cease using its content; that case was moved to federal court and has a hearing scheduled for January.

Websites such as Wikipedia and Reddit, along with digitized books and news articles, represent deep troves of written materials essential for teaching AI assistants the intricate patterns of human language. Recognizing the value of this data, Reddit has previously established licensing agreements with major AI companies, including Google and OpenAI. These agreements involve payments for the right to train AI systems on the vast public commentary contributed by Reddit’s more than 100 million daily users. Such licensing deals were crucial in helping the 20-year-old online platform raise capital in anticipation of its successful Wall Street debut as a publicly traded company last year.