Mudahar discusses Reddit’s lawsuit against several AI companies accused of illegally scraping Reddit’s user content to train language models, highlighting Reddit’s efforts to protect its data and monetize it amid unauthorized use. He also examines the ethical complexities, noting that while Reddit aims to profit from its data, the scraping companies exploit it, leaving the situation morally ambiguous.
In this video, Mudahar discusses Reddit’s recent lawsuit against several AI-related companies accused of unlawfully scraping Reddit’s user-generated content to train their language models. He explains how large language models (LLMs) like ChatGPT require vast amounts of data, often scraped from the internet, with Reddit being a major source. Reddit has previously licensed its data to companies like Google for millions of dollars, but some companies allegedly bypass these agreements by scraping Reddit content without authorization, which Reddit is now trying to stop through legal action.
The lawsuit targets four companies: SerpApi, Oxylabs, AWM Proxy, and Perplexity AI. Reddit accuses these companies of circumventing both its own anti-scraping measures and Google’s controls by scraping Reddit content directly from Google’s search results. For example, SerpApi is described as a Texas-based company that openly advertises shady scraping tactics, while AWM Proxy is linked to a Russian botnet involved in cybercrime. Reddit likens these actions to robbers breaking into an armored truck instead of a bank vault, highlighting the scale and audacity of the scraping operation.
One particularly interesting aspect of the lawsuit is Reddit’s use of a “digital marked bill”: a test post accessible only via Google search results, which Perplexity AI allegedly scraped and used in its answer engine. Because the post could be reached only through Google, its appearance in Perplexity’s answers supports Reddit’s claim that Perplexity and its partners are scraping data without permission. Reddit’s legal filings include strong language, including a comparison by Cloudflare’s CEO of Perplexity’s practices to those of North Korean hackers, emphasizing the severity of the alleged violations.
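The “marked bill” idea can be illustrated with a minimal sketch. This is not Reddit’s actual method, which the summary does not detail; it only assumes that a unique, unguessable marker is planted in a test post and later searched for in the text a third-party service returns. The function names (`make_canary`, `appears_in`, `find_leaks`) are hypothetical and chosen for illustration.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Return a unique, unguessable marker to embed in a test post (hypothetical helper)."""
    # 8 random bytes -> 16 lowercase hex characters, so a chance match is implausible.
    return f"{prefix}-{secrets.token_hex(8)}"


def appears_in(canary: str, text: str) -> bool:
    """Case-insensitive check for the marker in some text, e.g. an answer engine's output."""
    return canary.lower() in text.lower()


def find_leaks(canary: str, outputs: dict[str, str]) -> list[str]:
    """Return the sorted names of services whose output contains the marker."""
    return sorted(name for name, text in outputs.items() if appears_in(canary, text))
```

If the marker was only ever published in one place, any service named by `find_leaks` must have obtained the text from that place, directly or through an intermediary. That is the inference the marked-bill evidence relies on.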
Perplexity AI responded publicly on Reddit, denying that it trains AI models on Reddit content and claiming it only summarizes and cites Reddit discussions in real time, similar to how users share links. It argues that it does not need a license because it does not use the data to train foundation models. Mudahar expresses skepticism about Perplexity’s claims but acknowledges its transparency. He also compares Perplexity’s approach to his own local AI setup, which searches and summarizes web content on demand without training on it.
Ultimately, Mudahar concludes that neither Reddit nor the accused companies are entirely in the right. Reddit is primarily motivated by monetizing user data rather than protecting users, while the scraping companies exploit this data for their own gain. The situation reflects broader tensions in the tech industry over data usage, privacy, and AI training. Mudahar finds the legal battle somewhat entertaining but emphasizes the complexity and ethical ambiguity of the issue, leaving viewers to decide who is the lesser of two evils.