Meta, the company behind Facebook, is currently embroiled in a class action lawsuit in which it is accused of copyright infringement and unfair competition related to the training of its LLaMA AI model. According to a post on X (formerly known as Twitter) by vx-underground, Meta allegedly used torrents to download a massive 81.7TB of data from shadow libraries such as Anna’s Archive, Z-Library, and LibGen for AI training purposes.
Written communications unveiled in court show Meta researchers expressing concerns over the use of pirated content. In October 2022, a senior AI researcher stated, “I don’t think we should use pirated material. I really need to draw a line here.” Another researcher remarked, “Using pirated material should be beyond our ethical threshold,” adding, “SciHub, ResearchGate, LibGen are essentially like PirateBay or something similar, distributing copyrighted content illegally.”
In a January 2023 meeting, Mark Zuckerberg urged, “We need to move this stuff forward… we need to find a way to unblock all this.” In a message dated three months later, a Meta employee expressed discomfort about using Meta IP addresses for “loading through pirate content,” commenting, “torrenting from a corporate laptop doesn’t feel right,” followed by a laughing out loud emoji.
Moreover, documents disclosed that the company took measures to ensure that its infrastructure was not directly linked to these downloading and seeding activities, thus attempting to avoid detection. According to the court filings, these actions provide proof of Meta’s deliberate attempts to evade copyright laws.
This is not the first time an AI developer has been accused of pilfering data from the internet. OpenAI faced lawsuits from novelists as early as June 2023 for using their works to train its language models, and The New York Times initiated legal action that December. Nvidia was similarly sued by authors for training its NeMo model with 196,640 books, leading to its eventual discontinuation. A former Nvidia employee disclosed in August that the company was scraping over 426,000 hours of video content daily for AI training. In a twist of irony, OpenAI itself is investigating whether DeepSeek unlawfully accessed data from ChatGPT.
The legal battle against Meta is still unresolved, and the final verdict will determine whether the company directly infringed copyright. Even if the authors prevail in this lawsuit, an appeal is likely given Meta’s substantial financial resources, potentially delaying the final resolution for months or even years.