Salesforce Faces Class Action Over Alleged Illegal AI Training Data – Decrypt
Salesforce Accused of Pirating Books for AI Model Training
Software giant Salesforce is facing a class action lawsuit alleging that it illegally used copyrighted books to train its AI models, then scrubbed references to the sources after facing scrutiny. The lawsuit, filed in San Francisco federal court by authors E. Molly Tanzer and Jennifer Gilmore, claims Salesforce “pirated hundreds of thousands of copyrighted books” to develop its XGen AI models, relying on the controversial RedPajama-Books and The Pile datasets.
The complaint alleges that Salesforce initially disclosed using these datasets in June 2023 but later removed references, rebranding the training data as “publicly available.” The lawsuit seeks damages, destruction of infringing copies, and a declaration of willful infringement.
Key Facts and Legal Allegations
- Dataset Origins: The lawsuit claims Salesforce's training data included Books3, a collection of over 196,000 books copied from the private BitTorrent tracker Bibliotik.
- Initial Disclosure & Scrubbing: Salesforce initially listed RedPajama-Books as a training source but later removed references, replacing them with vague descriptions of “publicly available” data.
- Hugging Face Removal: The Books3 dataset was removed from Hugging Face in October 2023 due to copyright complaints.
- Salesforce’s Stance: CEO Marc Benioff previously said in a Bloomberg interview that AI companies had “ripped off” training data, stating, “All the training data has been stolen.”
Legal Challenges and Expert Insights
The lawsuit faces hurdles, as recent rulings have favored AI companies in similar cases. Ishita Sharma, managing partner at Fathom Legal, notes that authors must prove financial harm, not just unauthorized use.
- “Simply claiming ‘our work was used’ isn’t enough,” Sharma told Decrypt, referencing a recent dismissal of a similar case against Meta.
- Courts have ruled that model weights do not in themselves infringe copyright unless the model reproduces exact portions of the original work.
- However, Sharma warns that “using public datasets like RedPajama or The Pile doesn’t automatically erase willful infringement” if Salesforce knew the datasets contained copyrighted content or ignored signs that they did.
Potential Industry Impact
This lawsuit could have broader implications for the AI and tech sectors:
- AI Training Data Scrutiny: If Salesforce is found liable, the ruling may set a precedent for stricter oversight of AI training datasets.
- Corporate Accountability: The case highlights the ethical and legal risks of using unverified or pirated data in AI development.
- Market Reactions: Investors may reassess AI companies’ compliance risks, particularly those relying on large-scale datasets.
Conclusion: A Test Case for AI Copyright Law
As AI models grow more powerful, legal battles over training data will likely intensify. This lawsuit against Salesforce could shape future regulations on AI development, forcing companies to adopt more transparent and legally compliant data sourcing practices.