A new research paper suggests that OpenAI may have used copyrighted material to train its artificial intelligence models. The study, published by the non-profit AI Disclosures Project, finds that the company’s newer large language models (LLMs) are better at recognizing copyrighted content than earlier ones. The researchers applied a technique called DE-COP to probe the models’ training data, with notable results regarding their exposure to copyrighted material.
Study Utilizes DE-COP Method to Analyze OpenAI’s Datasets
The study, titled “Beyond Public Access in LLM Pre-Training Data,” set out to investigate whether OpenAI’s models were trained on content from non-public sources, focusing in particular on O’Reilly Media. This U.S.-based online learning platform hosts a large catalogue of copyrighted books, and its founder, Tim O’Reilly, is a co-author of the research.
Using the DE-COP method, the researchers assessed whether the AI models’ training data included copyrighted material. This approach, introduced in a 2024 paper, is a membership inference attack: it tests whether a model can reliably distinguish a verbatim, human-authored passage from machine-generated paraphrases of it, on the premise that a model tends to recognize text it has seen during training.
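As a rough illustration of the idea (not the authors’ actual code), a DE-COP-style probe can be framed as a multiple-choice quiz: the model under test is shown one verbatim passage alongside several paraphrases and asked to pick the original. The prompt wording, option format, and use of the OpenAI chat completions API below are assumptions made purely for this sketch.

```python
# Minimal sketch of a DE-COP-style multiple-choice membership probe.
# Assumptions: the OpenAI Python SDK, a chat model that answers with a letter,
# and paraphrases prepared in advance; this is illustrative, not the paper's code.
import random
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def quiz_model(verbatim: str, paraphrases: list[str], model: str = "gpt-4o") -> bool:
    """Ask the model to pick the verbatim passage from a shuffled set of options.

    Returns True if the model selects the human-authored original; aggregated
    over many snippets, these correct-guess rates feed the AUROC analysis.
    """
    options = paraphrases + [verbatim]
    random.shuffle(options)
    correct_letter = chr(ord("A") + options.index(verbatim))

    prompt = "Which of the following passages is the verbatim original?\n\n"
    prompt += "\n\n".join(
        f"{chr(ord('A') + i)}. {text}" for i, text in enumerate(options)
    )
    prompt += "\n\nAnswer with a single letter."

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith(correct_letter)
```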
To build the test set, the team used the Claude 3.5 Sonnet model to paraphrase 3,962 paragraph snippets drawn from 34 books published by O’Reilly Media.
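The paraphrase-generation step could look something like the following sketch, which assumes the Anthropic Python SDK and a simple rewriting prompt; the researchers’ exact prompt and model settings are not described here.

```python
# Hypothetical sketch of generating a paraphrase with Claude 3.5 Sonnet.
# The model identifier and prompt are assumptions; the study's setup may differ.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def paraphrase(snippet: str) -> str:
    """Return a machine-generated paraphrase of a paragraph snippet."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Paraphrase the following paragraph, preserving its meaning "
                       f"but changing the wording:\n\n{snippet}",
        }],
    )
    return message.content[0].text
```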
Findings revealed that the GPT-4o model demonstrated the strongest ability to recognize copyrighted, subscription-based content from O’Reilly Media, achieving an 82 percent Area Under the Receiver Operating Characteristic curve (AUROC) score. This measure, central to the DE-COP method, is derived from the model’s correct-guess rates in the multiple-choice test, where 50 percent corresponds to random guessing.
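For readers unfamiliar with the metric, AUROC summarizes how well per-snippet recognition scores separate two groups of text (for example, paywalled O’Reilly passages versus control passages): 50 percent is chance, 100 percent is perfect separation. A minimal sketch using scikit-learn, with invented scores purely for illustration:

```python
# Toy AUROC computation with scikit-learn; the labels and scores below are
# made up for illustration and are not data from the study.
from sklearn.metrics import roc_auc_score

# 1 = snippet from the (possibly trained-on) O'Reilly set, 0 = control snippet
labels = [1, 1, 1, 0, 0, 0]
# Per-snippet recognition scores, e.g. how often the model picked the verbatim text
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]

print(f"AUROC: {roc_auc_score(labels, scores):.2f}")  # 1.00 for this toy example
```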
Additionally, the study noted that earlier models, such as GPT-3.5 Turbo, showed somewhat lower recognition rates than GPT-4o, yet still at levels the researchers deemed significant. GPT-4o mini, by contrast, did not appear to have been trained on the O’Reilly paywalled material, with the researchers suggesting that the DE-COP method may be less effective when applied to smaller language models.