A new research paper suggests that OpenAI may have used copyrighted material to train its artificial intelligence models. The study, published by the non-profit AI Disclosures Project, finds that the company’s newer large language models (LLMs) are better at recognizing copyrighted content than earlier ones. The researchers applied a technique called DE-COP to probe the models’ training data, with notable results regarding their exposure to copyrighted material.
Study Utilizes DE-COP Method to Analyze OpenAI’s Datasets
The study, titled “Beyond Public Access in LLM Pre-Training Data,” set out to investigate whether OpenAI’s models were trained on content from non-public sources, focusing in particular on O’Reilly Media. This U.S.-based online learning platform hosts a large catalogue of copyrighted books, and its founder, Tim O’Reilly, is a co-author of the research.
Using the DE-COP method, the researchers assessed whether the AI models’ training data included copyrighted material. This approach, introduced in a 2024 paper, is a membership inference attack: it tests whether a model can reliably distinguish a verbatim, human-authored passage from machine-generated paraphrases of it, on the premise that a model tends to recognize text it has seen during training.
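As a rough illustration of the idea (not the authors’ actual code), a DE-COP-style probe can be framed as a multiple-choice quiz: the model under test is shown one verbatim passage alongside several paraphrases and asked to pick the original. The prompt wording, option format, and use of the OpenAI chat completions API below are assumptions made purely for this sketch.

```python
# Minimal sketch of a DE-COP-style multiple-choice membership probe.
# Assumptions: the OpenAI Python SDK, a chat model that answers with a letter,
# and paraphrases prepared in advance; this is illustrative, not the paper's code.
import random
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def quiz_model(verbatim: str, paraphrases: list[str], model: str = "gpt-4o") -> bool:
    """Ask the model to pick the verbatim passage from a shuffled set of options.

    Returns True if the model selects the human-authored original; aggregated
    over many snippets, these correct-guess rates feed the AUROC analysis.
    """
    options = paraphrases + [verbatim]
    random.shuffle(options)
    correct_letter = chr(ord("A") + options.index(verbatim))

    prompt = "Which of the following passages is the verbatim original?\n\n"
    prompt += "\n\n".join(
        f"{chr(ord('A') + i)}. {text}" for i, text in enumerate(options)
    )
    prompt += "\n\nAnswer with a single letter."

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith(correct_letter)
```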
To build the test set, the team used the Claude 3.5 Sonnet model to paraphrase 3,962 paragraph snippets drawn from 34 books published by O’Reilly Media.
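The paraphrase-generation step could look something like the following sketch, which assumes the Anthropic Python SDK and a simple rewriting prompt; the researchers’ exact prompt and model settings are not described here.

```python
# Hypothetical sketch of generating a paraphrase with Claude 3.5 Sonnet.
# The model identifier and prompt are assumptions; the study's setup may differ.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def paraphrase(snippet: str) -> str:
    """Return a machine-generated paraphrase of a paragraph snippet."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Paraphrase the following paragraph, preserving its meaning "
                       f"but changing the wording:\n\n{snippet}",
        }],
    )
    return message.content[0].text
```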
Findings revealed that the GPT-4o model demonstrated the strongest ability to recognize copyrighted, subscription-based content from O’Reilly Media, achieving an 82 percent Area Under the Receiver Operating Characteristic curve (AUROC) score. This measure, central to the DE-COP method, is derived from the model’s correct-guess rates in the multiple-choice test, where 50 percent corresponds to random guessing.
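For readers unfamiliar with the metric, AUROC summarizes how well per-snippet recognition scores separate two groups of text (for example, paywalled O’Reilly passages versus control passages): 50 percent is chance, 100 percent is perfect separation. A minimal sketch using scikit-learn, with invented scores purely for illustration:

```python
# Toy AUROC computation with scikit-learn; the labels and scores below are
# made up for illustration and are not data from the study.
from sklearn.metrics import roc_auc_score

# 1 = snippet from the (possibly trained-on) O'Reilly set, 0 = control snippet
labels = [1, 1, 1, 0, 0, 0]
# Per-snippet recognition scores, e.g. how often the model picked the verbatim text
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]

print(f"AUROC: {roc_auc_score(labels, scores):.2f}")  # 1.00 for this toy example
```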
Additionally, the study noted that earlier models, such as GPT-3.5 Turbo, showed somewhat lower recognition rates than GPT-4o, yet still at levels the researchers deemed significant. GPT-4o mini, by contrast, did not appear to have been trained on the O’Reilly paywalled material, with the researchers suggesting that the DE-COP method may be less effective when applied to smaller language models.