
Study Claims OpenAI Models Trained on Copyrighted Works


According to a new research paper, OpenAI may have used copyrighted material to train its artificial intelligence (AI) models. The AI Disclosures Project published findings indicating that the latest large language models (LLMs) from the San Francisco-based company show greater recognition of copyrighted content than their predecessors. The researchers employed a technique called DE-COP to detect copyrighted material in the models' training datasets. Notably, the research found no evidence that the GPT-4o mini model was trained on the specific copyrighted content tested.

DE-COP Methodology Used to Examine OpenAI’s Datasets

The study, titled “Beyond Public Access in LLM Pre-Training Data,” aimed to investigate whether OpenAI’s AI models were trained using non-public literature. The researchers specifically scrutinized O’Reilly Media, a US-based online learning platform known for its extensive library of copyrighted books. Tim O’Reilly, the platform’s founder, co-authored the study.

To evaluate the training data for potential copyright violations, the researchers utilized the DE-COP method, which was introduced in a 2024 publication. This method, recognized as a membership inference attack, administers a multiple-choice assessment to the AI model to determine if it can correctly identify copyrighted material from machine-generated paraphrases.

In their approach, the researchers employed the Claude 3.5 Sonnet model to paraphrase excerpts from copyrighted texts. The analysis featured 3,962 paragraph samples taken from 34 books published by O’Reilly Media.
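The quiz procedure described above can be sketched in a few lines. This is a minimal illustration of a DE-COP-style multiple-choice trial, not the authors' actual code: `ask_model` is a hypothetical stand-in for a call to the LLM being tested, and the passages are invented examples.

```python
import random

def decop_trial(original: str, paraphrases: list[str], ask_model) -> bool:
    """Shuffle the verbatim passage among its paraphrases, quiz the
    model, and report whether it picked the true original."""
    options = paraphrases + [original]
    random.shuffle(options)
    letters = "ABCD"[: len(options)]
    prompt = "Which option is the verbatim passage from the book?\n"
    for letter, text in zip(letters, options):
        prompt += f"{letter}. {text}\n"
    answer = ask_model(prompt)  # hypothetical LLM call; returns e.g. "C"
    correct = letters[options.index(original)]
    return answer == correct

# Illustration only: a mock "perfect recall" model that always finds
# the original passage by exact substring match.
ORIGINAL = "The quick brown fox jumps over the lazy dog."
def perfect_recall(prompt: str) -> str:
    for line in prompt.splitlines():
        if ORIGINAL in line:
            return line[0]
    return "A"

result = decop_trial(
    ORIGINAL,
    ["A fast brown fox leaps over a sleepy dog.",
     "A speedy fox hops across the idle dog.",
     "The brown fox swiftly jumps past the lazy dog."],
    ask_model=perfect_recall,
)
print(result)  # prints True
```

Across many such trials, a model that answers correctly far more often than the 25 percent chance baseline is taken as evidence that the passage appeared in its training data.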

The results indicated that the GPT-4o model exhibited the highest recognition of copyrighted, paywalled O'Reilly content, scoring 82 percent on the Area Under the Receiver Operating Characteristic Curve (AUROC). This metric, used by the DE-COP method, measures how reliably the model's answers separate material it has seen from material it has not; a score of 50 percent corresponds to random guessing.
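AUROC itself has a simple pairwise interpretation: it is the probability that a randomly chosen "seen" item receives a higher recognition score than a randomly chosen "unseen" one. A minimal sketch, assuming per-item recognition scores are already available (the score lists here are invented for illustration):

```python
def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """Fraction of (member, non-member) pairs ranked correctly,
    counting ties as half a correct pair."""
    correct = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                correct += 1.0
            elif m == n:
                correct += 0.5
    return correct / (len(member_scores) * len(nonmember_scores))

# Hypothetical scores: items the model has seen vs. items it has not.
seen = [0.9, 0.8, 0.7]
unseen = [0.3, 0.2, 0.75]
print(auroc(seen, unseen))  # a value between 0.5 (chance) and 1.0
```

By this measure, GPT-4o's reported 82 percent means that in roughly four out of five such pairings, the paywalled O'Reilly passage outscored the comparison material.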

Furthermore, the investigation showed that older OpenAI models, such as GPT-3.5 Turbo, demonstrated lower recognition than GPT-4o, yet still scored above chance. In contrast, the GPT-4o mini model showed no sign of training on the paywalled O'Reilly titles, though the researchers suggested that its smaller size may simply make the test unreliable for that model.
