
Study Claims OpenAI Trained AI on Copyrighted Content


A new research paper suggests that OpenAI may have used copyrighted material to train its artificial intelligence models. The study, published by the non-profit AI Disclosures Project, indicates that the company’s latest large language models (LLMs) recognize copyrighted content markedly better than previous iterations. The researchers employed a technique named DE-COP to probe the models’ training data for copyrighted material, producing notable findings.

Study Utilizes DE-COP Method to Analyze OpenAI’s Datasets

The study, titled “Beyond Public Access in LLM Pre-Training Data,” aimed to investigate whether OpenAI’s models were trained on content derived from non-public sources, particularly focusing on O’Reilly Media. This U.S.-based online learning platform features a vast repository of copyrighted books, and its founder, Tim O’Reilly, contributed as a co-author of the research.

Using the DE-COP method, the researchers assessed whether the AI models’ training data included copyrighted material. This innovative approach, introduced in a 2024 paper, employs a membership inference attack that tests an AI model’s ability to recognize copyrighted content hidden within machine-generated paraphrases.
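The core of the DE-COP test is a multiple-choice quiz: the model is shown one verbatim excerpt alongside several machine-generated paraphrases and asked to pick the verbatim one. A model that saw the book during training picks the original more often than chance. Below is a minimal sketch of how such a quiz item could be assembled; the function name and prompt wording are illustrative, not taken from the paper.

```python
import random


def build_decop_item(original, paraphrases, rng=None):
    """Build one multiple-choice DE-COP-style quiz item.

    Shuffles the verbatim excerpt among its paraphrases and returns
    the prompt text plus the index of the correct (verbatim) option.
    """
    rng = rng or random.Random()
    options = [original] + list(paraphrases)
    rng.shuffle(options)
    correct_index = options.index(original)
    lines = ["Which passage appears verbatim in the book?"]
    for i, option in enumerate(options):
        lines.append(f"{chr(65 + i)}. {option}")  # label options A, B, C, ...
    return "\n".join(lines), correct_index
```

In an actual evaluation, the prompt would be sent to the model under test and its chosen letter compared against `correct_index`; accuracy well above the 1-in-N chance rate suggests the excerpt was in the training data.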

In conducting their research, the team utilized the Claude 3.5 Sonnet model to paraphrase excerpts from 34 books published by O’Reilly Media, drawing from a total of 3,962 paragraph snippets.

Findings revealed that the GPT-4o AI model demonstrated the strongest ability to recognize copyrighted, subscription-based content from O’Reilly Media, achieving an 82 percent Area Under the Receiver Operating Characteristic Curve (AUROC) score. This metric is central to the DE-COP method, summarizing how reliably the model identifies the verbatim excerpt across the multiple-choice tests.
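An AUROC score can be read as the probability that a randomly chosen in-training ("member") excerpt receives a higher recognition score than a randomly chosen held-out ("non-member") excerpt, so 50 percent is chance and 82 percent indicates substantial recognition. A minimal sketch of this pairwise computation, with hypothetical score inputs:

```python
def auroc(member_scores, nonmember_scores):
    """Compute AUROC as the fraction of (member, non-member) pairs
    where the member excerpt scores higher; ties count as half."""
    wins = sum(
        1.0 if m > n else 0.5 if m == n else 0.0
        for m in member_scores
        for n in nonmember_scores
    )
    return wins / (len(member_scores) * len(nonmember_scores))
```

For example, `auroc([0.9, 0.4], [0.3, 0.5])` yields 0.75, since three of the four member/non-member pairs rank the member excerpt higher.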

Additionally, the study noted that earlier models, such as GPT-3.5 Turbo, exhibited somewhat lower recognition rates than GPT-4o, yet still reached levels deemed significant. Conversely, GPT-4o mini did not appear to have been trained on the O’Reilly paywalled material, with the researchers suggesting that the DE-COP method may be less effective when applied to smaller language models.
