Wikipedia Launches AI-Friendly Dataset to Combat Scraping

To steer AI developers away from scraping its site, Wikipedia is launching a dataset tailored for AI model training. The Wikimedia Foundation announced that it is working with Kaggle, the Google-owned data science community platform, to release a beta dataset containing “structured Wikipedia content in English and French.”

Wikimedia has designed this Kaggle-hosted dataset with machine learning requirements in mind, facilitating AI developers’ access to machine-readable article data for purposes such as modeling, fine-tuning, benchmarking, alignment, and analysis. The dataset is openly licensed and includes a variety of content, such as research summaries, short descriptions, image links, infobox data, and sections of articles, but excludes references and non-text elements like audio files.

Wikimedia says the “well-structured JSON representations of Wikipedia content” offered to Kaggle users are a more appealing option than the “scraping or parsing of raw article text.” That scraping has put significant strain on Wikipedia’s servers, as AI bots consume a large share of the platform’s bandwidth. Wikimedia already has content sharing agreements with partners such as Google and the Internet Archive, but the Kaggle partnership is meant to make the data more accessible to smaller firms and independent data scientists.
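To illustrate why structured JSON is easier to work with than scraped article text, here is a minimal Python sketch. It assumes the records arrive as one JSON object per line, and the field names (`name`, `description`, `abstract`, `sections`) are hypothetical placeholders chosen for illustration, not the dataset's documented schema; the sample records are made up as well.

```python
import io
import json

# Made-up sample records in JSON Lines form. The field names are
# illustrative assumptions, not the dataset's documented schema.
SAMPLE_JSONL = "\n".join([
    json.dumps({
        "name": "Ada Lovelace",
        "description": "English mathematician",
        "abstract": "Ada Lovelace was an English mathematician and writer.",
        "sections": [
            {"title": "Early life", "text": "..."},
            {"title": "Work", "text": "..."},
        ],
    }),
    json.dumps({
        "name": "Kaggle",
        "description": "Data science platform",
        "abstract": "Kaggle is a data science community platform.",
    }),
])


def iter_articles(stream):
    """Yield one parsed article dict per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(article):
    """Pick out a few fields, tolerating records where they are missing."""
    return {
        "name": article.get("name", ""),
        "description": article.get("description", ""),
        "section_titles": [s.get("title", "") for s in article.get("sections", [])],
    }


summaries = [summarize(a) for a in iter_articles(io.StringIO(SAMPLE_JSONL))]
for summary in summaries:
    print(summary)
```

With fields already separated like this, there is no HTML or wiki markup to strip before the text can be used for modeling or analysis. To read a real file, the `io.StringIO` object could be swapped for an open file handle, once the actual field names have been checked against the dataset's documentation.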

Brenda Flynn, the partnerships lead at Kaggle, welcomed the arrangement: “As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data. Kaggle is excited to play a role in keeping this data accessible, available, and useful.”
