Wikidata Unveils AI-Friendly Database for Developers

Douglas Adams, the English author, is best known for his 1979 novel The Hitchhiker’s Guide to the Galaxy, but there is far more to know about him than that one book. If you’re curious about his birth sign, or about the numerical code under which his books are classified in libraries worldwide (13230702), those details can be found in one of the lesser-known Wikimedia projects: Wikidata.

On Wikidata, users can explore a trove of images, texts, keywords, and other information about Adams, all organized in formats that both people and machines can read, such as JSON.
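To make the machine-readable side concrete, here is a minimal Python sketch of reading such a record. The JSON below is a hand-trimmed excerpt in the general shape of a Wikidata entity (Q42 is Adams's item, P31 is the "instance of" property, Q5 is "human"); a real record carries many more fields, and the helper names `label` and `item_values` are illustrative assumptions, not part of any Wikidata library.

```python
import json

# Hand-trimmed excerpt in the shape of a Wikidata entity record.
# Illustrative only: a real record for Q42 is far larger.
RAW = """
{
  "id": "Q42",
  "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
  "claims": {
    "P31": [
      {"mainsnak": {"snaktype": "value", "property": "P31",
                    "datavalue": {"type": "wikibase-entityid",
                                  "value": {"entity-type": "item", "id": "Q5"}}}}
    ]
  }
}
"""


def label(entity, lang="en"):
    """Return the human-readable name of the entity in one language."""
    return entity["labels"][lang]["value"]


def item_values(entity, prop):
    """Return the item IDs that `prop` points to, skipping statements without a value."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement["mainsnak"]
        if snak.get("snaktype") != "value":
            continue
        datavalue = snak["datavalue"]
        if datavalue["type"] == "wikibase-entityid":
            ids.append(datavalue["value"]["id"])
    return ids


entity = json.loads(RAW)
print(label(entity), item_values(entity, "P31"))
```

The same record that a person browses as a web page is, to a program, a nested dictionary of labels and statements keyed by property ID.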

Wikidata is now getting a new AI-friendly database, designed to make its information easier for large language models to ingest. The initiative comes from the Wikidata Embedding Project, led by the German Wikimedia chapter, Wikimedia Deutschland. Over the past year, the team has used a large language model to convert 30 million entries from Wikidata’s existing structure into vectors that capture the context and meaning of each entry.

Lydia Pintscher, Wikidata’s portfolio lead, told Technology News in an interview that the vectorized representation can be pictured as a network of interconnected nodes in which Adams, for instance, is linked to the term “human” as well as to the titles of his books.
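One way to read "linked" in vector terms is proximity: items whose vectors point in similar directions count as related. The sketch below illustrates that idea in Python with cosine similarity. The three-dimensional vectors are invented for this example (real embeddings have hundreds of dimensions and are produced by a model), and the `nearest` helper is a hypothetical name, not the project's API.

```python
import math

# Made-up vectors for illustration only; they are not taken from the project.
VECTORS = {
    "Douglas Adams": [0.9, 0.8, 0.1],
    "The Hitchhiker's Guide to the Galaxy": [0.8, 0.9, 0.2],
    "human": [0.7, 0.3, 0.1],
    "Mount Everest": [0.1, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity: near 1.0 for similar directions, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(name, k=2):
    """Return the k other items whose vectors are most similar to `name`."""
    query = VECTORS[name]
    scored = [(cosine(query, vec), other)
              for other, vec in VECTORS.items() if other != name]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


print(nearest("Douglas Adams"))
```

With these toy numbers, the book title and "human" rank closest to Adams, while the unrelated mountain ranks last.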

The user interface will remain unchanged, with assurances from project leaders that Wikipedia is not transitioning into a chatbot. However, the backend improvements will allow AI developers to more readily access data for constructing their own chatbots.

The project aims to empower AI developers who operate outside of large tech corporations, Pintscher explained. While firms like OpenAI and Anthropic have ample resources to vectorize Wikidata, smaller companies stand to gain significantly from enhanced access to this curated data repository. “It’s about providing them with an advantage and a fighting chance,” she stated.

Pintscher highlighted the Govdirectory as a notable example of how Wikidata’s carefully curated information can benefit the public. This platform enables users to locate social media profiles and email addresses of public officials from around the globe.

AI chatbots tend to focus on widely popular terms and topics, but the team behind the new database hopes it will help AI systems reflect a broader range of niche subjects that are often underrepresented. Pintscher said this could be a more effective way of getting information into systems like ChatGPT than the traditional method of “generating a ton of content and waiting for the next round of model training, with uncertain results regarding the contributions made.”

In practice, the new vector format will enable AI systems to better understand the context surrounding data, as well as the data itself, according to Philippe Saadé, Wikidata’s AI project manager, speaking with Technology News.

The initiative utilized a model from Jina AI to convert the structured data from Wikidata, which was last captured on September 18, 2024, into these new vectors. Additionally, IBM’s DataStax is providing the infrastructure for the vector database at no cost to the project.

The team is waiting for feedback from developers who use the database before updating it with information added over the past year. Although the current database does not reflect additions made in recent months, Saadé said that minor edits to existing Wikidata entries will not significantly affect its overall usefulness. “Ultimately, the vector we are creating captures a general concept of an item, so small modifications won’t hinder its effectiveness,” he stated.
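Saadé's point about small edits can be illustrated with a toy Python example. This is not the project's model: plain word counts stand in for a learned embedding, and the item descriptions are invented. The sketch only shows the general behavior he describes, in which a vector summarizing many facts moves little when one fact is added.

```python
import math
from collections import Counter


def toy_vector(text):
    """Word-count vector; a crude stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# Invented descriptions, assumed for illustration only.
before = "douglas adams english author human notable work hitchhiker guide galaxy"
after_edit = before + " humorist"  # hypothetical small edit adding one fact
unrelated = "mount everest highest mountain earth himalayas nepal china"

sim_edit = cosine(toy_vector(before), toy_vector(after_edit))
sim_unrelated = cosine(toy_vector(before), toy_vector(unrelated))
print(round(sim_edit, 3), sim_unrelated)
```

The edited description stays close to the original, while a description of a different item shares nothing with it, so a slightly stale vector can still identify the right item.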

Correction, October 1: This article previously stated the number of entries included in the project as 19 million, which has been corrected to 30 million. Additionally, the name of the project has been updated from Wikipedia Embedding Project to Wikidata Embedding Project and a reference to the Wikimedia Foundation has been modified to reflect the Wikimedia movement.
