Wikipedia might give AI models access to site for ‘knowledge base’


Wikimedia Deutschland announced a new database today, October 1, that allows AI models to access Wikipedia’s extensive knowledge base.

This project is called the Wikidata Embedding Project. It employs a vector-based semantic search—a method that helps computers understand the meaning and relationships between words—to explore over 120 million articles on Wikipedia and its sister sites.
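To illustrate the idea behind vector-based semantic search, here is a minimal sketch in Python. It is not the project's code: the three-dimensional vectors are invented for illustration, and the `EMBEDDINGS` table and `semantic_search` function are hypothetical names. Real systems use embeddings with hundreds of dimensions produced by a trained model.

```python
import math

# Made-up toy embeddings; in practice a trained model produces these vectors.
EMBEDDINGS = {
    "scientist": [0.9, 0.8, 0.1],
    "researcher": [0.85, 0.75, 0.2],
    "scholar": [0.7, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query, k=2):
    """Return the k terms whose vectors lie closest to the query term's vector."""
    query_vec = EMBEDDINGS[query]
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in EMBEDDINGS.items()
        if term != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:k]]


results = semantic_search("scientist")
print(results)
```

With these toy vectors, "researcher" and "scholar" rank above "banana" for the query "scientist", even though none of the words share any letters in common patterns a keyword search could match on.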

The initiative improves the accessibility of data for natural language queries from large language models (LLMs) and introduces support for the Model Context Protocol (MCP), a standard enabling communication between AI systems and data sources.
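MCP messages are exchanged as JSON-RPC 2.0, and a client asks a server to run a tool with a `tools/call` request. The sketch below only builds such a request in Python; the tool name `search_items` and its `query` argument are assumptions made for illustration, not the names exposed by the Wikidata server.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking an MCP server to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "search_items" and "query" are hypothetical; a real client would first
# ask the server which tools it offers and use those names instead.
request = make_tool_call(1, "search_items", {"query": "scientist"})
print(json.dumps(request, indent=2))
```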


The project was developed by Wikimedia’s German division in partnership with Jina, a neural search company, and DataStax, a real-time training data provider owned by IBM.

Although Wikidata has long provided machine-readable data from Wikimedia projects, existing tools were limited to keyword searches and the specialised query language SPARQL. The new approach is more compatible with retrieval-augmented generation (RAG) systems, which allow AI models to incorporate external data, helping developers build models based on verified Wikipedia content.
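A RAG pipeline, in rough outline, retrieves relevant passages and places them in the prompt before the model answers. The Python sketch below shows that flow under simplifying assumptions: the `PASSAGES` list is a hypothetical stand-in for a real database, and retrieval uses crude word overlap where a production system would use embedding similarity.

```python
import re

# Hypothetical stand-in corpus; a real system would query a vector database.
PASSAGES = [
    "Marie Curie was a physicist and chemist who researched radioactivity.",
    "Bell Labs is an industrial research company founded in 1925.",
    "The banana is an edible fruit produced by plants in the genus Musa.",
]


def words(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, passages, k=1):
    """Rank passages by word overlap with the question (a stand-in for embeddings)."""
    q_words = words(question)
    ranked = sorted(passages, key=lambda p: len(q_words & words(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question for the model to use."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "Who researched radioactivity?"
top = retrieve(question, PASSAGES)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting prompt would then be sent to a language model; that call is left out here because it depends on the provider.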

The data is organised to provide key semantic context. For instance, searching for “scientist” yields lists of notable nuclear scientists, those affiliated with Bell Labs, translations of the term in various languages, images of scientists at work, and related concepts such as “researcher” and “scholar.”

This publicly accessible database is available on Toolforge, and Wikidata will host a developer webinar on October 9. The initiative comes at a time when AI developers are seeking high-quality data sources to improve model training.


As training systems become more sophisticated and operate within complex environments, they require carefully curated data for optimal performance. Reliable data is especially vital for applications demanding high accuracy. Despite some scepticism towards Wikipedia, its information tends to be far more fact-oriented than broad datasets such as Common Crawl, a large collection of web pages scraped from across the internet.

The demand for quality data can be costly for AI companies, as highlighted by Anthropic’s $1.5 billion settlement to resolve a lawsuit from authors whose works were used as training data.

Wikidata AI project manager Philippe Saadé stressed that the project remains independent of major AI labs. The launch of the Embedding Project, he said, demonstrates that powerful AI can be open and collaborative rather than monopolised by large corporations.

© IE Online Media Services Pvt Ltd



