Wikipedia might give AI models access to site for ‘knowledge base’ | Technology News


Wikimedia Deutschland announced a new database today, October 1, that allows AI models to access Wikipedia’s extensive knowledge base.

The project is called the Wikidata Embedding Project. It applies vector-based semantic search, a method that helps computers understand the meaning of words and the relationships between them, to over 120 million entries on Wikipedia and its sister sites.
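As a rough illustration of vector-based semantic search, the sketch below ranks terms by cosine similarity between embedding vectors. The three-dimensional vectors are hand-made toy values and the function names are hypothetical; none of this comes from the Wikidata Embedding Project, which would use a trained embedding model and a vector database.

```python
import math

# Toy 3-dimensional "embeddings". These values are invented for illustration;
# a real system would obtain vectors from a trained embedding model.
EMBEDDINGS = {
    "scientist": [0.9, 0.8, 0.1],
    "researcher": [0.85, 0.75, 0.2],
    "scholar": [0.7, 0.9, 0.15],
    "banana": [0.05, 0.1, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query, top_k=2):
    """Rank every other term by similarity to the query term's vector."""
    query_vec = EMBEDDINGS[query]
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in EMBEDDINGS.items()
        if term != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


results = semantic_search("scientist")
for term, score in results:
    print(f"{term}: {score:.3f}")
```

With these toy vectors, "researcher" and "scholar" rank well above "banana" because their vectors point in nearly the same direction as "scientist", even though the words share no characters.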

The initiative improves the accessibility of data for natural language queries from large language models (LLMs) and introduces support for the Model Context Protocol (MCP), a standard enabling communication between AI systems and data sources.
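For readers unfamiliar with MCP, the sketch below builds the kind of message an AI client might send to a data source. MCP messages follow JSON-RPC 2.0, and a tool invocation uses the `tools/call` method. The tool name `search_wikidata` and its arguments are hypothetical placeholders, not the project's actual interface.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Serialize an MCP-style JSON-RPC 2.0 request that invokes a tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "search_wikidata", {"query": "scientist"})
print(request)
```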


The project was developed by Wikimedia’s German division in partnership with Jina, a neural search company, and DataStax, a real-time training data provider owned by IBM.

Although Wikidata has long provided machine-readable data from Wikimedia projects, existing tools were limited to keyword searches and the specialised query language SPARQL. The new approach is more compatible with retrieval-augmented generation (RAG) systems, which allow AI models to incorporate external data, helping developers build models based on verified Wikipedia content.
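The retrieval-augmented generation pattern can be sketched in a few lines: retrieve the passages most relevant to a question, then place them in the prompt sent to a model. This is a minimal sketch under stated assumptions. The knowledge base, function names, and prompt wording are all hypothetical, and retrieval here uses simple word overlap as a stand-in for the semantic search a real system would use.

```python
import re

# A tiny stand-in knowledge base; a real system would query an external source.
KNOWLEDGE_BASE = [
    "Marie Curie was a physicist and chemist who won two Nobel Prizes.",
    "Bell Labs was an industrial research laboratory in the United States.",
    "SPARQL is a query language for RDF data.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Return the top_k documents sharing the most words with the question.

    Word overlap is a placeholder for vector-based retrieval.
    """
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents):
    """Assemble a prompt that grounds the model in the retrieved passages."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


prompt = build_prompt("Who was Marie Curie?", KNOWLEDGE_BASE)
print(prompt)
```

The model itself is left out on purpose: the point of the pattern is that the external data reaches the model through the prompt rather than through training.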

The data is organised to provide key semantic context. For instance, searching for “scientist” yields lists of notable nuclear scientists, those affiliated with Bell Labs, translations of the term in various languages, images of scientists at work, and related concepts such as “researcher” and “scholar.”

This publicly accessible database is available on Toolforge, and Wikidata will host a developer webinar on October 9. The initiative comes at a time when AI developers are seeking high-quality data sources to improve model training.


As training systems become more sophisticated and operate within complex environments, they require carefully curated data for optimal performance. Reliable data is especially vital for applications demanding high accuracy; despite some scepticism towards Wikipedia, its information tends to be far more factual than broad datasets like Common Crawl, which scrapes diverse web pages from across the internet.

The demand for quality data can be costly for AI companies, as highlighted by Anthropic’s $1.5 billion settlement to resolve a lawsuit from authors whose works were used as training data.

Wikidata AI project manager Philippe Saadé stated that the project remains independent of major AI labs, emphasising that the launch of the Embedding Project demonstrates that powerful AI can be open and collaborative, rather than monopolised by large corporations.

© IE Online Media Services Pvt Ltd



