The Diplomat
Kazakhstan’s Bid For AI Sovereignty
Central Asia

Astana is talking a big game on artificial intelligence, but can it deliver?

By Joe Luc Barnes

On March 13, Kazakhstan’s President Kassym-Jomart Tokayev met with Thomas Pramotedham, the CEO of Presight AI, an artificial intelligence firm, to discuss plans for a supercomputer cluster in the country. The project is part of a slew of government initiatives to position Kazakhstan as a regional leader in artificial intelligence.

Astana’s hopes for the technology are not merely economic. There is also a cultural aspect to the push, with a strong domestic AI industry seen as vital for linguistic preservation.

However, as a recent delay to the supercomputer project demonstrates, even the best-laid plans can fall victim to geopolitical forces. While Kazakhstan might talk a big game on AI, can it deliver?

Controlling The Narrative

Large language models, or LLMs, which process, understand, and generate human language, are the basis of AI programs such as ChatGPT. These models are overwhelmingly trained on a handful of dominant languages, such as English, Mandarin, and Spanish, while smaller languages like Kazakh are often overlooked.

“While the larger LLMs are adding additional languages, these languages are not necessarily supported to an equal extent,” said Preslav Nakov, department chair and professor of natural language processing at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in Abu Dhabi. “LLMs use neural networks and have a limited capacity; their developers inevitably ask themselves whether they want to invest in using that capacity to support more languages or to improve in other areas, such as reasoning capabilities.”

The secondary importance given to smaller languages leads to AI models that promote a Western world view, said Dion Wiggins, CTO of Omniscien Technologies, a firm specializing in AI-driven language processing solutions. “If you go to Grok or Llama or ChatGPT, they’re more or less all the same because they all learn from the same data,” he said.

However, if countries such as Kazakhstan could produce their own LLMs, it would mean more control over the narrative.

“If you have a sovereign LLM, it’s got Kazakh morals, Kazakh history, Kazakh lenses, and a viewpoint from this part of the world,” said Wiggins. He cited China’s DeepSeek, which limits access to information on the Tiananmen Square massacre, and Google’s Gemini, which refuses to answer a simple question such as “Who is the President of the United States?” as examples of how we are already seeing AI being used for censorship.

Mind Your Language

LLMs require enormous amounts of training data to be effective.

“And there’s the problem,” said Wiggins. “There’s just not much Kazakh data.”

One of the largest data sources for AI training is Common Crawl, a nonprofit that archives online information and makes it freely available to the public. Its statistics show a huge linguistic bias: 43.4 percent of Common Crawl web pages are in English. In fact, over 70 percent of all web-based data is from seven major languages: English, Russian, German, Japanese, Chinese, Spanish, and French.

Kazakh accounts for just 0.0298 percent. In other words, if you sampled 10,000 web pages at random, roughly three would be in Kazakh, 605 in Russian, and 4,337 in English.
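
For readers who want to check that arithmetic, here is a minimal Python sketch of the conversion from a percentage share to an expected page count. Only the Kazakh share (0.0298 percent) is quoted directly above; the Russian and English shares used here (6.05 and 43.37 percent) are assumptions inferred from the per-10,000 figures, and the function name is purely illustrative.

```python
def pages_per_sample(share_percent: float, sample_size: int = 10_000) -> int:
    """Expected number of pages in a random sample, rounded to the nearest page."""
    return round(share_percent / 100 * sample_size)


# Shares of Common Crawl pages, in percent.
# Kazakh is quoted in the article; Russian and English are assumed values
# back-calculated from the per-10,000 figures in the text.
shares_percent = {
    "Kazakh": 0.0298,
    "Russian": 6.05,
    "English": 43.37,
}

for language, share in shares_percent.items():
    print(f"{language}: {pages_per_sample(share)} per 10,000 pages")
```

The rounding matters for Kazakh: the exact expectation is 2.98 pages, which is why the figure is best read as "about three."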

The Authors

Joe Luc Barnes is a British journalist and author who focuses on the countries of the former Soviet Union. He has a master’s degree in Russian and East European Politics from the University of Oxford. His book, “Soviet Supernova: Travels in the Former USSR,” comes out later this year.
