The 22-Language Problem

The English-speaking professional cohort is roughly 1% of India. Building only for them, when the technology to serve the other 99% exists, leaves a workforce roughly 99 times larger unserved.

AI Sovereignty · Essay 014 of 132

Every demo of "Indian AI" that runs in English is, depending on how you count, between two and twenty times less useful than its makers think it is. India has twenty-two languages listed in the Eighth Schedule of the Constitution, more than a hundred and twenty languages with more than ten thousand speakers, and hundreds of regional dialects within each of those. The professional class that operates fluently in English numbers somewhere between ten and fifty million people, depending on what "fluently" means. The country is a billion four hundred million. The math is unambiguous. A product that only speaks English is a product that has decided, before launch, to serve at most about four percent of the country.

This is not an oversight. It is a structural choice. Building for English is cheaper, faster, and produces a demo that wins applause at international conferences. Building for the other twenty-one languages is harder, slower, and produces a product that mostly nobody in San Francisco will praise. The market is uneven about which kind of work it celebrates. The country is unambiguous about which kind of work it actually needs.

What "supports Hindi" usually means

A demo that "supports Hindi" usually means the product can take an English query, translate it to Hindi using a machine translator, and produce a response that is then translated back. The output is grammatical, often. It is also stilted, frequently wrong about Indian context, and unable to handle the actual way Hindi is spoken in Indian professional life, which is, almost always, code-mixed with English. "Mujhe ek doctor chahiye jo skin allergy treat karta ho near Bandra." That sentence is how Indians actually speak. A real Hindi AI handles it. A translated-pipeline Hindi AI breaks.
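To make the code-mixing point concrete, here is a minimal sketch that tags each word of the example sentence as romanized Hindi or English. The two word lists are hypothetical and hand-picked for this one sentence; a real system would rely on a trained token-level language identifier rather than a lookup. The function names `tag_tokens` and `is_code_mixed` are likewise illustrative.

```python
import re
from collections import Counter

# Hypothetical, hand-picked lexicons covering only the example sentence.
# They stand in for a real token-level language identifier.
ROMANIZED_HINDI = {"mujhe", "ek", "chahiye", "jo", "karta", "ho"}
ENGLISH = {"doctor", "skin", "allergy", "treat", "near"}


def tag_tokens(sentence: str) -> list[tuple[str, str]]:
    """Label each alphabetic token as 'hi', 'en', or 'other'."""
    tagged = []
    for token in re.findall(r"[A-Za-z]+", sentence):
        lowered = token.lower()
        if lowered in ROMANIZED_HINDI:
            tagged.append((token, "hi"))
        elif lowered in ENGLISH:
            tagged.append((token, "en"))
        else:
            # Unknown to both lists, e.g. a place name such as "Bandra".
            tagged.append((token, "other"))
    return tagged


def is_code_mixed(tagged: list[tuple[str, str]]) -> bool:
    """True when the sentence contains both Hindi and English tokens."""
    labels = {label for _, label in tagged}
    return {"hi", "en"} <= labels


sentence = "Mujhe ek doctor chahiye jo skin allergy treat karta ho near Bandra."
tagged = tag_tokens(sentence)
counts = Counter(label for _, label in tagged)

print(tagged)
print(dict(counts), "code-mixed:", is_code_mixed(tagged))
```

Under these toy word lists the sentence splits almost evenly between the two languages, in a single script, within a single clause. A pipeline that assumes the whole input is one language has no sensible point at which to apply translation.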

The same problem cascades across every other Indian language, often worse. Tamil, Telugu, Bengali, and Marathi have reasonable training data and reasonable models. Malayalam, Kannada, Gujarati, Punjabi, and Odia have less. Assamese, Maithili, Konkani, and the long tail of Eighth Schedule languages have very little, as does Bhojpuri, which is not on the Schedule at all. The North-East languages are nearly absent from any major model's training data. Each of these is a regional economy with millions of professionals whose work is mediated by AI tools that, technically, "support" their language and, in practice, don't.

A demo that "supports Hindi" usually means it can translate English to Hindi and back. That is not Hindi support. That is English with a costume.

The 99x math

The 99x figure is not a rhetorical flourish; it is a direct consequence of the addressable-market arithmetic. An AI product that works only in English addresses a few tens of millions of Indians. An AI product that works competently in Hindi, Tamil, Telugu, Bengali, Marathi, and Gujarati addresses a few hundred million. An AI product that works competently in all twenty-two scheduled languages, including the long tail, addresses essentially the entire country. The market expansion from monolingual English to genuinely multilingual is, by population, somewhere between two and a hundred times. By professional outcomes, because non-English speakers are systemically under-served and therefore have higher willingness to pay for a tool that works, it is even larger.
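The arithmetic can be sketched in a few lines, using only figures this essay gives: a population of 1.4 billion, an English-fluent cohort of ten to fifty million, and the roughly 1% headline share. The function names are hypothetical, and the inputs are the essay's own estimates rather than census data.

```python
# Figures taken from the essay; treat them as rough estimates.
POPULATION = 1_400_000_000
ENGLISH_FLUENT_LOW = 10_000_000
ENGLISH_FLUENT_HIGH = 50_000_000


def share(english_fluent: int) -> float:
    """Fraction of the country an English-only product can address."""
    return english_fluent / POPULATION


def unserved_multiple(english_share: float) -> float:
    """How many times larger the unserved population is than the served one."""
    return (1 - english_share) / english_share


scenarios = {
    "low estimate (10M fluent)": share(ENGLISH_FLUENT_LOW),
    "headline figure (1%)": 0.01,
    "high estimate (50M fluent)": share(ENGLISH_FLUENT_HIGH),
}

for label, english_share in scenarios.items():
    multiple = unserved_multiple(english_share)
    print(f"{label}: {english_share:.1%} served, {multiple:.0f}x unserved")
```

With these inputs the unserved multiple comes out at roughly 139x, 99x, and 27x respectively. A looser definition of fluency that counts more English speakers pulls the figure lower, which is why the range quoted here is so wide.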

The country's leading AI products have, almost without exception, optimized for the wrong end of this curve. They have built beautiful English-first interfaces, then bolted on translation, then declared multilingual support. Meanwhile, the highest-leverage AI products in India over the next decade will be the ones that decide, from day one, that the default user does not speak English.

What good multilingual design looks like

Good multilingual AI design is not translation. It is native handling of code-mixed input, native generation in the user's preferred register (formal Marathi sounds different from spoken Marathi, which sounds different again from Marathi-English code-mix), native respect for honorifics and forms of address, and native awareness of culture-specific reasoning patterns. None of this can be added at the end of the development cycle. All of it has to be designed in from the first week.

It also requires native data. Most large language models trained on internet data inherit the English internet's blind spots about India. They will tell you, confidently, that some town in Maharashtra is in Madhya Pradesh, that some politician was Chief Minister of the wrong state, that some folk tradition belongs to the wrong region. These errors are not random; they are systematic, the residue of a corpus that was English-internet-dominated. A model trained additionally on careful Indic data corrects most of these errors. A model designed natively for multilingual Indian use is qualitatively different.

The professional class will lead this

The professional class will lead this transition before the consumer market does, because the cost of getting it wrong is higher in professional contexts. An English-only AI is a charming consumer toy in Tier-2 India and a deadly tool in a clinic, court, or government office. Once Indian doctors, lawyers, and civil servants start integrating AI into their actual work, the demand for genuinely native multilingual tools becomes structural. They cannot accept a tool that mistranslates a patient's complaint or a defendant's testimony. The professional cohort is the forcing function.

This is one of several reasons that AI.Bharath.CLUB exists. The community of working professionals across Indian languages is the testbed, the feedback loop, and the labour pool for the multilingual AI that the country needs. Without that community, no model team, however well-funded, has the signal to know whether they have got the language right. With that community, the signal arrives every week, from chapters and tables and asks and stories.

The opportunity at the bottom of the language curve

The boring truth is that the most valuable multilingual AI work in India over the next decade will not be in Hindi, Tamil, or Bengali. Those are the easy languages, the ones the global labs have already noticed. The valuable work is in Bhojpuri, Maithili, Awadhi, Magahi, Marwari, Khasi, Mizo, Tulu, Konkani, and the dozens of other languages with millions of speakers and almost no training data. Each of these is a professional and economic ecosystem currently locked out of the AI stack. Building tools that serve them is not a charity exercise; it is the highest-margin, lowest-competition AI work available in India today. The 99x is sitting there, waiting.

Join the conversation

This essay is part of an ongoing community. If it resonated, the next step is to be in the room.
