Bharat's Foundation Model Question, Bharath.club

The conversation about whether India should build a foundation model has been stuck in a binary for two years. On one side, the position that India must build a frontier-scale model to remain sovereign in AI. On the other, the position that doing so is a vanity project and the country should instead use available open models. Both positions are partially right and largely missing the actual question, which is more specific and more useful: what is the smallest, most carefully Indian model that produces a measurable improvement in the work of Indian professionals, and what would it take to build it?

The frontier-scale framing imports the wrong reference point. It assumes that what matters is being able to compete with GPT or Gemini on standardized English benchmarks. That competition is, for all serious purposes, settled, the closed labs have multi-billion-dollar training budgets and a several-year head start, and the relevant gap is not closeable on any reasonable Indian budget. Importing the frame from that competition leads to projects priced for vanity and shipped for press releases. The actual question is different.

What "useful" means here

A model is useful if it does the work Indian professionals need to do, in the languages and registers they use, with the cultural context they operate in, at a cost the country can sustain. Most of the work Indian professionals need to do does not require a hundred-billion-parameter model. It requires a competent model in their language, with their cultural awareness, with proper handling of code-mixing, with grounding in Indian legal, medical, agricultural, and educational realities. A well-trained seven-billion-parameter model on careful Indian data can do this work better, for the relevant set of tasks, than any foreign frontier model can.

This is a deeply unfashionable claim because the press cycle has trained everyone to equate "better" with "bigger." The press cycle is mistaken about this for a specific reason. The performance of a model on a task depends much more on the quality of its training data for that task than on its sheer parameter count. A model with five times the parameters but trained on data that is essentially absent in your domain will lose, in your domain, to a smaller model trained well on the domain. This pattern holds across every benchmark a serious Indian team has run on Indic tasks. The 7B model trained well wins.

Parameter count is the wrong axis. Domain coverage in training data is the right axis. A 7B model trained on careful Indian data beats a 70B model trained on the English internet, for Indian work.

The hundred-million-rupee version

A serious Indic foundation model program at the seven-billion-parameter scale, built deliberately, with proper data curation, evaluation, and deployment infrastructure, is approximately a hundred-million-rupee program over three years. That includes the data work, the training compute, the evaluation pipeline, and the operational stack. It does not include the marketing budget, which should be small, because the model proves itself by being used.

A hundred million rupees is a serious sum, but it is not unreasonable for a national infrastructure project of this importance. It is less than the country spends on a single highway interchange. It is less than a single mid-sized fundraise by an Indian startup. It is well within the budget of a public-private consortium operating with a clear mandate. The financing is not the bottleneck. The clarity of purpose is.

The hundred-billion-rupee trap

The hundred-billion-rupee version, by contrast, is the trap. That is roughly what it would cost to attempt a credible frontier-scale model, hundreds of thousands of GPUs, multi-year training cycles, the full complement of frontier research talent. The country can afford it, in principle. The country cannot afford to pour that scale of investment into a project that will, at the end, produce a model that is twelve months out of date by the time it ships and that solves the wrong problem anyway. The opportunity cost is enormous. The same hundred billion rupees, distributed across data curation, evaluation infrastructure, deployment subsidies, and a workforce training program, would produce far more durable national AI capacity.

This is not an argument against ever doing larger-scale model work. It is an argument for sequencing. Build the small Indic model well, prove the deployment loop, train the workforce that can do this kind of work, and then, once the loop is producing reliable national capability, consider the next-scale ambition. Sequencing matters. Doing the harder thing first, on a budget designed for show, is the failure mode that costs countries decades.

What the data work looks like

The data work is the most important, least glamorous, most under-resourced piece of any serious Indic model project. It is, broadly: collect a wide corpus of high-quality Indian-language text from books, journals, government records, and curated web sources; clean and de-duplicate it carefully; build conversational and reasoning datasets that capture how Indians actually think and talk in each language; build domain-specific datasets for the priority sectors (health, legal, agriculture, education, civic); pair the data work with serious evaluation harnesses that test for the things the model will be asked to do.

This is years of careful work by linguistically literate, domain-aware, technically competent teams. It is exactly the kind of work that India is well-positioned to do and that no foreign lab will do for the country. The people who can do it well are scattered across the country's universities, professional communities, and language-specific groups. Coordinating them is one of the most important roles a national-scale community can play.

The deployment loop

The model is half the work. The other half is the deployment loop, the infrastructure that takes the model from a set of weights to a tool that an Indian professional can actually use. This includes the inference infrastructure, the fine-tuning workflows for domain customization, the evaluation harnesses for ongoing monitoring, the user feedback mechanisms, and the training programs for the workforce that will deploy it.

A serious Indic foundation model program designs the deployment loop from day one. It does not treat deployment as something that happens after the model is "done," because the model is never done in any meaningful sense; it is continuously improved by feedback from the deployment loop. The loop is the model's nervous system. A model without a deployment loop is an experiment, not infrastructure.

The role of the community

The community's role in this is specific and crucial. It is the data partner, providing the linguistic and domain expertise that the model needs. It is the evaluation partner, testing the model in real working contexts and feeding back the results. It is the deployment partner, taking the model into chapters, tables, and sectors across the country. It is the political partner, making the case for the model's continued investment to procurement officers, policymakers, and the press.

Bharath.CLUB, AI.Bharath.CLUB, Sarasvat.ai, and Eval.qa are pieces of this community-and-tooling architecture. None of them, alone, can produce the Indic foundation model. Together, with a clear mandate and a modest budget, they can produce something more durable than the foundation model itself, the national capacity to keep making foundation models that work for Indians, year after year, as the technology evolves. That capacity is what sovereignty actually means in this domain. The seven-billion-parameter model is the first deliverable. The capacity is the prize.

Bharat's Foundation Model Question

What "useful" means here

The hundred-million-rupee version

The hundred-billion-rupee trap

What the data work looks like

The deployment loop

The role of the community

Join the conversation

Read next

¶What "useful" means here

¶The hundred-million-rupee version

¶The hundred-billion-rupee trap

¶What the data work looks like

¶The deployment loop

¶The role of the community

Join the conversation

Read next

What "useful" means here

The hundred-million-rupee version

The hundred-billion-rupee trap

What the data work looks like

The deployment loop

The role of the community