In a fintech office in Bengaluru last month, a team I spoke with proudly demonstrated that they had shipped seventeen LLM-powered features in the previous quarter. When I asked which of those seventeen had a formal evaluation suite, the answer was two. When I asked which of those two had been re-evaluated after the last model upgrade, the answer was zero. This is not a careless team. This is one of the better-resourced product teams in the country. And it is the rule, not the exception.
We have entered a period where the velocity of building AI features is decoupled from the velocity of understanding what we built. A junior engineer can wire a Retrieval-Augmented Generation pipeline to a domain corpus in an afternoon. The same engineer, asked to produce a defensible evaluation of that pipeline's behavior across the linguistic, regional, and adversarial space it will actually face, would need months. The build curve and the understand curve are diverging, and the area between them is debt. Quiet, compounding, dangerous debt.
How the Gap Compounds
Software debt is roughly linear: you write code that is hard to maintain, and you pay interest on it later. Evaluation debt is not linear. It compounds, because every untested model interaction becomes the substrate for the next untested model interaction. A retrieval system that subtly mishandles Devanagari numerals feeds an agent that summarises across documents, which in turn feeds a reasoning layer that drafts customer responses. By the time the wrongness surfaces, it has been laundered through three abstraction layers, and you cannot easily attribute the error to any single component.
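To make the compounding concrete, here is a minimal sketch of how per-stage accuracy multiplies through a chain like the one above. It assumes each stage fails independently, which is a simplification: real errors are often correlated, and a downstream stage can mask or amplify an upstream mistake. The function name and the 0.95 figures are hypothetical, chosen only for illustration.

```python
from math import prod


def end_to_end_accuracy(stage_accuracies):
    """Probability that every stage is right, ASSUMING independent errors."""
    return prod(stage_accuracies)


# Hypothetical per-stage accuracies: retrieval, summarisation, drafting.
stages = [0.95, 0.95, 0.95]

print(f"{end_to_end_accuracy(stages):.3f}")  # prints 0.857
```

Three stages that each look comfortable in isolation already lose about fourteen points end to end, and nothing in the final output says which stage was responsible.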
In a healthcare triage pilot I reviewed in a Tier-2 district hospital in Maharashtra, the system was right about 92 percent of the time on the team's internal benchmark. It was right about 71 percent of the time on a stratified sample of real patient queries collected over six weeks. The 21-point gap was not malice or negligence. It was simply that the benchmark had been built faster than the team's understanding of who their users actually were. The benchmark was a snapshot of last quarter's assumption about the population, and the population had drifted, as populations do.
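A gap of this shape can appear from population mix alone, even when the model's behavior on each kind of query never changes. The sketch below reweights per-stratum accuracy by two different mixes. Every number in it is invented for illustration: the strata, the accuracies, and the shares are assumptions, not data from the pilot, and they are picked only so the arithmetic lands near 92 and 71 percent.

```python
def weighted_accuracy(accuracy_by_stratum, share_by_stratum):
    """Overall accuracy as a share-weighted average of per-stratum accuracy."""
    total_share = sum(share_by_stratum.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("stratum shares must sum to 1")
    return sum(
        accuracy_by_stratum[name] * share
        for name, share in share_by_stratum.items()
    )


# All figures below are hypothetical, for illustration only.
accuracy = {"english": 0.95, "code_switched": 0.70, "transliterated": 0.60}
benchmark_mix = {"english": 0.90, "code_switched": 0.05, "transliterated": 0.05}
production_mix = {"english": 0.20, "code_switched": 0.40, "transliterated": 0.40}

print(round(weighted_accuracy(accuracy, benchmark_mix), 2))
print(round(weighted_accuracy(accuracy, production_mix), 2))
```

The per-stratum numbers are identical in both calls. Only the assumed population changes, which is why a benchmark needs to be re-stratified against real traffic rather than trusted as a fixed yardstick.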
Why Indian Deployments Magnify the Problem
Indian AI deployments stress models along axes that Western evaluation suites largely do not anticipate. Code-switching between English, Hindi, and a regional language inside a single utterance. Transliterated input that looks like English but is phonetic Marathi or Tamil. Domain knowledge that lives in PDFs scanned at 150 DPI by a clerk in 2014. Users who treat the chatbot as a relative, not a form. Each of these is a dimension along which the model's behavior is undefined unless someone has explicitly measured it.
The MeitY-backed AI safety institute proposals, the early work emerging from academic eval labs in Hyderabad and Chennai, and the quiet eval teams inside the larger Indian fintechs are all steps in the right direction. But they remain small islands in an ocean of unevaluated production traffic. We have, at a rough community estimate, more than ten thousand production AI features in active Indian deployment today. We have perhaps a few hundred engineers whose full-time job is to evaluate them.
The Hundredfold Asymmetry
The asymmetry is not five-to-one or ten-to-one. By any reasonable accounting of person-hours per feature, building is between fifty and a hundred times faster than evaluating, when evaluating is done properly. Properly here means: representative test sets, adversarial coverage, regression suites that run on every model update, behavioral drift monitors, human-in-the-loop spot checks, and a process for incorporating real production failures back into the test corpus.
Almost no one does all six. Most teams do one, sometimes two. The ones who do three consider themselves industry-leading, and they are correct, which is itself a damning indictment of the industry.
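Of the six, the regression suite that runs on every model update is probably the cheapest to start. What follows is a minimal sketch, not a recommended framework: `EvalCase`, `passes`, and `find_regressions` are hypothetical names, the two models are stubs standing in for real endpoints, and substring matching is a deliberately crude pass criterion that a real suite would replace with something domain-appropriate.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str  # substring the answer must contain (crude, by assumption)


def passes(model: Callable[[str], str], case: EvalCase) -> bool:
    """A case passes if the expected substring appears in the model's answer."""
    return case.expected.lower() in model(case.prompt).lower()


def find_regressions(
    cases: List[EvalCase],
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
) -> List[str]:
    """IDs of cases the baseline model passed but the candidate model fails."""
    return [
        c.case_id
        for c in cases
        if passes(baseline, c) and not passes(candidate, c)
    ]


# Stub models standing in for the current and the upgraded model.
def baseline_model(prompt: str) -> str:
    return "Your EMI is due on the 5th."


def candidate_model(prompt: str) -> str:
    return "I cannot help with that."


# A code-switched query of the kind a production failure might contribute.
cases = [EvalCase("emi-date", "mera EMI kab due hai?", "5th")]

regressions = find_regressions(cases, baseline_model, candidate_model)
print(regressions)  # ['emi-date']
```

Run on every model upgrade, a non-empty list is a reason to hold the deploy. Appending each real production failure to `cases` is how the sixth practice feeds back into this one.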
What Closing the Gap Looks Like
Closing the gap does not mean slowing down building. It means industrialising evaluation. In the 2000s, India's services economy industrialised software testing, turning what was an artisan activity in San Jose into a disciplined, repeatable process delivered at scale from Pune and Chennai. The next decade requires us to do the same for the evaluation of AI behavior. The skill profile is adjacent but distinct: it requires linguistic fluency, domain depth, statistical literacy, and a willingness to sit with ambiguity that traditional QA did not require.
There is a labor advantage here that India should not waste. The same critical, curious, English-fluent workforce that built the global BPO industry is uniquely positioned to become the global AI evaluation workforce. But this requires us to first acknowledge, at a community and policy level, that evaluation is a discipline, not an afterthought. It needs career paths, tooling, certifications, and respect.
The Action
If you ship AI features and you do not yet have an evaluation engineer on your team, hire one before you ship the next feature. If you have one, give them veto power over deploys. If you are early in your career and looking for asymmetric leverage, learn evaluation deeply: it is the single most underpriced skill in Indian AI right now, and the gap between supply and demand will not close for at least a decade. Build less this quarter. Understand more. The compounding works in your favor only if you start now.