When test-driven development arrived in Indian engineering culture in the mid-2000s, the response was uneven. The first wave of teams adopted it ritualistically and produced bloated test suites that tested the wrong things. The second wave rejected it as Western dogma that did not fit their delivery model. The third wave, slowly, came to understand what TDD was actually about: not writing tests first as a ceremony, but using the act of specifying a test as the act of clarifying what you intended to build. By 2015, the best Indian engineering teams were doing some form of TDD without calling it that, and their code was better for it.
We are at the beginning of an analogous arc with what I will call eval-driven development. The premise is simple. Before you write the prompt, the chain, the agent, or the fine-tune, you write the evaluation that will tell you whether what you build does what you wanted. You write a set of representative inputs, the outputs you would consider acceptable, the outputs you would consider wrong, the outputs you would consider catastrophic, and the way you will score the distance between any given output and these reference points.
You do this before you write a single line of model-facing code. And then everything you build is a candidate solution to a problem you have already, in writing, defined.
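To make the shape of such an eval concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `EvalCase` fields, the three-level scoring (1.0, 0.0, -1.0), and the exact-match comparison are placeholders, and a real scorer would almost certainly be fuzzier than string equality.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    """One eval entry: an input plus the reference points around it (hypothetical schema)."""
    case_id: str
    input_text: str
    acceptable: tuple[str, ...]          # outputs you would accept
    wrong: tuple[str, ...] = ()          # known-bad outputs, recorded for review
    catastrophic: tuple[str, ...] = ()   # outputs that must never ship


def _norm(text: str) -> str:
    # Collapse whitespace and case so trivial formatting does not change the score.
    return " ".join(text.split()).casefold()


def score(case: EvalCase, output: str) -> float:
    """Toy scorer: 1.0 if acceptable, -1.0 if catastrophic, 0.0 for anything else."""
    out = _norm(output)
    if out in {_norm(c) for c in case.catastrophic}:
        return -1.0
    if out in {_norm(a) for a in case.acceptable}:
        return 1.0
    return 0.0  # listed as wrong, or simply not recognised


def run_eval(cases: Iterable[EvalCase], system: Callable[[str], str]) -> dict[str, float]:
    """Run a candidate system over every case and summarise the scores."""
    scores = [score(case, system(case.input_text)) for case in cases]
    if not scores:
        raise ValueError("the eval set is empty")
    return {
        "mean_score": sum(scores) / len(scores),
        "catastrophic": float(sum(1 for s in scores if s < 0)),
    }
```

The point of the sketch is the order of work: `EvalCase` and `score` exist before any `system` does, and `system` is just a callable you plug in later.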
Why This Is Harder Than TDD
In classical TDD, a test is usually deterministic. Given an input, the function either returns the expected output or it does not. The cost of writing the test is a few minutes; the test then runs in milliseconds, forever.
In eval-driven development, the test is almost never deterministic. The model is probabilistic. The acceptable output is a set, not a point. The scoring function is often itself a model, with its own failure modes. The evaluation can take seconds per example and rupees per run. Writing a good eval is itself a substantial engineering effort, sometimes larger than writing the system being evaluated.
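One way to picture the difference: instead of asserting on a single output, you sample the system several times and measure how often the output lands inside the acceptable set. The sketch below assumes a hypothetical `generate` callable and a hand-written acceptable set; the stub model stands in for a real, non-deterministic one.

```python
import random
from typing import Callable

# Hypothetical acceptable set for one extraction question; a set, not a single point.
ACCEPTABLE = {"inr 4500", "inr 4,500", "rs. 4500"}


def is_acceptable(output: str) -> bool:
    """Membership test against the acceptable set, ignoring case and outer whitespace."""
    return output.strip().casefold() in ACCEPTABLE


def pass_rate(
    generate: Callable[[str], str],
    prompt: str,
    accept: Callable[[str], bool],
    n_samples: int = 20,
) -> float:
    """Sample the system n_samples times and return the fraction of accepted outputs."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    passes = sum(1 for _ in range(n_samples) if accept(generate(prompt)))
    return passes / n_samples


def make_stub_model(seed: int, p_good: float = 0.8) -> Callable[[str], str]:
    """Stand-in for a probabilistic model: usually right, sometimes not."""
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        return "INR 4,500" if rng.random() < p_good else "amount not found"

    return generate
```

With a real model, each call to `generate` costs time and money, which is exactly why the sample count becomes a budgeting decision rather than an afterthought.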
This is why almost no one does it. It is hard, expensive, and produces no immediately visible product. It is also, when done properly, the single highest-leverage activity in the AI engineering process, because every subsequent decision (model choice, prompt structure, retrieval strategy, fine-tuning data, deployment threshold) collapses from a debate to a measurement.
The Bengaluru Document-Intelligence Case
I worked through this with a team building a document intelligence layer for an Indian logistics company that processes a few hundred thousand shipping documents a month. The team had spent six weeks tuning their extraction prompts. They had three engineers running parallel experiments, comparing outputs by eye. Progress was real but unmeasurable. When two engineers disagreed about which version was better, the argument was unwinnable because there was nothing to point at.
We paused for ten days. The team built a corpus of three hundred representative documents, with structured ground truth for the fields that mattered. They built a scoring harness that ran in about twenty minutes end-to-end. From that point forward, every change to the prompt, the retrieval, or the model produced a number. The arguments stopped. The pace of real improvement, measured by that number, went up by roughly a factor of four over the next quarter. Two months later, when a major model upgrade dropped, the decision of whether to migrate took an afternoon instead of a fortnight, because they ran the eval and read the number.
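I cannot reproduce the team's harness here, but a scoring harness of this kind can be sketched in a few lines. The field names, the normalisation rule, and the exact-match comparison below are my assumptions for illustration, not what the team actually built.

```python
from __future__ import annotations

from statistics import mean
from typing import Callable, Mapping, Optional, Sequence

Fields = Mapping[str, str]            # field name -> extracted value
Extractor = Callable[[str], Fields]   # document text -> extracted fields


def _norm(value: Optional[str]) -> str:
    # Treat a missing field as empty; ignore case and whitespace differences.
    return " ".join(("" if value is None else str(value)).split()).casefold()


def field_accuracy(predicted: Fields, truth: Fields) -> float:
    """Fraction of ground-truth fields that the extractor got exactly right."""
    if not truth:
        raise ValueError("ground truth has no fields")
    hits = sum(
        1 for name, expected in truth.items()
        if _norm(predicted.get(name)) == _norm(expected)
    )
    return hits / len(truth)


def corpus_score(extract: Extractor, corpus: Sequence[tuple[str, Fields]]) -> float:
    """Mean per-document field accuracy over (document text, ground truth) pairs."""
    return mean(field_accuracy(extract(text), truth) for text, truth in corpus)


def compare(old: Extractor, new: Extractor, corpus: Sequence[tuple[str, Fields]]) -> float:
    """Score difference (new minus old) on the same corpus; positive means new is better."""
    return corpus_score(new, corpus) - corpus_score(old, corpus)
```

A migration decision then reduces to calling `compare(current_extractor, upgraded_extractor, corpus)` and reading the sign and size of the result.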
Eval-Driven Development Is Not Just More Tests
It is tempting to read this as "write more tests for your AI." That is not what it is. The shift is deeper. In eval-driven development, the evaluation suite is the specification. The system being built is a candidate implementation of that specification. This inverts the usual relationship between code and tests. The tests are primary. The code is contingent.
This matters because in AI engineering, the code is going to change underneath you constantly. Models will be upgraded. Providers will deprecate the models you depend on. Prompts will be rewritten. Architectures will be replaced. The only stable artifact across all of this churn is the evaluation suite, because the evaluation suite encodes what you were trying to do, which does not change as fast as how you do it.
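In code, the inversion might look like this: the spec is a fixed object with its own acceptance threshold, and anything that satisfies a small interface can be submitted to it. The names `Candidate`, `EvalSpec`, `PromptV1`, and `PromptV2` are hypothetical, and the currency-code task is a toy chosen only to keep the sketch short.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class Candidate(Protocol):
    """Anything that can answer a question; prompts, chains, and models all qualify."""
    def answer(self, question: str) -> str: ...


@dataclass(frozen=True)
class EvalSpec:
    """The stable artifact: examples plus the bar a candidate must clear."""
    examples: tuple[tuple[str, str], ...]   # (input, expected output) pairs
    min_accuracy: float

    def accuracy(self, candidate: Candidate) -> float:
        hits = sum(1 for q, expected in self.examples if candidate.answer(q) == expected)
        return hits / len(self.examples)

    def accepts(self, candidate: Candidate) -> bool:
        return self.accuracy(candidate) >= self.min_accuracy


class PromptV1:
    """First candidate: upper-cases the input but forgets to trim it."""
    def answer(self, question: str) -> str:
        return question.upper()


class PromptV2:
    """Replacement candidate: trims, then upper-cases."""
    def answer(self, question: str) -> str:
        return question.strip().upper()


# Toy spec: normalise currency codes. The spec outlives both candidates.
SPEC = EvalSpec(examples=(("inr", "INR"), (" usd", "USD")), min_accuracy=0.9)
```

Neither candidate class knows about the other, and the spec does not care which one it is handed. Replacing `PromptV1` with `PromptV2` changes the code under test while `SPEC` stays untouched.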
Why Indian Teams Should Adopt This Now
Indian AI teams are at a particular inflection point. Many are building their second or third AI product, having shipped their first one in a hurry without evals and now living with the consequences. The pain of un-evaluated production is now widely felt. The next wave of products being designed in early 2026 has an opportunity to adopt eval-driven development from day one, and the teams that do will build a compounding advantage that the previous wave cannot easily match.
There is also a regulatory tailwind. The RBI, IRDAI, and the emerging AI safety frameworks being discussed at MeitY all point toward a future in which Indian companies will be asked to demonstrate that their AI systems behave as claimed. The team that has been doing eval-driven development for two years will produce that demonstration in a week. The team that has not will spend a year retrofitting evaluations onto systems whose original specifications were never written down.
The Action
On your next AI feature, do not write the prompt first. Write the eval first. Spend a week on it. Resist the colleague who tells you this is overkill. Then build, and let the eval do the arguing. If you are leading a team, make eval-first the default for every new AI feature starting next sprint. The discipline barely exists today. The discipline will define the next five years.