Blog·Evaluation & Reliability·No. 084 / 132

The Demo-to-Deployment Cliff

A demo runs once under conditions you chose. A deployment runs a million times under conditions chosen by reality. The work that bridges them is not engineering. It is evaluation, and it is the hardest thing in the industry.


Walk through the AI conference circuit in Bengaluru, Hyderabad, or Mumbai over the last eighteen months and you will see the same arc repeat. A startup or enterprise team demonstrates a system that summarises court judgments, or screens loan applications, or answers crop queries in Marathi. The demo works. Executives nod. A press release is drafted. Eight months later, the system is either quietly retired or running in a degraded state that nobody talks about. The Boston Consulting Group, MeitY, and several Indian VCs have, at various points, put the share of AI proofs of concept that fail to reach or survive production at somewhere between 80 and 95 percent. The honest number, in the Indian context, is closer to the higher end.

This is not because Indian engineers are worse. They are not. It is because the work between demo and deployment is fundamentally different from the work that produced the demo, and almost no team is staffed or budgeted for it.

The Demo Population Versus the Production Population

A demo is a curated event. The inputs are selected. The model is warm. The reviewer is friendly. The environment is controlled. A deployment is none of these things. The inputs arrive from a user base whose composition you only partially understand. The model is one of many running concurrently. The reviewer is a frustrated user who will not give you a second chance. The environment includes flaky networks, malformed PDFs, and queries written at 2 a.m. by someone who has been on hold for an hour.

In an agriculture extension pilot I observed in a district in Karnataka, the demo to the agriculture department used eight curated farmer queries. The deployment, in its first week, received over forty thousand queries from farmers across three districts. Twelve thousand of those were in code-switched Kannada-English. About fifteen hundred were voice notes that the speech-to-text layer struggled with. Several hundred were photographs of pest damage that the multimodal model had never been evaluated on. The demo had not failed. The demo had simply never been the thing.

The Million-Times Problem

A demo runs once. A deployment runs a million times. This sounds obvious until you internalise what it implies. If your system is correct 95 percent of the time, and that accuracy holds in production, your deployment will produce fifty thousand wrong outputs per million queries. If those wrong outputs are diffuse and low-stakes, you might survive. If they cluster, as they almost always do, around a particular dialect, or a particular type of document, or a particular demographic, you have created a systematic harm that will eventually surface in a regulator's inbox or a journalist's column.
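To make the clustering point concrete, here is a minimal sketch of the arithmetic. The segment names, traffic shares, and per-segment accuracies are hypothetical, chosen only so that the blended accuracy comes out at 95 percent; they are not measurements from any deployment.

```python
# Hypothetical traffic mix: segment -> (share of queries, accuracy on that segment).
# These numbers are illustrative assumptions, not data from a real system.
SEGMENTS = {
    "standard_text": (0.70, 0.98),
    "code_switched": (0.25, 0.90),
    "voice_notes": (0.05, 0.78),
}


def expected_errors(segments, total_queries=1_000_000):
    """Expected number of wrong outputs per segment for a given query volume."""
    return {
        name: round(total_queries * share * (1.0 - accuracy))
        for name, (share, accuracy) in segments.items()
    }


def error_share(segments):
    """Fraction of all errors that falls on each segment."""
    errors = expected_errors(segments)
    total = sum(errors.values())
    return {name: count / total for name, count in errors.items()}


if __name__ == "__main__":
    for name, count in expected_errors(SEGMENTS).items():
        print(f"{name}: {count} wrong outputs per million queries")
    for name, share in error_share(SEGMENTS).items():
        print(f"{name}: {share:.0%} of all errors")
```

Under these assumed numbers the headline accuracy is still 95 percent, but code-switched queries, a quarter of the traffic, account for half of all errors. A single aggregate score hides exactly the concentration that matters.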

The work between a demo and a deployment is almost entirely evaluation, and it is the work that almost no one budgets for, staffs for, or respects.

The cliff between demo and deployment is where evaluation work lives. Real evaluation means stratified test sets that mirror your actual user distribution; adversarial probing for the failure modes you know about, plus a budget for discovering the ones you do not; latency and cost profiling under realistic load; regression suites that catch silent degradation when you switch models; and a feedback loop that brings real production failures back into the test corpus weekly, not annually.
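As one illustration of the first item, a stratified test set can be drawn from logged queries so that each stratum's share matches the production mix. This is a sketch under assumptions: `stratified_sample`, `stratum_of`, and `production_mix` are hypothetical names, and the mix is assumed to have been estimated separately from production logs.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, production_mix, n, seed=0):
    """Draw about n records so each stratum's share matches production_mix.

    records: list of logged queries (any objects).
    stratum_of: function mapping a record to its stratum label.
    production_mix: dict of stratum label -> share of production traffic.
    """
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible

    # Group the available records by stratum.
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[stratum_of(record)].append(record)

    sample = []
    for stratum, share in production_mix.items():
        pool = by_stratum.get(stratum, [])
        want = round(n * share)
        # Fail loudly if a stratum is under-covered instead of silently
        # shipping a test set that under-represents it.
        if len(pool) < want:
            raise ValueError(f"stratum {stratum!r}: need {want}, have {len(pool)}")
        sample.extend(rng.sample(pool, want))
    return sample
```

Raising an error on a thin stratum is a deliberate choice in this sketch: a shortage of, say, voice-note examples is itself a finding, and it is better surfaced at build time than discovered in production.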

What Indian Teams Get Wrong at the Edge

Three patterns recur across the Indian deployments I have seen fail at the cliff.

First, the team treats evaluation as a phase, not a discipline. They allocate two weeks at the end of the build cycle to write some tests, run them, and declare victory. Evaluation is not a phase. It is a parallel function that runs from the first day of the build to the last day of the deployment.

Second, the team conflates accuracy with usefulness. A judgment summarisation system can be 90 percent accurate by some token-overlap metric and still be useless to a junior advocate who needs to know whether the precedent applies to her client. Useful is a much harder target than accurate, and only domain users can tell you when you have hit it.

Third, the team has no story for distribution shift. The model that worked in October will be re-trained or replaced by January. The user base that existed in October will have grown and diversified by January. No one is tracking either. By March, the system is silently worse than it was at launch, and no one knows because no one is measuring.
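Tracking both kinds of shift does not need heavy tooling to start. The sketch below assumes you already log a per-segment traffic mix and per-segment accuracy at launch and at each later checkpoint; the function names, the segment labels, and the 0.02 tolerance are hypothetical choices for illustration. It uses total variation distance for the traffic mix, which is one simple option among several.

```python
def mix_shift(baseline_mix, current_mix):
    """Total variation distance between two traffic mixes.

    Returns 0.0 for identical mixes and 1.0 for mixes with no overlap.
    Segments missing from one side are treated as having zero share.
    """
    segments = set(baseline_mix) | set(current_mix)
    return 0.5 * sum(
        abs(baseline_mix.get(s, 0.0) - current_mix.get(s, 0.0)) for s in segments
    )


def regressions(baseline_acc, current_acc, tolerance=0.02):
    """Segments whose accuracy dropped by more than `tolerance` since baseline.

    Returns a dict of segment -> (baseline accuracy, current accuracy).
    A segment that is no longer measured at all is also flagged, with None
    as its current accuracy.
    """
    flagged = {}
    for segment, before in baseline_acc.items():
        after = current_acc.get(segment)
        if after is None or before - after > tolerance:
            flagged[segment] = (before, after)
    return flagged


if __name__ == "__main__":
    # Illustrative numbers only.
    october_mix = {"standard_text": 0.8, "code_switched": 0.2}
    january_mix = {"standard_text": 0.6, "code_switched": 0.3, "voice_notes": 0.1}
    print("traffic mix shift:", round(mix_shift(october_mix, january_mix), 3))

    october_acc = {"standard_text": 0.95, "code_switched": 0.90}
    january_acc = {"standard_text": 0.94, "code_switched": 0.80}
    print("regressed segments:", regressions(october_acc, january_acc))
```

Run on a schedule, two checks like these turn "silently worse" into an alert: one fires when the user base has moved away from what the test set represents, the other when a model swap has degraded a segment that the aggregate score still hides.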

The Civic AI Case

Consider the civic AI deployments that have begun appearing in Indian municipal governments: grievance routing systems, scheme eligibility chatbots, and document drafting assistants for clerks. These are some of the most consequential AI deployments in the country, because the user is often a citizen with limited recourse if the system fails her. They are also, on average, some of the least well-evaluated, because the procurement processes that surround them treat the AI as software, and software procurement evaluates features, not behaviour.

Bridging this gap is not glamorous work. It does not produce press releases. It produces, instead, a system that is still working correctly in its eighteenth month, which is the only thing that should matter.

The Action

If you are about to take an AI proof of concept to production, do this first: define ten failure modes you are willing to be responsible for, and ten you are not, and build evaluations for all twenty before you ship. If you are a buyer of AI systems, demand to see the evaluation suite, not the demo. If you are an investor, ask the team what their production-failure-feedback loop looks like, and watch their face. The cliff is real, but it is also crossable, and the teams that learn to cross it will own the next decade of Indian AI.
