In late 2024, when a particular open-source coding model became briefly fashionable, its benchmark scores were excellent. The HumanEval numbers were near state-of-the-art. The MMLU scores were respectable. By every conventional measure, this was a model that should have been near the top of any deployer's shortlist. Yet within three weeks of release, a quiet consensus emerged among practitioners actually using it for production code: it was clever on toy problems and brittle on real ones. It hallucinated import statements. It misunderstood context boundaries. It was, in the language of the people who had to live with it, not good.
No benchmark surfaced any of this. A benchmark measures performance on a fixed test set. The qualitative shape of how a model behaves across the irregular surface of real work is something only a community of users can perceive, and only over time.
This is the central thesis: the most reliable signal of AI quality is not a metric. It is a community of practitioners who have used the system long enough to develop a shared sense of its shape. We have not built the infrastructure for that community in India. We urgently should.
Why Metrics Are Insufficient
A metric is a compressed representation of system behavior. It takes a high-dimensional reality, the model's performance across an enormous space of possible inputs, and projects it onto a single number. The projection is lossy by definition. What is lost depends on what the metric chose to keep, and a small change in what the metric keeps can produce a large change in what it shows.
This is fine when the use case is narrow and the input distribution is well-characterised. It is dangerous when the use case is open-ended, as most LLM use cases are. A 92 percent score on a benchmark tells you almost nothing about whether the model will perform well on your particular workload, because your workload is not the benchmark.
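To make the lossy projection concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two models, the slice names, the pass rates, and the choice of an unweighted mean as the headline metric are all invented, not measurements of any real system. It shows two hypothetical models that collapse to the same headline score while behaving very differently on one slice of work.

```python
from statistics import mean

# Hypothetical per-slice pass rates for two imaginary models.
# Slice names and numbers are invented for illustration only.
model_a = {
    "toy_problems": 0.92,
    "multi_file_context": 0.92,
    "import_resolution": 0.92,
    "code_switched_prompts": 0.92,
}
model_b = {
    "toy_problems": 1.00,
    "multi_file_context": 1.00,
    "import_resolution": 1.00,
    "code_switched_prompts": 0.68,
}


def headline_score(per_slice):
    """Project per-slice results onto one number (assumed: unweighted mean)."""
    return round(mean(per_slice.values()), 2)


def worst_slice(per_slice):
    """Return the slice name with the lowest pass rate."""
    return min(per_slice, key=per_slice.get)


for name, model in [("model_a", model_a), ("model_b", model_b)]:
    weakest = worst_slice(model)
    print(name, headline_score(model), weakest, model[weakest])
```

Both models report the same headline number, but a deployer whose workload is mostly the weakest slice would have a very different experience with the second one. The single number kept the average and discarded the shape.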
Communities Aggregate Soft Signal
A community of practitioners aggregates a different kind of signal. Each member has used the system in a slightly different way, encountered slightly different failure modes, developed slightly different intuitions about when to trust it and when not to. When they compare notes, in Slack channels, at meetups, in code reviews, they triangulate something no individual member could see alone.
This is not new. Software developers have always relied on communal knowledge: which library is actually reliable, which tool's documentation lies, which database has a sharp edge the manual does not mention. The collective wisdom of a developer community is the most reliable quality signal in software, because it has been tested against reality, by many people, in many ways. AI systems need this kind of signal more than traditional software does, because their behavior is more probabilistic and the failure surface is larger.
The Indian Specificity
The communities that matter for Indian AI quality cannot be borrowed wholesale from elsewhere. A practitioner community in San Francisco has, structurally, no exposure to how a model handles Hindi-English code-switching, or transliterated Tamil, or the legal-document patois of an Indian magistrate's court. They cannot generate the signal we need, however excellent their work on other axes is.
We need Indian practitioner communities for Indian AI quality, structured around the deployment domains that matter here: fintech, healthcare, agriculture extension, civic services, education, legal services. Each of these has a population of practitioners working with AI systems mostly in isolation. They are accumulating exactly the kind of qualitative knowledge that, if pooled, would be a national asset. Almost none of it is being pooled.
What Pooling Looks Like
Pooling requires three things. A trusted forum where practitioners can compare notes without commercial conflict. A light convention for describing observations (input, output, expected, actual, context, frequency) so that they are comparable. And a community ethic that prizes honest negative signal as much as positive testimonials.
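The light convention could be as small as one shared record shape. The sketch below is one possible reading, not a proposed standard: the six field names come from the list above, but their interpretations, the `Observation` class, the `failures_by_context` helper, and the sample report are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One practitioner report. Field meanings are assumptions."""
    input: str      # what was given to the system
    output: str     # what the system returned, verbatim
    expected: str   # the result the practitioner needed
    actual: str     # the result as the practitioner read it
    context: str    # deployment domain and task
    frequency: str  # how often the pattern shows up, in the reporter's words

    @property
    def is_failure(self) -> bool:
        # Assumed rule: a report is negative signal when the
        # needed result and the observed result differ.
        return self.expected != self.actual


def failures_by_context(observations):
    """Count negative reports per context so patterns become visible."""
    return Counter(o.context for o in observations if o.is_failure)


# Invented sample report, for illustration only.
reports = [
    Observation(
        input="Total: ১২৫০",
        output="The total is 1205.",
        expected="1250",
        actual="1205",
        context="fintech invoice parsing",
        frequency="roughly 1 in 20 invoices with Bengali numerals",
    ),
]

print(failures_by_context(reports))
```

Because every report carries the same fields, observations from different practitioners can be counted and compared without anyone having to reinterpret someone else's notes.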
The third is the hardest. The default mode of much of the AI conversation, including in India, is promotional. People announce wins. People do not announce that the model they have been using for three months has a failure pattern in Bengali numerals, because announcing this feels disloyal to a tool they are otherwise relying on. We have to change this norm. The disloyalty is to the next deployer, the next user, the next regulator who will be misled by the silence.
Imagine a community of one hundred and fifty Indian doctors in Tier-2 and Tier-3 cities using an AI clinical decision support tool. They share, monthly, the cases where the tool was wrong and the cases where it was usefully right. The aggregated record is available, with proper anonymisation, to the next hospital evaluating adoption. The quality signal would dwarf any benchmark, because it would be grounded in real clinical decisions in the actual deployment context. Several professional associations in India already have the membership and trust to host such communities. What is missing is deliberate construction of the feedback loop, and the cultural permission to be publicly honest about what is and is not working.
The Action
If you use an AI system in your professional work, find or build the community of people who use it in similar contexts. Share what you observe. If you build an AI system, treat the community of users as your primary quality signal and your benchmark as a sanity check, not the other way around. If you are part of Bharath.CLUB, this is exactly the kind of infrastructure the community can build. A metric is a number. A community is a chord, and chords carry more information than any single note ever can.