When a frontier lab releases a new model, it ships the model with a model card. The model card lists the training data at a high level, the benchmarks the model was evaluated on, the safety mitigations applied, and the intended use cases. Model cards are useful. They are also, in a deep sense, the wrong artifact, because they describe what a model is supposed to do. The artifact we need is one that describes what the model actually does in the messy environments we deploy it into. That artifact is the behavior log, and it does not yet exist at the scale and quality that Indian AI requires.
A behavior log is a community-maintained, structured record of how a given model behaves on the specific inputs that matter to the community using it. Not the inputs the lab chose for its model card. The inputs that arrive in a fintech support queue in Mumbai, an agricultural query system in Punjab, a legal drafting tool in Delhi, a triage line in a Pune municipal hospital. The behavior log captures, for each notable input pattern, what the model produces, how often it produces something acceptable, how often it produces something concerning, and how this has changed over time as the model has been silently updated by the provider.
Why Model Cards Are Not Enough
A model card is produced once, by the entity with the most incentive to make the model look good. The benchmarks chosen are public benchmarks that the lab is comfortable competing on. The failure modes disclosed are the failure modes that have been studied. The use cases described are aspirational.
None of this is dishonest. It is, however, structurally insufficient for an Indian deployer who needs to know how a model behaves on Marathi medical terminology, or on transliterated Tamil customer queries, or on hand-filled forms scanned at angles. The lab has not tested any of this. The lab cannot test all of this, because the surface area is too large and is owned by the deployers, not the lab. The behavior log is the place where that ground-level knowledge accumulates.
What a Useful Behavior Log Contains
A behavior log entry is not a tweet that says the model is bad. It is a structured artifact. At minimum, it includes: the model and version, the input pattern, several representative examples, the observed outputs, the rate at which the problem occurs over a meaningful sample size, the date observed, and ideally a reproducible test case that anyone can run.
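The minimum fields listed above can be sketched as a small record type. This is one possible shape, not an established schema: the class name, the field names, and the sample values below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class BehaviorLogEntry:
    """One structured observation about a model (field names are illustrative)."""
    model: str                         # model name as the provider labels it
    version: str                       # version or snapshot identifier
    input_pattern: str                 # short description of the input pattern
    examples: list[str]                # several representative inputs
    observed_outputs: list[str]        # what the model produced for those inputs
    problem_count: int                 # samples in which the problem occurred
    sample_size: int                   # total samples tested
    observed_on: date                  # date the behavior was observed
    repro_case: Optional[str] = None   # pointer to a test case anyone can run

    def problem_rate(self) -> float:
        """Fraction of tested samples in which the problem occurred."""
        if self.sample_size <= 0:
            raise ValueError("sample_size must be positive")
        return self.problem_count / self.sample_size


# Made-up values, purely to show the shape of an entry.
entry = BehaviorLogEntry(
    model="example-model",
    version="example-version",
    input_pattern="transliterated Tamil customer queries",
    examples=["<representative input 1>", "<representative input 2>"],
    observed_outputs=["<observed output 1>", "<observed output 2>"],
    problem_count=12,
    sample_size=200,
    observed_on=date(2024, 1, 15),
)
```

Storing the raw counts rather than a precomputed percentage keeps the sample size visible, so a reader can judge whether the rate is based on a meaningful number of samples.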
A good behavior log distinguishes between drift, where the model's behavior on a given input changes over time, and discovered behavior, where the behavior was always there but no one had probed in the right place. It distinguishes between failures that are statistically rare but catastrophic, and failures that are common but minor. It captures the context of deployment, because a behavior that is acceptable in one context might be unacceptable in another.
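The two distinctions above can be made mechanical once observations are dated. The following is a rough sketch under stated assumptions: the function names, the labels, and both thresholds (5% and 10%) are arbitrary choices for illustration, and a real log would need thresholds agreed for its own deployment context.

```python
from datetime import date
from typing import NamedTuple


class Observation(NamedTuple):
    observed_on: date
    problem_count: int
    sample_size: int


def classify_change(history: list[Observation], threshold: float = 0.05) -> str:
    """Label one input pattern as 'acceptable', 'drift', or 'discovered'.

    Assumption: a problem rate above `threshold` counts as a problem.
    """
    if not history:
        raise ValueError("need at least one observation")
    ordered = sorted(history, key=lambda o: o.observed_on)
    rates = [o.problem_count / o.sample_size for o in ordered]
    if rates[-1] <= threshold:
        return "acceptable"
    # The problem is present now. If an earlier probe came back clean,
    # the behavior changed over time.
    if any(r <= threshold for r in rates[:-1]):
        return "drift"
    # Every probe on record shows the problem: it was there from the
    # first time anyone looked in this place.
    return "discovered"


def triage(rate: float, catastrophic: bool, common_above: float = 0.10) -> str:
    """Separate how often a failure happens from how bad it is."""
    frequency = "common" if rate > common_above else "rare"
    severity = "catastrophic" if catastrophic else "minor"
    return f"{frequency}, {severity}"
```

Note that "discovered" here only means no clean probe is on record. It cannot prove the behavior was always present, which is one reason dated entries matter.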
The LMSYS Lesson, Indianised
The community-organised model evaluation infrastructure that emerged internationally (the LMSYS arena, the open eval leaderboards, the various open-source benchmark consortia) demonstrated that distributed, community-driven evaluation can produce signal that no single lab can produce alone. India needs an analogous infrastructure, but with a few important differences.
First, the inputs that matter for Indian deployments are not the inputs that matter for the global arena. We need evaluation pools that reflect Indian languages, Indian document types, Indian dialects of professional discourse, and Indian regulatory contexts. Second, the contributors need to come from the actual user populations, not just from a tech-literate diaspora. Third, the artifacts need to be in formats that Indian regulators and Indian deployers can actually use in audits, not just in academic papers.
Some of this is already starting. There are nascent community efforts to build Indic-language evaluation suites at AI4Bharat and adjacent groups. There are independent practitioners maintaining private behavior logs for the models they deploy on behalf of clients. There are subreddits and Slack groups where deployers compare notes. The pieces exist; they are not yet connected.
Behavior Logs as Civic Infrastructure
Consider what it would mean for a state government procuring an AI-assisted eligibility system for a welfare scheme to have access to a behavior log for the candidate model. They could see how that model has handled name variants, address formats, and document scans in the actual deployments it has been used in over the past eighteen months. They could see the failure rate on scribbled, left-handed signatures. They could see whether it confuses similar village names in their district. None of this is in the model card. All of it should be in a behavior log.
This is the kind of infrastructure that does not get built by a single company because no single company captures its full value. It gets built by a community, with the support of public institutions, by people who understand that an evaluated model is a civic asset and an unevaluated model is a civic risk.
The Action
If you deploy a model in production, start a behavior log this week. It does not need to be public. Just structured, dated, and reproducible. If your organisation deploys multiple models, designate one engineer as the keeper of the behavior log. If you are part of a community of practitioners, and Bharath.CLUB is one such community, contribute the entries you can to a shared pool. The model card is the lab's story about its model. The behavior log is your story about what that model did to your users. Only one of those is going to matter to a regulator in three years, and it is not the lab's.
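A private log that is structured, dated, and reproducible can start as a single append-only file. The sketch below assumes a JSON Lines file, one entry per line; the function names and the `observed_on` key are hypothetical choices for illustration, not an existing tool or standard.

```python
import json
from datetime import date
from pathlib import Path


def append_entry(log_path: Path, entry: dict) -> None:
    """Append one entry as a JSON line, stamping today's date if none is given."""
    record = dict(entry)  # copy so the caller's dict is not modified
    record.setdefault("observed_on", date.today().isoformat())
    with log_path.open("a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Indic-script examples readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_entries(log_path: Path) -> list[dict]:
    """Read every entry back, skipping blank lines."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

An append-only text file works with ordinary version control, which gives the log a dated history without extra tooling.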