The Model Card is a Lie, Bharath.club

Most major foundation models now ship with a model card. The card is a document, usually a few pages long, that describes what the model is, what it was trained on, what it is intended for, what its known limitations are, and what kinds of misuse it is not built to handle. Model cards were a genuine advance over the previous state of the world, in which models shipped with no documentation at all, and they should be appreciated for that. On their own, though, they are woefully insufficient as a guide to how the model actually behaves once it is deployed, and the gap between what the card claims and what the model does is one of the quietest failures in the current AI stack.

The trouble is structural. A model card is written by the team that built the model. The team is honest, mostly, but it has incentives. It cannot test every distribution shift. It cannot anticipate every real-world deployment. It will, even with the best of intentions, produce a card that reflects the model's behaviour in the test conditions the team chose to run, not the model's behaviour in the field. The card is, by construction, an upstream artifact. What is needed downstream, and what almost nobody produces at scale, is a behaviour log, maintained by the practitioners who actually use the model, tracking what it does in their contexts over time.

The card cannot know

There are categories of model behaviour that the card simply cannot know about. The card cannot know how the model behaves on Bhojpuri code-mixed input from a small-town user typing on a four-inch keyboard. It cannot know how the model handles the specific phrasing that district court lawyers in Maharashtra use to summarize a brief. It cannot know how the model performs on patient complaints described in the kind of mixed clinical-Marathi register that doctors actually encounter in outpatient departments (OPDs). None of these are obscure conditions. All of them are common in real Indian deployments. None of them are in the card.

The reason is not laziness. The reason is that the card is written upstream, by people who have never sat in a Marathi OPD, and could not possibly have run the relevant evaluations even if they wanted to. The behaviour in these contexts is discoverable only by people who are in those contexts. Those people are not, today, organized in a way that allows their discoveries to accumulate, be shared, or be acted on. So the discoveries die on the ground where they were made, and the next deployer of the same model walks in blind.

The card knows what the team tested. The team tested what they could imagine. They could not imagine your context. The card is therefore, in your context, unverified.

What a behaviour log actually is

A behaviour log is a structured record of what a model did, in a given context, on a given input, at a given time. Each entry records the time, the model and version, the input (or its sanitized form), the output, the user-domain context, and an assessment: was the output acceptable, marginally acceptable, or unacceptable, and why? Behaviour logs accumulate, in good faith, over months and years. A model's behaviour log, after a year of community use, contains more information about its real-world performance than the model card ever could.
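One way to make that entry concrete is a small record type. The following is a minimal sketch, not a proposed standard: the class name `LogEntry`, its field names, and the example values are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum


class Assessment(Enum):
    """The three assessment levels named in the text."""
    ACCEPTABLE = "acceptable"
    MARGINAL = "marginally_acceptable"
    UNACCEPTABLE = "unacceptable"


@dataclass(frozen=True)
class LogEntry:
    """One behaviour log entry. Field names are hypothetical."""
    timestamp: str          # when the model was used, ISO 8601
    model: str              # model name
    version: str            # model version
    sanitized_input: str    # the input, with private details removed
    output: str             # what the model returned
    context: str            # user-domain context
    assessment: Assessment  # acceptable / marginal / unacceptable
    reason: str             # why the assessor chose that level

    def to_json(self) -> str:
        """Serialize to one JSON object, keeping non-ASCII text readable."""
        record = asdict(self)
        record["assessment"] = self.assessment.value
        return json.dumps(record, ensure_ascii=False, sort_keys=True)


# A made-up entry, for illustration only.
example = LogEntry(
    timestamp=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc).isoformat(),
    model="example-model",
    version="1.0",
    sanitized_input="Summarize this outpatient complaint: [TEXT REMOVED]",
    output="[MODEL OUTPUT]",
    context="clinical notes, mixed Marathi-English",
    assessment=Assessment.UNACCEPTABLE,
    reason="dropped the symptom duration",
)
print(example.to_json())
```

Keeping the assessment as a fixed set of values, with the free-text reason in its own field, is what makes entries from different contributors comparable later.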

The logs are most valuable when they are shared across deployers. A single hospital chain's log of how the model handled clinical notes is useful internally. Twenty hospital chains' logs, pooled and analysed, are diagnostic for the entire health sector. Multiplied across sectors and languages, a national-scale behaviour log network would be the most powerful piece of AI evaluation infrastructure any country could have. Today, no country has one.

Why this has not been built

It has not been built because behaviour logs require trust. Deployers do not want to share their failures publicly. Vendors do not want their models' failures aggregated and reported. Regulators do not want to be put in the position of acting on logs they did not generate. Everyone has reasons. The cumulative effect is that the most useful piece of post-deployment AI evaluation infrastructure remains unbuilt, while every party complains about not having the information.

This is exactly the kind of public-good problem that is solved by a neutral, member-driven community. A community of practitioners can host behaviour logs in a way that respects privacy (the inputs are sanitized, the deployers anonymized) and aggregates the signal (the model's drift across deployments is visible in pooled form). The community has no commercial stake in any single model. It has every stake in honest, accumulating, decision-useful evaluation. The structure aligns where the market does not.
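The privacy-respecting pooling described above can be sketched in a few lines. This is an illustration under stated assumptions, not a real scrubber: the two regular expressions catch only email addresses and long digit runs, the function names `sanitize`, `pseudonym`, and `pool` are hypothetical, and the salt is assumed to be a secret held by the community that hosts the log.

```python
import hashlib
import re

# Illustrative patterns only; a real sanitizer would need far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\d{6,}")  # phone numbers, record IDs and similar


def sanitize(text: str) -> str:
    """Mask email addresses and long digit runs in an input."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def pseudonym(deployer: str, salt: str) -> str:
    """Replace a deployer name with a stable, salted hash prefix."""
    digest = hashlib.sha256((salt + deployer).encode("utf-8")).hexdigest()
    return digest[:12]


def pool(logs_by_deployer: dict, salt: str) -> list:
    """Merge per-deployer logs into one list with names and inputs masked."""
    pooled = []
    for deployer, entries in logs_by_deployer.items():
        tag = pseudonym(deployer, salt)
        for entry in entries:
            pooled.append({**entry, "deployer": tag, "input": sanitize(entry["input"])})
    return pooled


# Made-up logs from two hypothetical deployers.
logs = {
    "Hospital A": [{"model": "example-model", "input": "Patient phone 9876543210", "assessment": "unacceptable"}],
    "Hospital B": [{"model": "example-model", "input": "Reply to a.b@example.com", "assessment": "acceptable"}],
}
pooled = pool(logs, salt="community-secret")
```

Because the pseudonym is stable, drift within one deployer stays visible in the pooled data without the deployer being named.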

Who reads the log

A behaviour log is most useful to four audiences. First, deployers, who want to know whether to switch models or add specific safeguards. Second, procurement officers, who want evidence that supports a buy-or-skip decision. Third, regulators, who want trend data, not anecdote, to inform policy. Fourth, the labs themselves, who want field signal that their internal testing cannot produce.

A well-maintained log serves all four, with the same data, presented differently. The deployers see use-case-specific summaries. The procurement officers see comparative tables across vendors. The regulators see drift, distribution, and risk indicators. The labs see specific failure modes that go into the next version's training. The information flow is the missing nervous system of the AI ecosystem.
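"The same data, presented differently" can be as simple as grouping one list of entries by different keys. The sketch below assumes entries are plain dictionaries with hypothetical field names (`date`, `model`, `context`, `assessment`, `reason`); the sample entries are made up.

```python
from collections import Counter


def unacceptable_rate(entries, key):
    """Share of unacceptable outputs per group, where key(entry) names the group."""
    totals, bad = Counter(), Counter()
    for entry in entries:
        group = key(entry)
        totals[group] += 1
        if entry["assessment"] == "unacceptable":
            bad[group] += 1
    return {group: bad[group] / totals[group] for group in totals}


def top_failure_reasons(entries, n=3):
    """Most frequent reasons given for unacceptable outputs."""
    reasons = Counter(e["reason"] for e in entries if e["assessment"] == "unacceptable")
    return reasons.most_common(n)


# Made-up entries, for illustration only.
entries = [
    {"date": "2024-01-10", "model": "model-a", "context": "clinical notes",
     "assessment": "acceptable", "reason": ""},
    {"date": "2024-01-22", "model": "model-a", "context": "legal briefs",
     "assessment": "unacceptable", "reason": "dropped a clause"},
    {"date": "2024-02-03", "model": "model-a", "context": "clinical notes",
     "assessment": "unacceptable", "reason": "misread code-mixed text"},
    {"date": "2024-02-05", "model": "model-b", "context": "clinical notes",
     "assessment": "unacceptable", "reason": "misread code-mixed text"},
]

by_context = unacceptable_rate(entries, key=lambda e: e["context"])  # deployer view
by_model = unacceptable_rate(entries, key=lambda e: e["model"])      # procurement view
by_month = unacceptable_rate(entries, key=lambda e: e["date"][:7])   # regulator view
lab_view = top_failure_reasons(entries)                              # lab view
```

Only the grouping key changes between the first three views, which is why one well-kept log can serve several audiences at once.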

What this looks like for India

In India, a community-maintained behaviour log network would have a specific shape. Sector-specific logs would be started by domain communities: doctors logging health AI, lawyers logging legal AI, teachers logging educational AI. Language-specific logs would be started by language communities: Tamil, Telugu, Bengali, Marathi, Hindi. These would be pooled, periodically, into a national-scale aggregate that is publicly summarized. The network would be operated by a neutral home like Eval.qa, in conjunction with the relevant professional communities through AI.Bharath.CLUB and its sectoral chapters.

None of this requires exotic technology. The hardest part is governance, and governance is exactly what a community is good at. The second hardest part is the slow, patient accumulation of contributors, and that is also a community function. The technical pieces (secure submission, sanitization, aggregation, public reporting) are routine.

A small action this week

If you deploy AI in any professional context, start a log. Even a simple text file will do: date, model, input (sanitized), output, your assessment. The cost is two minutes a day. The compound value, over a year of use, is enormous. Multiply by a community of practitioners, and the country has, for the first time, an honest picture of what its AI is doing.
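For anyone who would rather script the text file than type into it, here is a minimal sketch that appends one line per observation to a JSON Lines file. The function names `log_entry` and `read_log` are assumptions for illustration, as is the choice of JSON Lines; the five fields are the ones listed above.

```python
import json
from datetime import date
from pathlib import Path


def log_entry(path, model, sanitized_input, output, assessment, day=None):
    """Append one observation to the log file as a single JSON line."""
    record = {
        "date": (day or date.today()).isoformat(),
        "model": model,
        "input": sanitized_input,  # sanitize before logging, not after
        "output": output,
        "assessment": assessment,
    }
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_log(path):
    """Load every entry back as a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A call such as `log_entry("behaviour_log.jsonl", "example-model", "[sanitized prompt]", "[model output]", "unacceptable: wrong dosage unit")` adds one line to the file. Opening in append mode means earlier entries are never rewritten, so the file stays a plain chronological record.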

The model card is a starting point. The behaviour log is the finishing point. The Indian AI ecosystem will be measured, in retrospect, by whether it built the latter while it still had time. The window to start is now. The infrastructure is mostly social: a community willing to share, honestly, what its tools are doing. Bharath.CLUB is one place where that sharing can happen. The card is the lie. The log is the truth. Build the log.
