Blog·Wisdom-First AI·No. 091 / 132

Wisdom-First, Not Data-First

The default recipe for building AI in 2026 is still the same as in 2022: scrape more, filter less, hope the loss curve bends. We think that recipe is the problem.

The standard recipe for building a frontier model in 2026 still reads like a procurement order: acquire ten trillion tokens, filter for duplicates, train for six months, ship. What you get back is a model that has read every Reddit thread about Indian visas, every English-language news article about Indian elections, and almost nothing written by an Indian for an Indian about how to actually do a job in this country. That model is then deployed in Bengaluru, Patna, and Coimbatore, where the people using it learn, slowly, that it is confidently wrong about most things that matter to them.

Wisdom-first AI begins with a different question. Not how much can we scrape, but what is worth knowing. Not what is on the public internet, but what is in the head of a senior tehsildar in Chittoor who has handled land-records disputes for twenty-two years. Not what an English-language SEO farm wrote about diabetes, but what an endocrinologist at AIIMS Delhi has learned about treating Type 2 diabetes in a population with a different metabolic profile.

The shift is from corpus to curriculum.

The data-first dead end

When you train on scraped data, you are training on the median voice of whoever wrote most aggressively on the open web. For most professional domains in India, that voice is foreign, English, and shallow. A model trained this way will tell you, in fluent English, that the standard treatment protocol for X is Y, where Y is the American protocol. It will not tell you that the Indian Council of Medical Research guideline is different, because the ICMR guideline is in a PDF that nobody trained on.

This is not a small bug. It is the central failure of data-first AI when it touches Indian professional contexts. Lawyers find that the model cites overturned American precedent. Doctors find that it suggests drug combinations that ignore the Indian generic landscape. Farmers find that it recommends crops that need three times the water available in their district. The model is fluent, confident, and wrong in ways that take expert effort to detect.

Adding more scraped data does not fix this. It deepens it. The internet is not a neutral substrate; it is a sample, and the sample is biased toward the loudest, the most monetized, and the most foreign.

What wisdom-first means in practice

Wisdom-first does not mean smaller. It means differently organized. The training pipeline starts with curation: which sources are authoritative for which domain, who decides, how the decisions are recorded. It treats the inclusion of a document the way a serious library treats acquisition, as a decision with provenance.
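
To make "a decision with provenance" concrete, here is a minimal sketch of what recording a curation decision could look like. Every name in it (AcquisitionRecord, CurationLedger, the reviewer label, the dates) is hypothetical and chosen for illustration; it is not a description of any existing pipeline.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class AcquisitionRecord:
    """One curation decision: which source, for which domain, who decided, and why."""
    source: str
    domain: str
    decided_by: str
    decided_on: date
    included: bool
    rationale: str


@dataclass
class CurationLedger:
    """Append-only log of decisions; nothing is overwritten, so history stays auditable."""
    records: list[AcquisitionRecord] = field(default_factory=list)

    def record(self, entry: AcquisitionRecord) -> None:
        self.records.append(entry)

    def approved_sources(self, domain: str) -> list[str]:
        """Sources whose most recent decision for this domain is 'include'."""
        latest: dict[str, AcquisitionRecord] = {}
        for r in self.records:
            if r.domain != domain:
                continue
            prev = latest.get(r.source)
            # A later (or same-day, later-logged) decision supersedes an earlier one.
            if prev is None or r.decided_on >= prev.decided_on:
                latest[r.source] = r
        return sorted(s for s, r in latest.items() if r.included)


# Illustrative entries only; the reviewer and dates are placeholders.
ledger = CurationLedger()
ledger.record(AcquisitionRecord(
    "ICMR guidelines", "medicine", "reviewer-a", date(2026, 1, 10),
    True, "National guideline body for Indian clinical practice"))
ledger.record(AcquisitionRecord(
    "SEO health blogs", "medicine", "reviewer-a", date(2026, 1, 10),
    False, "No identifiable clinical author"))

print(ledger.approved_sources("medicine"))  # ['ICMR guidelines']
```

The point of the append-only design is that a reversed decision leaves a trail: anyone can see who included a source, who later excluded it, and the stated reason each time.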

The shift is from corpus to curriculum, from crawl to curation, from a model that has read everything to a model that has read the right things.

For an Indian agricultural assistant, this means starting with ICAR research papers, KVK extension notes, state agricultural university handbooks, and field interviews with seasoned officers from the Department of Agriculture. It means transcribing the Marathi conversations between extension workers and farmers in Vidarbha, and weighting that transcript as heavily as a thousand English blog posts. It means asking, before you train, whose knowledge a wise system would reflect.
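
One way to read "as heavily as a thousand English blog posts" is as a per-document sampling weight. The sketch below assumes a simple weighted-sampling scheme in which each document's chance of being drawn is its weight divided by the total; the document names and the function sampling_probabilities are hypothetical, and a real training mix would involve many more considerations.

```python
import math


def sampling_probabilities(weights: dict[str, float]) -> dict[str, float]:
    """Normalize per-document weights into sampling probabilities that sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {doc: w / total for doc, w in weights.items()}


# One curated field transcript, weighted 1000x, against 1000 blog posts at weight 1.
weights = {"vidarbha_transcript": 1000.0}
weights.update({f"blog_post_{i}": 1.0 for i in range(1000)})

probs = sampling_probabilities(weights)
blog_mass = sum(p for doc, p in probs.items() if doc.startswith("blog_post_"))

# The single transcript now carries the same total probability mass as all the posts.
print(math.isclose(probs["vidarbha_transcript"], blog_mass))  # True
```

The weight itself is the curation decision. The arithmetic is trivial; deciding that one transcript deserves a thousand-fold weight, and writing down why, is the work.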

For a legal assistant in India, it means the Supreme Court reports, the All India Reporter archives, the Law Commission of India recommendations, and the working notes of senior advocates who have agreed to share annotated case histories. It does not mean Law Stack Exchange, which is mostly American and mostly free legal advice from non-lawyers.

The discipline is not glamorous. It is the work of librarians, archivists, and domain experts, sitting with engineers and deciding what counts. It is slow. It scales linearly with human attention rather than exponentially with bandwidth. This is the trade we are arguing for.

Why this is the Indian advantage

India has something the data-first world does not: living institutional knowledge that has not yet been turned into bad blog posts. The senior officials of the IAS, the engineers at ISRO, the doctors at the national institutes, the lawyers at the bar councils, the agricultural scientists in the ICAR network: these people hold knowledge that is not online and largely will not be unless someone curates it deliberately.

That gap is, paradoxically, our edge. Building an Indian AI on top of foreign scraped data is a losing race. Building it on top of curated Indian wisdom is a race nobody else can run. The Tamil municipal records of the last forty years, the bilingual case files of the district courts, the agricultural extension archives in Kannada and Telugu: this is the corpus that does not yet exist as training data. Whoever assembles it carefully owns a generation of useful AI in India.

What we are asking you to do

If you are building an AI product for India, stop treating the foundation model as the moat. The foundation model is a commodity. The curated wisdom layer on top of it is not. Spend the money you would have spent on a slightly bigger model on a smaller, sharper corpus that actually reflects how your domain is practiced in India.

If you are a domain expert (a doctor, a lawyer, a senior administrator, a teacher), your annotated knowledge is now infrastructure. Find a project that is curating wisdom in your field and contribute, on terms that respect your work. The systems that will matter in India in 2028 are being seeded by the curation decisions of 2026.

Wisdom-first is not a slogan. It is a procurement instruction. Buy curation before you buy compute.
