Grading World Models

testing beliefs against reality

Jun 05, 2026

Suppose you are bitten by a rattlesnake. Would you rather be treated by a medical toxicologist, or by a faith healer who chews up herbs and spits them into the wound?

Because snake bites usually aren’t fatal, most patients survive either way. In noisy domains, almost anyone can claim some surviving patients. The question is not whether you can survive a bad process. The question is which model actually improves your odds.

In public reasoning, we choose the faith healer constantly. Most people form beliefs using heuristics such as narrative fit and social identity, which were good enough for natural selection. These heuristics may work for navigating social environments, but they aren’t suited to building accurate world models. Without external feedback loops, it’s too easy to reinterpret contrary evidence and explain away failures. The deeper problem is that most people have no reliable mechanism for even noticing when they’re mistaken.

The Exam

A story can be crafted to explain anything after the fact, and a theory that doesn’t make falsifiable predictions can survive indefinitely. Forecasting changes the incentive: beliefs are rewarded for being predictive rather than for sounding good.

When I’m teaching, I want my students to do well, but I also want them to comprehend the material. I can’t simply hand them the answer key to study from, or most of them would overfit to the test. The purpose of an exam is to find out whether they understand the material well enough to solve new problems.

World models should be judged the same way.

A model that only explains the past may be nothing more than a story fitted to the answer key. A model that anticipates new cases has captured something real. Forecasting demonstrates whether a world model holds up outside the training set.

Forecasting is an exam for world models. A strong track record is compressed evidence about whose beliefs are disciplined by reality. Top forecasters have demonstrated that their expectations are aligned with how the world actually is.

Top Performers

If forecasting is the exam reality gives to world models, forecasters who consistently outperform the crowd are the students at the top of the class.

Marginal advantages matter. An investor does not need to beat the market every day to have skill. Similarly, a good forecast usually isn’t “the crowd is at 20%, but I alone know it’s 100%.” More often, the advantage looks like slightly better calibration over time.

Forecast scoring separates people with reality-shaped models from the rest. These people have demonstrated one of the rarest and most valuable intellectual skills: the ability to form beliefs that reality vindicates.
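To make "forecast scoring" concrete, here is a minimal sketch using the Brier score, one common scoring rule. The choice of rule is my assumption, since nothing above commits to a specific one, and the outcomes and probabilities are invented purely for illustration. Lower scores are better.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    0.0 is a perfect score; always answering 0.5 scores 0.25.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical data: ten resolved yes/no questions (1 = happened, 0 = did not).
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Hypothetical crowd probabilities, and a forecaster who is only slightly
# sharper on each question rather than dramatically contrarian.
crowd = [0.60, 0.30, 0.20, 0.50, 0.40, 0.20, 0.30, 0.60, 0.30, 0.20]
forecaster = [0.70, 0.25, 0.15, 0.60, 0.30, 0.15, 0.25, 0.70, 0.25, 0.15]

crowd_score = brier_score(crowd, outcomes)
forecaster_score = brier_score(forecaster, outcomes)

print(f"crowd:      {crowd_score:.4f}")
print(f"forecaster: {forecaster_score:.4f}")
```

No single forecast here is far from the crowd, yet the average score is clearly better. Ten made-up questions prove nothing on their own; the point is only that a small, consistent edge is something a scoring rule can measure.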

The Leaderboard

Building a meritocratic leaderboard was the promise of forecasting platforms.

They would identify people with unusually good judgment, measure their performance over time, and make that talent useful. The idea was to discover people whose beliefs repeatedly made contact with reality, then route them toward decisions where reality-contact mattered.

On the first part, finding and measuring that talent, forecasting platforms largely succeeded.

They showed that probabilistic judgment is not just noise. Some people consistently outperform others. They are better calibrated, better at using base rates, better at updating, and less likely to let a preferred story dominate their expectations. Across many questions, these small advantages compound into a meaningful edge.
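"Better calibrated" can also be checked directly. The sketch below groups forecasts into probability buckets and compares the average stated probability in each bucket with how often those events actually happened. The function name, bucket count, and data are all hypothetical, and real platforms may compute calibration differently.

```python
def calibration_table(forecasts, outcomes, n_bins=5):
    """Group forecasts into equal-width probability bins.

    Returns one row per non-empty bin:
    (bin_low, bin_high, mean_forecast, observed_frequency, count).
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(forecasts, outcomes):
        # A probability of exactly 1.0 belongs in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_forecast = sum(p for p, _ in items) / len(items)
        observed = sum(o for _, o in items) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, mean_forecast, observed, len(items)))
    return rows


# Hypothetical track record: stated probabilities and what actually happened.
forecasts = [0.1] * 5 + [0.5] * 4 + [0.9] * 5
outcomes = [0, 0, 0, 0, 1] + [1, 0, 1, 0] + [1, 1, 1, 1, 0]

for low, high, mean_p, observed, count in calibration_table(forecasts, outcomes):
    print(f"{low:.1f}-{high:.1f}: said {mean_p:.2f}, happened {observed:.2f} (n={count})")
```

For a well-calibrated forecaster, the "said" and "happened" columns stay close across buckets once there are enough questions in each. With only a handful of forecasts per bucket, as in this toy record, the gap is mostly noise.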

The Opportunity

The ideas from the forecasting community have started to take hold, but the market for truth is still smaller than the market for reassurance, and many of the most important concepts haven’t been widely adopted yet.

In many cases, the faith healer model limps along simply because of inertia. When you combine entrenched institutional advantages with the natural appeal of a good narrative and drop it all into a noisy world, the faith healer can survive a very long time before anyone starts questioning their track record.

A compelling story will always be easier to absorb than a calibrated probability. But that blind spot is exactly what makes the opportunity so valuable.

Forecasting platforms have identified people whose beliefs have been repeatedly tested against reality. The next step is to route that judgment toward decisions where being wrong is expensive, giving organizations access to something public reasoning usually lacks: a measured record of who is more likely to be right.

