Forecast Scoring Methods

Which method is best for identifying top forecasters?

Nov 07, 2023

In our quest to identify top forecasters, it’s important to distinguish between different methods of scoring and their implications for forecasting behavior. Market-based scoring and reputational scoring each come with distinct challenges and benefits that can significantly influence a forecaster's approach and success. With market-based platforms, it’s often difficult to deconvolve forecasting acumen from trading strategy, while reputationally incentivized sites make this easier since that’s essentially their entire focus. This article looks at the intricacies of these methods, particularly examining Brier and log scoring systems to understand how they shape the forecasting landscape and the identification of top-tier forecasters.

Market-Based Scoring

In theory, if a forecaster could wager enough to move markets to the ground truth odds, markets would approximate proper scoring rules. In reality, individuals are price-takers most of the time, and their bids aren’t large enough to move the market significantly.

Unfortunately, this means that identifying top forecasters from market-based trading platforms is more complicated than simply looking at winnings, since doing well is often less about forecasting and more about trading.

If betting markets are priced very close to the ground truth, then the expected value of making a bet is $0. There are good reasons to expect this to be the case much of the time, since smart money is very good at capitalizing on inefficiencies.

For example, suppose a betting market has a ground truth probability of 99% of resolving true and the market is currently sitting at odds of 99:1.

If an overconfident gambler decides to bet that the event will happen, they can expect to win 99 times out of 100 and lose once. At 99:1 odds, each win pays $1 and the single loss costs the $99 stake. Here is what the expected value of the bet works out to be:

$99 * \$1 - 1 * \$99 = \$0$

In the long run, an efficient market prevents the gambler from losing money, but it also prevents them from making any: they simply break even (assuming no fees / friction for our hypothetical example).
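The break-even arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `expected_profit` and its parameters are made up for this example, and the second scenario (a true probability of 98% at the same price) is a hypothetical added to show what happens when the price is not at the ground truth.

```python
def expected_profit(true_prob: float, stake: float, win_amount: float) -> float:
    """Expected profit of a bet that wins `win_amount` with probability
    `true_prob` and otherwise loses `stake` (no fees or friction)."""
    return true_prob * win_amount - (1.0 - true_prob) * stake


# Market at 99:1 with a ground truth of 99%: risk $99 to win $1.
fair = expected_profit(true_prob=0.99, stake=99.0, win_amount=1.0)

# Hypothetical: same price, but the true probability is only 98%.
mispriced = expected_profit(true_prob=0.98, stake=99.0, win_amount=1.0)

print(f"fair market: {fair:.4f}, mispriced market: {mispriced:.4f}")
```

When the price matches the ground truth the expected profit is zero (up to floating-point rounding), whatever the gambler believes. It only moves away from zero when the price and the true probability disagree.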

In a comment from Scott Alexander’s Superconductor Autopsy, Ted Sanders writes:

As someone who's spent years trading on prediction markets, I can strongly confirm that returns are mostly driven by who can take advantage of idiots the most quickly rather than who is perfectly calibrated and accurate. Programmers who build bots that either notify them or auto-trade have a huge edge over regular people who just check in once a day or once a week. Probably not too dissimilar to the stock market, where high frequency market makers earn tons of money despite having very little to say on what companies are actually worth.

I should mention that I know some of the top performers on Manifold to also be terrific forecasters, but it’s just harder to make an apples-to-apples comparison between participants on market-based platforms. Because of this, I generally prefer to see performance on reputationally-incentivized forecasting sites.

Reputational Scoring

Reputational platforms are dominated by two scoring methods: Brier and log scoring. This is going to be a somewhat controversial view, but both are good scoring systems under the right scenarios. Both are proper scoring rules in the sense described previously, and each has situations where it shines. As per the scoring rules article, the optimal strategy for achieving the best score is always to forecast as close to the ground truth as possible, but now we’ll dive deeper and look at how, if at all, the directionality of the forecast error impacts the score.

Error Directionality

In real-world decision-making, the asymmetry between the consequences of overconfidence and underconfidence often tilts the scale toward the latter being the lesser of two evils. This is primarily due to the principle of diminishing returns and the non-linear impact of losses as they approach total depletion of resources.

Imagine placing a bet with your last $100. If overconfidence leads you to bet the entire amount and you lose, you're left with nothing and might go hungry for the night. On the other hand, if underconfidence results in a more cautious bet of $90 and you lose, you still have $10 left. Not a lot, but enough to secure a modest meal.

Brier Scoring

Brier scoring works well in cases where the directionality of the error doesn’t matter and the goal is to get the closest estimate to the ground truth probability. Under this regime, in expectation, for an event with a ground truth probability of 1%, a forecast of 0% yields the same score as a forecast of 2%. This might seem counterintuitive since, in the rare 1% of cases where the event occurs, the forecaster estimating 0% gets a far worse score than the forecaster estimating 2%, but the low forecaster makes up the lost ground the other 99% of the time.
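One way to see why the direction of the error drops out is to write the expected score in general form. Here $p$ is the ground truth probability and $f$ is the forecast (both symbols are introduced for illustration), and the score is assumed to be the two-sided Brier score, which sums the squared error over both outcomes:

$E[B] = p * 2(1 - f)^2 + (1 - p) * 2f^2 = 2\left[(f - p)^2 + p(1 - p)\right]$

The forecast only enters through $(f - p)^2$, so missing by one percentage point costs the same whether the miss is high or low. With $p = 0.01$, both $f = 0$ and $f = 0.02$ give $2\left[(0.01)^2 + 0.01 * 0.99\right] = 0.02$.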

Here are the calculations for the expected value of always forecasting 0% when the ground truth is 1%:

$\frac{1 * ((1 - 0)^2 + (0 - 1)^2) + 99 * ((0 - 0)^2 + (1 - 1)^2)}{100} = 0.02$

And here are the calculations for the expected value of always forecasting 2% when the ground truth is 1%:

$\frac{1 * ((1 - 0.02)^2 + (0 - 0.98)^2) + 99 * ((0 - 0.02)^2 + (1 - 0.98)^2)}{100} = 0.02$

Although there are many cases in the real world where a forecast of 0% is far more costly than a forecast of 2%, counterexamples where the costs are the same can be easily contrived. If this is at all unclear, feel free to ask for an example in the comments.
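The two Brier calculations above can be checked numerically. The sketch below assumes the same two-sided form of the Brier score used in those calculations (squared error summed over both outcomes, lower is better). The function name `expected_brier` is hypothetical.

```python
def expected_brier(true_prob: float, forecast: float) -> float:
    """Expected two-sided Brier score (lower is better) for a binary event
    that occurs with probability `true_prob`."""
    # Score when the event occurs: outcome vector is (1, 0).
    score_if_yes = (1.0 - forecast) ** 2 + (0.0 - (1.0 - forecast)) ** 2
    # Score when the event does not occur: outcome vector is (0, 1).
    score_if_no = (0.0 - forecast) ** 2 + (1.0 - (1.0 - forecast)) ** 2
    return true_prob * score_if_yes + (1.0 - true_prob) * score_if_no


low = expected_brier(true_prob=0.01, forecast=0.00)
high = expected_brier(true_prob=0.01, forecast=0.02)
print(round(low, 6), round(high, 6))  # 0.02 0.02
```

Any pair of forecasts equally far above and below the ground truth gives matching scores in the same way.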

Log Scoring

Although Brier scoring does a good job of encouraging forecasters to get as close as possible to the true probability of an event, log scoring works better for cases where overconfidence is costly. As the forecaster’s estimate approaches absolute confidence, the penalty for being incorrect approaches infinity, which leads most sites that use this method to cap the confidence a forecaster can express (usually within either 1% or 0.1% of absolute confidence).

The log score is calculated using the natural logarithm of the forecaster's estimated probability for the true outcome.
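Written out, with $f$ the forecast probability that the event occurs and $p$ the ground truth probability (both symbols are introduced here for illustration), and assuming the negative-log convention where lower is better, the score for a single question is $-\ln(f)$ if the event occurs and $-\ln(1 - f)$ if it does not. The expected score is then:

$E[S] = -p \ln(f) - (1 - p) \ln(1 - f)$

The first term grows without bound as $f$ approaches 0, which is where the steep penalty for overconfidence comes from.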

Here are the calculations for the expected value of always forecasting as close to 0% as most sites will allow under log scoring rules when the ground truth is 1%:

$\frac{1 * (-\ln(0.001)) + 99 * (-\ln(0.999))}{100} \approx 0.070$

Here are the calculations for the expected value of always forecasting 2% under log scoring rules when the ground truth is 1%:

$\frac{1 * (-\ln(0.02)) + 99 * (-\ln(0.98))}{100} \approx 0.059$

As we can see, log scoring is more severe on those who forecast a low probability when a rare event actually occurs. This makes it a useful tool in contexts where overconfidence is not just misguided but potentially hazardous.
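The log scoring comparison can be reproduced with a short Python sketch. It assumes the negative natural log convention (lower is better) and a confidence cap of 0.1%, matching the example above. The names `clip_forecast` and `expected_log_score` and the `eps` parameter are hypothetical and do not come from any particular platform.

```python
import math


def clip_forecast(forecast: float, eps: float = 0.001) -> float:
    """Keep a forecast within `eps` of absolute confidence (assumed cap)."""
    return min(max(forecast, eps), 1.0 - eps)


def expected_log_score(true_prob: float, forecast: float, eps: float = 0.001) -> float:
    """Expected negative-log score (lower is better) for a binary event
    that occurs with probability `true_prob`."""
    f = clip_forecast(forecast, eps)
    return true_prob * -math.log(f) + (1.0 - true_prob) * -math.log(1.0 - f)


near_zero = expected_log_score(true_prob=0.01, forecast=0.0)  # clipped to 0.1%
two_pct = expected_log_score(true_prob=0.01, forecast=0.02)
truth = expected_log_score(true_prob=0.01, forecast=0.01)
print(f"{near_zero:.4f} {two_pct:.4f} {truth:.4f}")
```

Unlike the Brier case, the two misses are no longer scored equally: the near-0% forecast comes out worse than the 2% forecast, and both are worse than forecasting the ground truth of 1%.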

If you are in a situation where you can't afford to be too certain—say, risk assessments involving public safety or large financial stakes—log scoring may be the more appropriate system to adopt. It punishes overconfidence more aggressively and can therefore yield a set of forecasters who are not only accurate but also appropriately humble about the limits of their predictive capabilities.

On the other hand, if you’re concerned the directionality imbalance will discourage forecasters from reporting their true beliefs because it’s safer to err on the side of uncertainty, Brier scoring might be the better system.

Of course, you’re probably thinking: if someone is erring on the side of uncertainty, doesn’t that mean they’re truly uncertain? The answer isn't necessarily straightforward. Forecasters might indeed be truly uncertain, but the disproportionate penalty could also bias them toward reporting a less confident forecast to avoid the steep penalties for overconfidence that come with log scoring, especially in situations where reputation is on the line. It might become less about making the most accurate forecast and more about not getting pushed out of the game. Computers won’t have this problem, but for most humans, I suspect it’s difficult to overcome the fear of the reputational hit they could incur if they catch a particularly unlucky break.

Conclusion

To effectively identify top forecasters, the selection of a scoring system must be strategically aligned with the objectives of the forecasting initiative and the criticality of the decisions it influences. Here's how each system can be leveraged:

Market-Based Platforms: While they provide a snapshot of how the collective views the likelihood of an event, they are not optimal for pinpointing top forecasters because winnings reflect trading skill as much as forecasting skill. These platforms are better suited to gauging market sentiment than individual forecasting prowess, since smart money usually gets the market reasonably close to the ground truth probability to start with.
Reputation-Based Platforms: These platforms are inherently designed to discern the forecasting elite. They focus on individual performance metrics, stripping away the market noise and isolating pure forecasting skill.
Error Directionality: In most real-world applications, the ramifications of overconfidence can be more detrimental than those of underconfidence. This distinction is critical when selecting a scoring system for identifying top forecasters.
Log Scoring: This system is ideal in environments where the implications of overconfidence are substantial, such as public safety or large financial decisions, because it imposes severe penalties for overconfidence, thus promoting caution and precision.
Brier Scoring: When the goal is to cultivate a culture of honest probability reporting, Brier scoring is advantageous. It encourages forecasters to share their true assessments without the looming threat of harsh penalties for occasional inaccuracies.

The choice of scoring system for identifying top forecasters should be a reflection of the forecasting environment's tolerance for risk and the importance placed on accuracy versus confidence. Log scoring is the go-to for high-stakes forecasting where overconfidence is a significant liability, while Brier scoring is apt for scenarios where truthful probability estimation is paramount and the environment is more forgiving of miscalculations.
