I recently took Metaculus’ Scoring Survey, and started thinking about the ideal properties of a forecast scoring system. While writing about those properties, I realized that we should probably first review a few foundational concepts starting with scoring rules.
Proper Scoring Rules
When we talk about forecast scoring systems, one concept that frequently surfaces is that of "proper scoring." In a nutshell, a proper scoring rule is designed to incentivize forecasters to report their true beliefs rather than game the system. In an ideal world, we'd want our scoring system to encourage accurate, honest forecasting, not strategic hedging or exaggerated claims.
What sets proper scoring systems apart is their mathematical structure, which penalizes dishonesty as well as over- and under-confidence. That is to say, if the ground-truth probability of an event happening is 75%, then forecasting that probability will maximize your expected score, and reporting any different probability will lower it. This feature aligns the forecaster's interests with truthful reporting, creating a more reliable landscape for predictions.
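To make this concrete, here is a minimal sketch of my own (not something from the survey or from Metaculus) that uses the two-sided Brier score defined later in the post (footnote 2): with a ground-truth probability of 75%, the expected score is lowest, i.e. best, exactly when the reported probability is 75%.

```python
# A minimal sketch, assuming the two-sided Brier score from footnote 2.
# The function name and the grid of candidate reports are my own choices.

def expected_brier(report: float, truth: float) -> float:
    """Expected two-sided Brier score when the event truly has probability `truth`."""
    score_if_event = (1 - report) ** 2 + (0 - (1 - report)) ** 2
    score_if_no_event = (0 - report) ** 2 + (1 - (1 - report)) ** 2
    return truth * score_if_event + (1 - truth) * score_if_no_event

truth = 0.75
for report in (0.5, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0):
    print(f"report {report:.2f} -> expected Brier {expected_brier(report, truth):.4f}")
# The lowest expected score (0.3750) occurs at report = 0.75, the ground truth.
```

Any report more or less confident than 0.75 raises the expected error, which is precisely the property that makes the rule proper.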
Improper Scoring Rules
In contrast to proper scoring rules, improper scoring rules don't incentivize forecasters to report their true beliefs. One example of an improper scoring rule is the Mean Absolute Error (MAE)¹. While easy to understand and implement, MAE often leads to suboptimal forecasting incentives, and forecasters can actually lower their expected error by reporting an estimate that is not their true belief.
A Conniving Scheme
Imagine a politician wants to cut costs on school lunches while throwing a bone to his fast-food-chain-owning donor base. The plan replaces existing lunches with the leftovers from these chains. The catch is, since the food is old, it poses a 5% monthly ground truth risk of bacterial contamination.
Deceptive Incentives
After vocal criticism of the plan due to safety concerns, the politician comes up with a wonderfully awful idea: enlist food safety experts to forecast the risk of contamination on a monthly basis, offering a $5 million prize for the lowest MAE, to be awarded after 18 months of forecasts. To further muddy the waters, the politician opens up the contest to fast-food franchise owners, arguing that the "true experts" will be revealed through the outcome of the competition and that the monetary reward for accuracy will keep everyone honest, since math doesn't lie.
Manipulated Consensus
The participants consist of 4 food safety experts and 5 franchise owners. Each month the food safety experts forecast a 5% risk while the franchise owners forecast a 0% risk. The politician makes sure that it is widely reported that the consensus of the crowd (as indicated by the median forecaster who just so happens to be a franchise owner) is a 0% risk. For the first 10 months, there are no incidents, but on the 11th month, there is an outbreak of food-borne illness. Naturally the public is outraged, but there was always a “negligible” risk that this could happen. The politician urges the public to trust the consensus that this is very unlikely to happen again. No further events arise before the end of the competition.
The Coup de Grâce
The food safety experts had an error of (17 × |0 − 0.05| + |1 − 0.05|) / 18 = (0.85 + 0.95) / 18 = 0.10.
Meanwhile, the franchise owners had a lower error of (17 × |0 − 0| + |1 − 0|) / 18 = 1 / 18 ≈ 0.056.
The naysaying food safety experts are ridiculed for being overcautious. The franchise owners win and divide the prize money. Their investment in the politician has been a wise one.
Takeaway
With a 5% monthly ground truth risk, the MAE system actually incentivized a 0% risk forecast to win the prize, aligning perfectly with the politician's agenda of misrepresenting the real risk. The prize money also served as a generous kickback to his key donors.
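To see the arithmetic behind that incentive, here is a small sketch of my own (the 5% figure comes from the story above; the helper function name is mine) computing the expected absolute error for a few possible reports:

```python
# Expected absolute error (MAE for a single event) under a 5% monthly
# ground-truth risk, for a few candidate reports.

def expected_mae(report: float, truth: float) -> float:
    """Expected |outcome - report| when the event has probability `truth`."""
    return truth * abs(1 - report) + (1 - truth) * abs(0 - report)

truth = 0.05
for report in (0.0, 0.05, 0.10):
    print(f"report {report:.2f} -> expected error {expected_mae(report, truth):.4f}")
# report 0.00 -> 0.0500  (the franchise owners)
# report 0.05 -> 0.0950  (the honest experts)
# report 0.10 -> 0.1400
```

Reporting 0% nearly halves the expected error relative to the honest 5% report, so under MAE the franchise owners' strategy is exactly what the rule rewards.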
Proper Scoring In Practice
Let's take a closer look at how proper scoring works in practice by using a specific example with Brier scoring² (basically the mean squared error, where the lower the score, the better). Imagine a situation where a forecaster and the crowd are estimating the likelihood of an event: the forecaster estimates a 75% chance of the event occurring, which happens to be the ground truth, while the crowd estimates an 80% chance.
In a Single Round
In one instance of this event, let's suppose the more likely outcome happens and the event occurs. For the forecaster, the Brier score would be (1 − 0.75)² + (0 − 0.25)² = 0.0625 + 0.0625 = 0.125.
For the crowd, it would be (1 − 0.80)² + (0 − 0.20)² = 0.04 + 0.04 = 0.08.
In this instance, the crowd would have a better Brier score for this single event (lower error), even though the forecaster's prediction was actually the ground truth.
Over Multiple Rounds
However, over the long run, the situation changes. Let's assume this event occurs multiple times, and each time, the forecaster and the crowd provide the same estimates. In 75 out of 100 instances, consistent with the ground truth, the event occurs, and in the other 25, it doesn't.
For the forecaster, the average Brier score would be [75 × ((1 − 0.75)² + (0 − 0.25)²) + 25 × ((0 − 0.75)² + (1 − 0.25)²)] / 100 = (75 × 0.125 + 25 × 1.125) / 100 = 0.375.
Meanwhile, for the crowd, the average Brier score would be [75 × ((1 − 0.80)² + (0 − 0.20)²) + 25 × ((0 − 0.80)² + (1 − 0.20)²)] / 100 = (75 × 0.08 + 25 × 1.28) / 100 = 0.38.
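If you'd like to check those averages, here is a quick sketch of my own that recomputes them with the two-sided Brier formula from footnote 2 (the function name is mine):

```python
# Long-run check: 100 rounds, 75 resolving Yes, scored with the two-sided
# Brier formula from footnote 2.

def brier(report: float, outcome: int) -> float:
    """Two-sided Brier score for a binary question (outcome is 0 or 1)."""
    return (outcome - report) ** 2 + ((1 - outcome) - (1 - report)) ** 2

outcomes = [1] * 75 + [0] * 25
forecaster_avg = sum(brier(0.75, o) for o in outcomes) / len(outcomes)
crowd_avg = sum(brier(0.80, o) for o in outcomes) / len(outcomes)
print(f"{forecaster_avg:.3f}")  # 0.375
print(f"{crowd_avg:.3f}")       # 0.380
```

The honest 75% forecast edges out the slightly overconfident 80% one, 0.375 to 0.38, just as above.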
The story is similar with Log scoring; over many rounds, the forecaster’s more accurate estimate gradually accrues a better score than the crowd's slightly overconfident one.
While they may not always reward the most accurate forecasters immediately, proper scoring rules are an excellent mechanism for distinguishing skill from noise and promoting honest, calibrated forecasting.
Within a Question Period
Another thing to note is that this can sometimes even play out within the period of a single question as forecasters make updates. For example, suppose a forecaster is trying to place in a tournament. They might be incentivized to make overconfident forecasts to try to place well, especially if there are a small number of questions or if the questions are highly correlated (this is why I'm against forecasting tournaments).

Let's suppose one of the questions has a ground-truth 75% chance of happening given the currently available information, and we'll further suppose that the crowd is right around 75%. Well, a forecaster isn't going to win any prizes for just forecasting the same thing as the crowd, so maybe they decide to take a risk and forecast 100%; after all, they have a 75% chance of getting extra points for the tournament. Now suppose new information comes out a quarter of the way through the question that updates the ground truth to 25%, and the crowd adjusts accordingly. Our prize-seeking forecaster reverses course and whipsaws down to 0%, hoping to make up for lost ground. Things continue along at that level for the next two quarters of the question, but then, in the final quarter, new information is released that updates the ground truth and the crowd back to 75%, so our forecaster goes back to 100%. Finally, the question resolves positively. Let's take a look at the resulting Brier scores, starting with the crowd.
The crowd spent half the time at 25% and half at 75% so its Brier score will be:
0.5 × [(1 − 0.75)² + (0 − 0.25)²] + 0.5 × [(1 − 0.25)² + (0 − 0.75)²] = 0.5 × 0.125 + 0.5 × 1.125 = 0.625
Meanwhile our forecaster’s Brier score will be:
0.5 × [(1 − 1)² + (0 − 0)²] + 0.5 × [(1 − 0)² + (0 − 1)²] = 0.5 × 0 + 0.5 × 2 = 1.0
Well, well, well, crime doesn’t pay. Our forecaster ends up with a relative Brier of:
1.0 − 0.625 = +0.375
Ouch! Even though overconfidence can sometimes be helpful in winning tournaments, it did more harm than good in this case. Overconfidence penalties are even more pronounced with log scoring which we’ll look at in a later post.
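For completeness, here is a short sketch of my own that reproduces the time-averaged scores from this example. It assumes each quarter of the question counts equally toward the average, which is a simplifying assumption; real platforms differ in exactly how they time-weight, so treat this as an illustration rather than any particular site's scoring code.

```python
# Time-averaged Brier scores for the whipsaw example: the question resolves
# Yes, the crowd sits at 75% / 25% / 25% / 75% across the four quarters,
# while the prize-seeker sits at 100% / 0% / 0% / 100%.

def brier(report: float, outcome: int) -> float:
    """Two-sided Brier score for a binary question (outcome is 0 or 1)."""
    return (outcome - report) ** 2 + ((1 - outcome) - (1 - report)) ** 2

weights = [0.25, 0.25, 0.25, 0.25]        # equal weight for each quarter
crowd_path = [0.75, 0.25, 0.25, 0.75]     # crowd forecast in each quarter
forecaster_path = [1.0, 0.0, 0.0, 1.0]    # overconfident forecaster's path

crowd_score = sum(w * brier(p, 1) for w, p in zip(weights, crowd_path))
forecaster_score = sum(w * brier(p, 1) for w, p in zip(weights, forecaster_path))
print(crowd_score)                     # 0.625
print(forecaster_score)                # 1.0
print(forecaster_score - crowd_score)  # 0.375 relative Brier (positive = worse than the crowd)
```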
1. MAE formula: |outcome − outcome_estimate| for a single event. For multiple events, sum the absolute errors and divide by the number of events. This scoring system incentivizes extreme forecasts (either 0 or 1).
2. Brier score formula: (event − event_estimate)² + (nonevent − nonevent_estimate)²
I'm really interested in your thoughts on when it makes sense to use log scoring! It occurs to me that one striking illustration of the problem with mean absolute error is that if something happens 50% of the time, your MAE will be the same no matter what your forecast is. If you said that coin flips never come up heads, your MAE would be the same as it would be if you correctly predicted they'd come up heads half the time.