Many Top Forecasters Aren't That Good
Luck can make mediocre forecasters seem better than they are
Evaluating forecasting performance is muddled by the entanglement of skill and luck. When many forecasters make predictions over a large sample of events, a critical question emerges: how often are the top performers genuinely skilled, as opposed to just lucky?
Track Record Limitations
A forecaster’s track record is a useful indicator, but there are limits to the conclusions we can reliably draw from it. It’s true that, over a long enough time span, a forecaster’s poor calibration will catch up with them, but all except the most prolific forecasters have at most dozens of resolved questions. So when can we be confident that a forecaster is actually adding signal rather than noise?
Bernoulli Trials in Forecast Evaluation
To evaluate whether a forecaster is skilled or merely lucky, we can use the concept of Bernoulli trials. A Bernoulli trial has only two possible outcomes: success or failure. By treating a forecaster's resolved questions as a series of Bernoulli trials, we can apply statistical hypothesis testing to assess the likelihood that the observed performance was due to luck rather than skill.
Imagine a forecaster who, time and again, predicts a 100% chance for events that in reality have a 95% probability of occurring, and where the crowd forecasts that ground-truth probability accurately. Further suppose that they have a perfect track record over a history of 40 resolved questions. At first glance, their record of success might seem impressive. However, this could be a classic case of picking up pennies in front of a steamroller—reaping small gains in the short term while ignoring the looming risk of a catastrophic loss. At the outset, we know the most likely outcome would be to observe 2 negative resolutions out of the 40 questions, but what are the chances that all 40 would resolve positively?
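As a quick check of those numbers (a minimal sketch, assuming the 40 questions are independent and each truly has a 95% chance of resolving positively, as the crowd estimates):

```python
from scipy.stats import binom

n, true_p = 40, 0.95

# Distribution of negative resolutions under the crowd's probability.
negatives = binom(n, 1 - true_p)
print(negatives.pmf(2))          # P(exactly 2 negative resolutions) ~= 0.28, the single most likely outcome
print(binom(n, true_p).pmf(n))   # P(all 40 resolve positively) = 0.95**40 ~= 0.13
```

So even under the crowd's probabilities, a flawless 40-question streak happens by chance alone roughly 13% of the time.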
Hypothesis Testing
We set up our hypothesis test with the following:
Null Hypothesis (H0): The forecaster's success is due to luck (true probability of success for each event is 95%, as estimated by the crowd).
Alternative Hypothesis (H1): The forecaster's success is due to skill (probability of success differs from 95% for each event).
Using the binomial distribution, which is appropriate for a series of independent Bernoulli trials, we calculate the probability of observing the forecaster's success rate under the null hypothesis.
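A minimal sketch of the test using SciPy's exact binomial test (the helper name luck_p_value is just illustrative, and the questions are assumed independent; the one-sided "greater" alternative, which asks whether the forecaster's hit rate is higher than the crowd's probability can explain, is what reproduces the figure in the next section):

```python
from scipy.stats import binomtest

def luck_p_value(successes: int, trials: int, crowd_p: float) -> float:
    """P-value under H0: each event succeeds with the crowd's probability."""
    return binomtest(successes, trials, p=crowd_p, alternative="greater").pvalue
```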
A Lucky Streak?
Imagine a forecaster predicts a 100% chance of success for 20 events, and all 20 events occur. At first glance, their forecasting ability seems remarkable. However, our hypothesis test reveals a different story. With a P-value of approximately 0.358, we find there's a 36% chance of such a success rate occurring by luck alone, given the crowd's 95% estimation. This high P-value suggests we shouldn’t confidently attribute the forecaster's success to skill.
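We can reproduce that figure directly; since the only outcome at least as extreme as 20 successes is 20 successes itself, the p-value is simply 0.95 raised to the 20th power:

```python
from scipy.stats import binomtest

# P-value for 20 successes in 20 trials when each event truly succeeds 95% of the time.
print(binomtest(20, 20, p=0.95, alternative="greater").pvalue)  # ~= 0.3585
print(0.95 ** 20)                                               # the same quantity, computed directly
```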
The problem is compounded by the sheer number of people forecasting. With so many forecasters, some will inevitably get lucky. Moreover, many forecasters are at least somewhat talented and happen to ride a wave of luck on the questions where their judgment is off. This mix of partial skill and good fortune can create the illusion of consistent proficiency.
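A toy simulation makes the point (the numbers are hypothetical: 1,000 forecasters who all predict 100% on 20 independent questions that each truly resolve positively 95% of the time):

```python
import numpy as np

rng = np.random.default_rng(0)
n_forecasters, n_questions, true_p = 1_000, 20, 0.95

# Each row is one forecaster's resolved questions; True = resolved positively.
outcomes = rng.random((n_forecasters, n_questions)) < true_p
perfect_records = outcomes.all(axis=1).sum()
print(perfect_records)  # roughly 1000 * 0.95**20 ~= 360 spotless track records by luck alone
```

In this hypothetical pool, hundreds of forecasters end up with perfect records despite adding no information beyond the crowd's own probabilities.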
Another critical factor is the issue of correlated questions. This can be particularly pernicious in scenarios like election forecasting. For instance, if a forecaster is overconfident in a particular party and happens to be correct by luck, they will likely outperform the crowd across most questions in that category. This correlation can significantly skew the perceived accuracy and skill of the forecaster, as their success is not independent across different questions.
Additionally, polling error often tends to be skewed in a particular direction due to systemic biases, methodological issues, or unforeseen events. A forecaster who aligns with the direction of this skew can appear highly accurate purely due to this bias. When the polling error benefits a particular prediction, it can reinforce the forecaster's apparent success, further complicating the evaluation of their true skill.
Conclusion
This underscores the importance of prudence in forecaster evaluation. Relying on statistical methods like hypothesis testing provides a more grounded assessment of a forecaster's ability. It reminds us that what might look like skill could simply be luck, or some combination of luck and skill. It also reminds us why it’s important to look at crowd medians instead of relying too much on any single forecaster. Aggregated forecasts leverage the wisdom of the crowd, reducing the impact of any single person's lucky streak and providing a more reliable and balanced perspective.