If they aren’t designed carefully, forecasting tournaments can favor biased forecasters.
Forecasting tournaments often focus on areas where the question outcomes are highly correlated. When that happens, a bias in the direction the questions happen to break yields a better score in that particular tournament, even though the same bias leads to worse scores in the long run.
For example, in US election forecasting tournaments, if the Democrats win a contested race in Pennsylvania, they're more likely to also win contested races in Georgia and Nevada. For that particular tournament, a bias toward overestimating the Democrats' chances is likely to yield a better score.
Similarly, I suspect Trafalgar Group’s apparent success in calling certain states in the 2016 election has more to do with a biased methodology favoring Republicans than with superior polling techniques.
Example Tournament
To illustrate the problem, let’s construct a toy forecasting tournament in which a hedge fund seeks to harness the wisdom of crowds to estimate whether some leading tech stocks will rise in a given year. They’ll run the tournament annually from 2019 to 2022, and each year they’ll award $10,000 to the forecaster with the best relative Brier score (if you’re not familiar with the Brier score, think of it as a mean squared error: a higher score means more error and is worse). We’ll use real historical data and compute the actual Brier scores of the various participants.
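The yearly scores below are consistent with the two-outcome form of the Brier score, which is just twice the squared error of a single probability forecast. A minimal sketch of that scoring, and of a relative score measured against the median forecaster, with hypothetical function names:

```python
def brier(p, outcome):
    # Two-outcome Brier score: (p - o)^2 + ((1 - p) - (1 - o))^2 = 2 * (p - o)^2.
    # 0.0 is a perfect forecast; 2.0 is the worst possible score.
    return 2 * (p - outcome) ** 2

def relative_brier(p, median_p, outcome):
    # Score measured against the median forecast; negative means beating the crowd.
    return brier(p, outcome) - brier(median_p, outcome)

# A 75% forecast on a question that resolves "yes":
print(brier(0.75, 1))                # 0.125
# A 100% forecast on the same question, relative to a 75% median:
print(relative_brier(1.0, 0.75, 1))  # -0.125
```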
The Questions
Question 1: Will Apple stock have a positive return by the end of the year?
Question 2: Will Google stock have a positive return by the end of the year?
Question 3: Will Meta stock have a positive return by the end of the year?
Question 4: Will Microsoft stock have a positive return by the end of the year?
Question 5: Will NVIDIA stock have a positive return by the end of the year?
Question 6: Will Tesla stock have a positive return by the end of the year?
Question 7: Will Bitcoin have a positive return by the end of the year?
The Participants
Billy is bull-biased. No matter what happens, he thinks the economy will boom and the stock market will rally, and, since a rising tide lifts all boats, every year he goes all in with 100% yes on every question.
Barry is bear-biased. No matter what happens, he thinks the good times are over and the economy will falter. Every year, he goes all in with 0% on every question.
Callie is well-calibrated. She knows that in a given year, the stock market has about a 75% chance of rising, so she anchors on that assumption and makes adjustments as she finds relevant data to update on. To simplify, we’ll just assume that she uses the 75% value for all of her forecasts.
The Data
The chart below shows the results of the contest (you can audit the code / data used to generate the chart from real historical data here to see that I’m not making this up). We can see that, in a given year, the questions are perfectly correlated.
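A minimal, self-contained sketch of the same computation, assuming the outcomes implied by the chart (every question resolved yes in 2019 through 2021 and no in 2022) and the two-outcome Brier convention above; the linked code works from the actual price data, but running this sketch reproduces the yearly winners and relative scores listed below.

```python
from statistics import median

N_QUESTIONS = 7
# Outcomes implied by the chart: all seven questions break the same way each year.
outcomes = {2019: 1, 2020: 1, 2021: 1, 2022: 0}
forecasts = {"Billy": 1.0, "Barry": 0.0, "Callie": 0.75}

def brier(p, o):
    # Two-outcome Brier score, as above.
    return 2 * (p - o) ** 2

for year, o in outcomes.items():
    per_question = {name: brier(p, o) for name, p in forecasts.items()}
    benchmark = median(per_question.values())  # the median forecaster (Callie every year)
    totals = {name: N_QUESTIONS * (b - benchmark) for name, b in per_question.items()}
    winner = min(totals, key=totals.get)
    print(year, winner, {name: round(t, 3) for name, t in totals.items()})
```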
The Results
2019 Winner: Billy | Relative Brier: -0.875
2020 Winner: Billy | Relative Brier: -0.875
2021 Winner: Billy | Relative Brier: -0.875
2022 Winner: Barry | Relative Brier: -7.875
From 2019 to 2021, Billy cleaned up in the tournament. Each year, he handily beat the median forecaster (Callie) for a relative Brier score of -0.125 on each question and an impressive relative Brier score of -0.875 for the year’s tournament. Billy felt like a genius. Meanwhile, since Callie was the median forecaster, she received a relative Brier score of 0 for each of those years. Finally, Barry ended up with a horrendous relative Brier score of 1.875 per question and a yearly tournament score of 13.125, but he never lost faith. In Barry’s mind, the inevitable downfall would just be all the greater when it finally arrived.
In 2022, the tables turned. As the stock market tanked, Barry felt vindicated. It seemed people were finally starting to wake up to the reality that Barry had known all along. Barry ended the year with an impressive relative Brier score of -1.125 per question, resulting in a score of -7.875 for the tournament that year. Meanwhile, Billy’s perfect track record was marred. He ended the year with a relative Brier score of 0.875 per question, resulting in a tournament score of 6.125 for that year. Even so, Billy was sure things would turn around soon and that markets were just being irrational. After all, he had three years of a solid track record beating the crowd.
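Checking the 2022 arithmetic under the same two-outcome convention: with every question resolving no, Barry scores 2 × (0 - 0)^2 = 0 per question against Callie’s 2 × (0.75 - 0)^2 = 1.125, for a relative score of -1.125, while Billy scores 2 × (1 - 0)^2 = 2, for a relative score of 2 - 1.125 = 0.875 per question.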
Conclusion
Running a good forecasting tournament is hard. If the four years of the tournament were taken together, Callie would be perfectly calibrated and have the best overall Brier score, and the relative Briers would be as follows:
Callie: 0.0 | Billy: 3.5 | Barry: 31.5
The problem is, it can take a long time or a lot of diverse questions to weed out bias. Sometimes very good forecasters do win tournaments, as Greg Justice did in the Think Again tournament and Jared Leibowich did in the World Ahead 2022 forecasting challenge, but that means competing on a variety of questions that require a level of research and focus that can be difficult to find time for. In the meantime, an important takeaway is to be appropriately skeptical and not read too much into the results of any one tournament. To read more about other problems and distortions in forecasting, I suggest looking at this paper on Alignment Problems With Current Forecasting Platforms by Nuño Sempere and Alex Lawsen.