Last year, Scott Alexander reached out to ask Samotsvety to participate in the ACX Forecasting Contest to provide a benchmark for what a “top” forecasting team could do. I made forecasts on behalf of the group, but they were reviewed by several members to make sure there weren’t obvious mistakes. I’ve put together a few thoughts on the experience.
Contest Results
When the contest was scored, Samotsvety came in at the 98th percentile. This not only surpassed the median superforecaster, who landed in the 70th percentile, but also outperformed Manifold at the 89th percentile and even edged out the superforecaster aggregate at the 97.5th percentile. Meanwhile, perhaps unsurprisingly, the Metaculus algorithm ended up in the 99.5th percentile. Although the rankings aligned with my expectations for overall capabilities, I was a little surprised to see so many well-calibrated contenders in the upper echelons of the contest results.
Wait… What?
Yeah, you heard me. I’ve previously expressed concerns about forecasting contests, and, for reasons that should be clear from that article, it’s not that uncommon to see some of the top slots in forecasting tournaments go to forecasters with decidedly unimpressive long-run track records. I was a little surprised to have done so well because I expected more participants to have “gone for broke” in order to win, since there were prizes at stake and there were clusters of questions that I would have expected to be pretty correlated (e.g. Trump / DeSantis).
It Happened Before
The go-for-broke strategy has actually been used in previous ACX forecast contests. For example, Zach Stein-Perlman, a top contender in the 2022 ACX contest, wrote:
> My expected score is slightly worse than it would be if I always gave my true probabilities. I mention this in case you want to exclude me from analysis for that reason. (The form says "there is no strategic advantage to putting anything other than your honest predictions for each event", but this is totally false: I don't care about expected score, just probability of doing very well. If Scott had said to be honest, I would have, but instead he said "The winner will get eternal glory" so I'm lazily* maximizing winning probability. *If the universe was at stake, I would consider other tactics, but it's not, so I'm just being overconfident.)
Zach is absolutely right: within the confines of a single tournament, it usually pays to be overconfident within clusters of correlated questions, but selecting the optimal one-off strategy takes some finesse. Going for broke across too many question clusters increases the chances of having your overconfidence catch up with you.
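To see why Zach’s point holds, here is a minimal, self-contained sketch (my own illustration, not how the ACX contest actually scored entries): a forecaster who reports an honest 60% on a cluster of five questions that all resolve together has a much better expected log score than one who reports 99%, yet the overconfident forecaster still comes out ahead whenever the cluster resolves “yes”, i.e. 60% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SIMS = 100_000       # simulated "worlds"
N_QUESTIONS = 5        # questions in a perfectly correlated cluster
P_TRUE = 0.6           # hypothetical true probability that the cluster resolves "yes"

# One draw per world decides the whole correlated cluster at once.
cluster_yes = rng.random(N_SIMS) < P_TRUE

def total_log_score(forecast, outcomes):
    """Summed log score over the cluster (higher is better)."""
    p = np.where(outcomes, forecast, 1 - forecast)
    return N_QUESTIONS * np.log(p)

honest = total_log_score(P_TRUE, cluster_yes)   # report the true 60%
extreme = total_log_score(0.99, cluster_yes)    # go for broke on the cluster

print("mean score, honest: ", honest.mean())    # ~ -3.4
print("mean score, extreme:", extreme.mean())   # ~ -9.2 (much worse in expectation)
print("P(extreme beats honest):", (extreme > honest).mean())  # ~0.6
```

This is exactly the distinction Zach draws: the honest forecast maximizes expected score, while the extreme one maximizes the probability of a standout result.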
It Probably Would Have Worked
As an example of how this strategy could have worked, look at questions 34 - 36, which I would consider a correlated question cluster.
34. Will Bitcoin go up over 2023?
35. Will Bitcoin end 2023 above $30,000?
36. Will Tether de-peg in 2023?
Since Samotsvety has a long-run reputation to maintain, I gave my true probabilities when I made my forecasts. I deferred to a Monte Carlo simulation which suggested that Bitcoin was relatively unlikely (~34%) to end the year above $30,000. This was less confident than the Metaculus median, which was even lower, and the difference helped my overall performance in the tournament. Meanwhile, if a participant wanted to “go for broke” on this cluster of questions, they might start by assuming that 35 would resolve true, which would imply that 34 was also true and would suggest that Tether would be unlikely to de-peg. In hindsight, we know this would have paid off, but, at the time I entered my estimates, I would have given this strategy a ~30% chance of paying off (i.e. 34% for Bitcoin going above $30K, times ~88% that Tether wouldn’t de-peg given Bitcoin’s rally, times a 100% chance that Bitcoin would go up). Of course, this strategy comes with a ~70% downside chance of absolute humiliation for making absurdly overconfident forecasts, but, for some, maybe this is a risk worth taking for a chance at “eternal glory”.

So if the go-for-broke participant had paired this with a strategy of extremizing every question whose Metaculus median was within ~15% of certainty (~41% chance of success[1] assuming the medians were reasonably well-calibrated, though we know with hindsight this would have worked for 2023) while using the Metaculus medians for the rest, it would have had about a 12% overall chance of working (30% × 41%). That probably would have beaten the Metaculus algorithm, placing the participant within striking distance of winning the entire tournament, especially if they made a few other confident forecasts. While it’s true that any single go-for-broke participant is unlikely to select the right question cluster(s), across more than 3,000 participants I would have expected at least some to stumble onto the right ones.
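For what it’s worth, the back-of-the-envelope arithmetic behind those numbers (using only the rough probabilities quoted above, not my actual models) looks like this:

```python
# Rough probabilities quoted in the text, not outputs of a real model.
p_btc_above_30k = 0.34         # estimate that Bitcoin ends 2023 above $30K (question 35)
p_no_depeg_given_rally = 0.88  # Tether doesn't de-peg, conditional on Bitcoin rallying (question 36)
p_btc_up_given_30k = 1.00      # Bitcoin above $30K implies it went up over 2023 (question 34)

p_cluster = p_btc_above_30k * p_no_depeg_given_rally * p_btc_up_given_30k
print(f"Chance the 34-36 go-for-broke bet pays off: {p_cluster:.0%}")   # ~30%

# Footnote [1]: chance that every near-certain Metaculus median resolves as expected.
p_no_surprises = 0.41
print(f"Chance the combined strategy works: {p_cluster * p_no_surprises:.0%}")  # ~12%
```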
Why Didn’t Go-For-Broke Win This Year?
I expect some contestants did employ the go-for-broke strategy, but, as I mentioned before, it requires finesse, and, in a contest with this many relatively distinct correlated question clusters, I expect anyone who pulls it off to also be a pretty good forecaster. Of course, in some cases, extreme overconfidence looks the same as going for broke, but, when trying to optimize to win a tournament, it helps to have a more deliberate strategy. Here are my theories about why the day seems to have been carried by well-calibrated forecasters.
Go-For-Broke Participants Got Greedy
When employing the go-for-broke strategy, it’s easy to get greedy. Take the implications about Bitcoin’s assumed price increase and apply them to questions 30 - 33.
30. Will US CPI inflation for 2023 average above 4% in 2023?
31. Will the S&P 500 index go up over 2023?
32. Will the S&P 500 index reach a new all-time high in 2023?
33. Will the Shanghai index of Chinese stocks go up over 2023?
Operating as if we knew that Bitcoin would end 2023 above $30,000 would suggest that interest rates probably stayed relatively low. Maybe they even went down. In either case, if we think interest rates never rose high enough to prevent Bitcoin from rallying, it indicates inflation probably came down from its highs (maybe a 70% chance). That leads us to update on question 30 (CPI inflation): it now seems less likely, since we suspect interest rates never went that high. Let’s go for broke and put question 30 at 0%. What about questions 31 (S&P goes up) and 32 (S&P new high)? If we think CPI came in low and interest rates stayed down, the S&P 500 probably went up (85%) and might even have hit an all-time high (maybe still only a 55% chance, given that we were down so much earlier in the year). While question 32 came within a hair’s breadth of happening, extremizing on it would have tanked the game. Now add insult to injury with question 33 (Shanghai index goes up), which we might also have assumed was more likely to go up and extremized accordingly; in hindsight, we know that would have gone very badly.
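Multiplying out the rough numbers above (again, just an illustration treating them as a chain of conditional probabilities, not anyone’s actual model) shows how quickly the greedy narrative becomes a long shot:

```python
# Rough conditional probabilities from the narrative above (illustration only).
p_btc_cluster = 0.30      # questions 34-36 pay off, from the previous section
p_inflation_down = 0.70   # inflation came down from its highs, given the low-rate story
p_spx_up = 0.85           # S&P 500 up over 2023, given low inflation / rates (question 31)
p_spx_ath = 0.55          # ...and also sets a new all-time high (question 32)

p_greedy = p_btc_cluster * p_inflation_down * p_spx_up * p_spx_ath
print(f"Chance the whole greedy narrative holds: {p_greedy:.0%}")  # ~10%, before even touching question 33
```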
Go-For-Broke Participants Picked the Wrong Clusters
As per the previous section, if we had started with the question cluster including questions 30-33, there wouldn’t have been a successful way to convert that into a winning extremized narrative for 2023. Starting with the most unlikely outcome, that the S&P 500 would hit an all-time high, would have been ruinous in 2023 even though it came very close to happening and would have led to many other correct conclusions. With log scores, overconfidence leads to eternal damnation.
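To make the “eternal damnation” concrete, here is a minimal illustration of the log-score asymmetry, using the all-time-high question as the example (the 55% is the rough estimate from the previous section; the contest’s exact scoring details may differ):

```python
import math

def log_score(p, outcome):
    """Log score for a single binary forecast (higher is better; 0 is the maximum)."""
    return math.log(p if outcome else 1 - p)

# Question 32 ("new all-time high in 2023") resolved "no" despite coming very close.
print(log_score(0.55, False))  # ~ -0.80: a hedged forecast takes a modest hit
print(log_score(0.99, False))  # ~ -4.61: one wrong extreme forecast can wipe out many good ones
```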
Go-For-Broke Participants Picked the Wrong Side Within a Cluster
Picking the wrong side of a bet is ruinous when going for broke, and within correlated clusters there are cascading implications. For example, questions 17 and 18 were:
17. At the end of 2023, will prediction markets say Donald Trump is the most likely Republican nominee for President in 2024?
18. At the end of 2023, will prediction markets say Ron DeSantis is the most likely Republican nominee for President in 2024?
It’s possible that both could have resolved as false, with an entirely different candidate taking the lead, but if either one had resolved as true, the other would have had to resolve as false. If someone had gone all-in on DeSantis, it would have tanked them on both the DeSantis and the Trump questions.
Similarly, if someone had gone all-in on Gavin Newsom, they would have lost both on that question and on the Biden question; we now know that bet wouldn’t have paid off in this universe.
The Tournament Was Well-Designed
Even though there were many correlated question clusters, there were enough distinct or loosely tied clusters to make it hard to game the system this year. When I see absurdly overconfident forecasters win tournaments, it’s usually because the contest is limited to a few highly correlated questions (e.g. an elections forecasting tournament).
Most People Wanted to Do Their Best “Honestly”
I should make it clear that, although we’ve never met, I have a lot of respect for Zach (the high-performing 2022 contender mentioned above), and I don’t think it’s unethical to go for broke as part of having fun in an online contest, especially if you make it clear that’s what you’re doing. Still, I would guess the majority of participants just wanted to play according to the intention of the contest and see how they would stack up.
Final Thoughts
I’m pretty confident the Metaculus algorithm came up with objectively better-calibrated forecasts than mine, and that many of the other leading participants, like Peter Wildeford[2] and Ezra Karger[3], really did employ robust strategies rather than just luck or overconfidence. Although there were probably a few lucky forecasters in the top ranks this time, I think many of the contest’s top contenders will continue to be top performers in the years to come.

On the other hand, it’s almost certain that Samotsvety would have done better if we had all contributed independent forecasts and then aggregated them as per our usual method, but, unfortunately, the group just didn’t have the bandwidth. The group was able to steer me away from at least one clearly suboptimal forecast, but with the benefit of hindsight there are definitely a few I could have done better on. If I get the time, I’d like to write up a formal post-mortem on a few of these, but I have so many half-completed drafts in my queue that I probably won’t get around to it until 2027, and by that time I’ll be able to just have an AI do it for me. For that matter, I kind of expect AIs to be better forecasters than me for most questions by then too, but that’s a topic for another time.

Although I had a lot of fun thinking about the questions for 2023, due to personal time constraints I didn’t get a chance to participate for 2024, but I’d like to deep dive on select questions if I get a chance. Meanwhile, Jack[4] made some very good forecasts for this year’s contest, which I mostly agreed with. If you want to peer ahead into the future, I’d suggest starting there.
[1] This year, all the questions with a Metaculus median within ~15% of certainty resolved to their expected outcome.
These were:
1. Will Vladimir Putin be President of Russia? ~85%
2. Will Ukraine control the city of Sevastopol? ~6%
9. Will a nuclear weapon be used in war and kill at least 10 people? ~2%
10. Will China launch a full-scale invasion of Taiwan? ~3%
13. Will any other war have more casualties than Russia-Ukraine? ~10%
16. Will prediction markets say Gavin Newsom is the most likely Democratic nominee for President in 2024? ~9%
26. Will the UK hold a general election in 2023? ~15%
39. Will OpenAI release GPT-4 in 2023? ~87%
50. Will someone release "DALL-E, but for videos" in 2023? ~90%
The chance of there not being a surprising outcome within this group would be 0.85 × 0.94 × 0.98 × 0.97 × 0.90 × 0.91 × 0.85 × 0.87 × 0.90 ≈ 41%.
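The product of those medians (taking the probability of the expected outcome for each) can be checked directly:

```python
from math import prod

# Probability that each near-certain question resolves to its expected outcome,
# taken from the Metaculus medians listed above.
p_expected = [0.85, 1 - 0.06, 1 - 0.02, 1 - 0.03, 1 - 0.10, 1 - 0.09, 1 - 0.15, 0.87, 0.90]
print(f"Chance of no surprises in this group: {prod(p_expected):.0%}")  # ~41%
```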
[2] Peter was 12th this year and 20th last year. He also consistently makes good bets on Manifold.
[3] Regarding this contest, Ezra wrote:
> I began by collecting data from Manifold Markets for these questions. I then compared those forecasts to the forecasts of superforecasters in the blind data, subset to those who had given forecasts on the S&P500 and Bitcoin questions that were reasonably consistent with the efficiency of markets; I subset to those who forecasted between 30% and 80% for the probability that the S&P500 and Bitcoin would increase during 2023, which were the only reasonable predictions by the time blind mode ended in mid-January. I then used my own judgment to tweak forecasts where I strongly disagreed with the prediction markets and the superforecasters (for example, I was more than 15 percentage points away from the average of Manifold Markets and the efficient-market-believing superforecasters on questions 17, 19, 21, 30, 34, and 50). I paid especially close attention to questions where late-breaking news made the superforecasters' forecasts less relevant (and I downweighted their forecasts on those questions accordingly).
> As an example of how this strategy could have worked, look at questions 34 - 36
Incredible, I believe this was actually the exact set of questions I employed this strategy with. I forget if I took the bear or the bull side though.