The Bitter Pipeline
Frontier models made my forecasting pipeline obsolete
When I started building a forecasting bot for the Metaculus AI Benchmark, I was spending my own money. That meant using smaller, cheaper models and squeezing out every bit of signal I could through methodology.
I built a structured pipeline that approximated how a human forecaster would approach a question. On my initial tests, this process-driven approach outperformed the raw model by a healthy margin, and it felt like I was really onto something.
Then Metaculus gave me API credits, and suddenly I could afford frontier models.
I ran my pipeline against these frontier models, expecting my clever methodology to stack on top of their superior reasoning. Instead, I crashed into what so many before me have experienced: the Bitter Lesson.
The Bitter Lesson
The Bitter Lesson is Rich Sutton’s observation that in AI, methods that scale with computation tend to outperform approaches that rely on human-designed structure or domain knowledge. My forecasting pipeline was my human-designed structure, and it didn’t add anything to what the frontier models already provided.
What the Data Showed
My process-driven pipeline, which had shown promise with smaller models, didn’t add noticeable value with frontier models. In my testing, a better use of tokens was simply polling larger models more times to build a larger ensemble, rather than spending tokens on classification, routing, or decomposition.
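To make "polling more" concrete, here is a minimal sketch of the ensemble approach, under assumptions: `noisy_poll` is a hypothetical stand-in for a real model API call sampled at temperature > 0, and the median is one reasonable aggregator (a mean or trimmed mean would also work).

```python
import random
import statistics

def ensemble_forecast(poll, question, n_polls=10):
    """Poll a model n_polls times on the same question and return the
    median probability. `poll` is any callable: question -> probability."""
    probs = [poll(question) for _ in range(n_polls)]
    return statistics.median(probs)

# Stand-in poller: noisy forecasts centered on 0.7, clipped to [0, 1].
# A real poller would call a model API and parse a probability from the reply.
def noisy_poll(question):
    return min(1.0, max(0.0, random.gauss(0.7, 0.1)))

random.seed(0)  # reproducible demo
p = ensemble_forecast(noisy_poll, "Will X happen by 2026?")
print(f"ensemble forecast: {p:.2f}")
```

The whole point is that this loop replaces classification, routing, and decomposition stages: the only structure left is sampling and aggregation.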
Why This Happens
My best guess is that, with weaker models, the pipeline helped because it imposed structure on reasoning that the model didn’t reliably produce on its own. It was scaffolding for limited capability.
Frontier models don’t seem to get much value from this scaffolding. They can reason through base rates, consider multiple scenarios, and synthesize information without being explicitly told to. The pipeline was just consuming tokens that could have been better spent on additional polling.
What I’m Doing Now
I’ve simplified. The elaborate pipeline has been replaced with more polling, and I have new theories I’m testing against the crowd. My testing suggests there may be a way to elicit latent knowledge from models that meaningfully improves forecasts, but I need more data before I’m confident the effect is real.
Testing and Validation
I used Metaculus community probabilities as a calibration benchmark. In my runs, a single frontier model's forecasts correlated with the community at roughly r ≈ 0.65, while an ensemble of multiple frontier models reached roughly r ≈ 0.74. Adding the pipeline on top provided no further uplift over ensembling.
While not perfect, this provided a useful sanity check, and this kind of empirical validation is what makes forecasting bots a productive testbed: we can actually measure what works instead of arguing about theory.
The Lesson
I started this project trying to stretch small models with clever methods, but it turns out the winning approach was just using better ones. Sometimes the boring answer is the one that actually scales. For now, with frontier models, keep it simple: ensemble aggressively and validate empirically.