Deploying a Forecasting Bot

Making testing and iteration affordable

Jan 05, 2026

My forecasting bot is now live in the Metaculus AI Forecasting Benchmark.

In my last post, I argued that this tournament represents forecasting’s “ImageNet moment”: a chance to test methods empirically at scale. Now the work begins.

The Forecast Pipeline

Most AI forecasting bots take a similar approach: feed the question to an LLM, maybe do some web search, ask for a probability. Some get fancy with multi-agent debate or ensemble averaging across models.
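For readers unfamiliar with ensemble averaging, here is a minimal sketch of the idea: ask several models for a probability and combine the answers. The function name `aggregate` and the example numbers are hypothetical, chosen only for illustration; this is not how any particular tournament bot (including mine) is implemented.

```python
from statistics import mean, median


def aggregate(probabilities, method="median"):
    """Combine several probability estimates into a single forecast.

    `probabilities` holds one estimate per model, each between 0 and 1.
    The median is less sensitive to a single outlier than the mean.
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    if any(not 0.0 <= p <= 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")
    if method == "median":
        return median(probabilities)
    if method == "mean":
        return mean(probabilities)
    raise ValueError(f"unknown method: {method}")


# Made-up outputs from three hypothetical models for one yes/no question.
model_outputs = [0.62, 0.70, 0.55]
ensemble_median = aggregate(model_outputs, method="median")
ensemble_mean = aggregate(model_outputs, method="mean")
print(ensemble_median, round(ensemble_mean, 3))
```

The choice between mean and median is a design decision: the mean uses every estimate, while the median ignores how extreme the highest and lowest ones are.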

My bot runs each question through a structured pipeline: a sequence of steps designed to mimic how good human forecasters actually think. The pipeline details will come in future posts as I reveal each stage, but the flow mirrors how I operate as a superforecaster.

Codifying it into a reproducible system that can be tested and refined? That’s the experiment.

Top Priority: Reduce Costs

I applied for LLM credits from Metaculus but haven’t heard back yet, so I’m currently paying for LLM usage out of pocket. (AskNews generously provided free API access for news retrieval over the weekend. Thank you!)

I want to test different approaches, compare results, and figure out what actually works. That takes many iterations: the tournament has hundreds of questions, and every experiment means running the bot across them again. Running LLMs that many times gets expensive fast. In practice, that means before I could iterate on methodology, I needed to make iteration affordable.

The unmodified template bot cost:

$0.109 per question

After my cost-reduction updates:

$0.004 per question

That’s a 27x reduction.

At the old rate, running the bot across 500 questions would cost ~$55. Now it’s ~$2. I can run about 27 experiments for what one used to cost.
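The arithmetic behind those numbers, as a small sketch. The per-question costs and the 500-question run are the figures quoted above; the function and variable names are just illustrative.

```python
def cost_summary(old_per_question, new_per_question, n_questions):
    """Compare the cost of a full run before and after cost reduction."""
    reduction = old_per_question / new_per_question
    return {
        "old_total": old_per_question * n_questions,
        "new_total": new_per_question * n_questions,
        "reduction_factor": reduction,
        # Whole runs at the new rate that fit into the cost of one old run.
        "runs_per_old_budget": int(reduction),
    }


summary = cost_summary(old_per_question=0.109, new_per_question=0.004, n_questions=500)
print(f"old run:   ${summary['old_total']:.2f}")
print(f"new run:   ${summary['new_total']:.2f}")
print(f"reduction: {summary['reduction_factor']:.2f}x")
print(f"new runs per old-run budget: {summary['runs_per_old_budget']}")
```

Changing `n_questions` or the per-question costs shows how quickly the experiment budget scales with either one.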

The Trade-off

I expect a performance hit from these changes. Cheaper models, fewer tokens, less elaborate reasoning. That’s fine for now.

The goal at this stage isn’t to win, it’s to learn. Which parts of the pipeline matter most? Where does additional compute actually help? What’s signal versus noise?

Once I understand what works, I can selectively add compute back where it counts.

What’s Next

The bot is running. Data is accumulating. Over the coming weeks, I'll share results, break down each pipeline step, and adjust based on evidence.

This is forecasting about forecasting: using the tournament as a feedback loop to refine the methodology itself. The first real test: does a structured pipeline, even a cheap one, beat naive approaches?

We’ll find out.
