Classification and Method Selection
matching questions to the right forecasting approach
This is the fourth post in a series on building an AI forecasting bot. In the first post, I argued that forecasting bots represent an opportunity to test methods empirically at scale. In the second post, I got the bot running and cut costs by ~30x to enable experimentation and rapid prototyping. In the third post, I covered the Broken Leg check: the first step in the pipeline and a short-circuit for questions already resolved by breaking news or overwhelmingly likely to fall on a particular side.
Now we get to the core of the pipeline: figuring out what kind of question you’re dealing with and applying the right method.
Why Classification Matters
Not all forecasting questions are the same. “Will Bitcoin drop 25% from its peak before 2030?” is fundamentally different from “Will the US Constitution be amended to allow a third presidential term by 2028?” The first has historical data you can extrapolate from. The second is asking about a rare, discrete event with no direct precedent.
A good human forecaster recognizes this immediately and adjusts their approach. They don’t apply the same mental model to every question. Neither should a bot.
In real life, I might have 50 ways of classifying and forecasting a question, but I don’t have time to encode them all into a bot. Fortunately, I think four approaches capture ~80% of the value.
My pipeline classifies each question into one of four types, then routes it to a specialized method:
BASE_RATE: Questions about events with identifiable reference classes
TIME_SERIES: Questions about quantities with historical trends
CONDITIONAL_CHAIN: Questions requiring multiple independent conditions
NOVEL_EVENT: Questions about unprecedented situations
I’m not the only one taking this approach. On the Metaculus AIB Resources page, mmBot describes a similar strategy:
My bot first does research planning by decomposing the question and generating targeted search queries, scraping and summarizing the most relevant articles (llm-driven). It then classifies the question and routes it (also llm-driven) to one of several specialized context pipelines (eg. stock, sport, election, etc). Finally the optimized prompt and context is used to prompt an llm for a prediction several times which are then ensembled into the final forecast.
The intuition is the same: different question types benefit from different approaches. The classification step costs a few tokens but determines the appropriate method.
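In code, the routing step is just a dispatch on that label. Here’s a rough sketch; the method functions are stubs standing in for the specialized pipelines described below:

```python
def forecast_base_rate(question): ...         # stub: reference class + adjustments
def forecast_time_series(question): ...       # stub: fit history, simulate forward
def forecast_conditional_chain(question): ... # stub: decompose and multiply
def forecast_novel_event(question): ...       # stub: Laplace prior + adjustments

METHODS = {
    "BASE_RATE": forecast_base_rate,
    "TIME_SERIES": forecast_time_series,
    "CONDITIONAL_CHAIN": forecast_conditional_chain,
    "NOVEL_EVENT": forecast_novel_event,
}

def forecast(question, classify):
    """classify(question) returns one of the four labels; route to the matching method."""
    return METHODS[classify(question)](question)
```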
The Four Methods
Base Rate
Most forecasting questions are base rate questions in disguise. “Will X happen?” is really asking “How often do things like X happen, and is there any reason to think this case is different?”
The method:
Identify the reference class: what category does this event belong to?
Find the historical base rate for that class
Adjust for specific conditions that make this case more or less likely
Update based on current evidence
For example: “Will it snow in New York City on December 25th, 2026?”
Reference class: December 25ths in New York City. Historical rate: about 20%[1] of Christmas Days in NYC have had measurable snowfall since records began. Adjustment: current climate trends (slightly warming), La Niña/El Niño conditions for that year, whether there’s existing snow cover. The base rate anchors the estimate; the adjustments move it.
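Here’s a minimal sketch of the anchor-and-adjust arithmetic. The log-odds adjustment scheme and every number in it are illustrative, not what the bot actually uses:

```python
import math

def base_rate_forecast(base_rate, log_odds_adjustments=()):
    """Anchor on the reference-class base rate, then shift it by adjustments
    expressed in log-odds so they compose cleanly."""
    log_odds = math.log(base_rate / (1 - base_rate))
    log_odds += sum(log_odds_adjustments)
    return 1 / (1 + math.exp(-log_odds))

# ~20% historical base rate, nudged down a bit for the warming trend:
# base_rate_forecast(0.20, log_odds_adjustments=[-0.15])  -> ~0.18
```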
The danger is picking the wrong reference class. “Christmas Days in NYC” gives you one rate. “December days in NYC” gives you another. The skill is in choosing the class that’s most predictive for this specific question.
Time Series
Some questions have quantitative historical data that can be projected forward. “Will Bitcoin drop 25% from its peak before 2030?” is asking about a price trajectory with years of history.
The method:
Extract relevant historical parameters (volatility, trend, mean reversion)
Build a simple model of the process
Run Monte Carlo simulations forward to the resolution date
Calculate the probability of crossing the threshold
This works well when the underlying process is relatively stable and you have enough history to estimate its parameters. It works poorly when the process itself might change or when you’re extrapolating far beyond your data.
For crypto prices, the historical volatility is high enough that a 25% drawdown over a multi-year window is very plausible, and the time series method captures this by simulating many possible paths and counting how often they cross the threshold.
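Here’s a minimal sketch of that simulation, using a geometric Brownian motion with made-up parameters; the real pipeline estimates volatility and drift from the actual price history:

```python
import numpy as np

def p_drawdown(spot, prior_peak, years, annual_vol, annual_drift,
               threshold=0.25, n_paths=10_000, seed=0):
    """Estimate P(price falls `threshold` below its running peak before the horizon)."""
    rng = np.random.default_rng(seed)
    steps, dt = int(years * 252), 1 / 252  # daily steps
    z = rng.standard_normal((n_paths, steps))
    increments = (annual_drift - 0.5 * annual_vol**2) * dt + annual_vol * np.sqrt(dt) * z
    prices = spot * np.exp(np.cumsum(increments, axis=1))
    running_peak = np.maximum(np.maximum.accumulate(prices, axis=1), prior_peak)
    drawdowns = 1 - prices / running_peak
    return (drawdowns.max(axis=1) >= threshold).mean()

# Illustrative call, not real parameters:
# p_drawdown(spot=95_000, prior_peak=100_000, years=4, annual_vol=0.7, annual_drift=0.1)
```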
Conditional Chain
Some questions are really asking about a conjunction of independent events. Consider: “Will JD Vance be president on January 20, 2029?”
This isn’t a single event, it’s a chain of conditions that all need to happen:
Will he run for president?
Will he win the Republican primary?
Will the general election happen as scheduled?
Will he win the general election?
Will nothing prevent him from being sworn in?
Each step has its own probability conditional on the previous steps, and these can be multiplied together to produce the forecast. Even if each step is reasonably likely—say 92%, 66%, 98%, 45%, 99%—the product is about 27%. Much lower than any individual step.
The method:
Decompose the question into necessary conditions
Estimate each condition’s probability (often using base rates)
Assess independence and adjust for correlation
Multiply through for the final estimate
This method can be brutally deflationary. Humans chronically underestimate how unlikely conjunctions are. A chain of “probably” events quickly becomes “probably not.”
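The multiplication itself is trivial; a sketch, using the numbers from the example above:

```python
import math

def chain_probability(step_probs):
    """Multiply conditional step probabilities; each entry is assumed to be
    conditioned on all earlier steps having happened."""
    return math.prod(step_probs)

# The example above: 0.92 * 0.66 * 0.98 * 0.45 * 0.99 ≈ 0.27
# Five "probably" steps at 70% each collapse to about 17%:
# chain_probability([0.7] * 5)  -> ~0.168
```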
The skill is in the decomposition. Too few steps and you miss crucial dependencies. Too many and you’re multiplying noise. You want the load-bearing conditions that actually determine the outcome.
Novel Event
Some questions have no reference class and no historical data. “Will a U.S. or U.S.-ally satellite be permanently disabled by another country or organization before January 1, 2027?” is asking about something that hasn’t happened yet (at least not in a confirmed, public way).
For genuinely novel events, I use a Laplace prior. The intuition: if something hasn’t happened in N attempts, your best estimate for it happening next time is 1/(N+2). This comes from Laplace’s rule of succession, a principled way to reason about events with zero observations.
The method:
Pick a reasonable starting point for when the event could have begun happening. For satellite attacks, we might select 1995 since that’s when GPS became fully operational and space infrastructure became a meaningful military target. That gives N=30 years of opportunities with zero confirmed events.
Calculate the per-period probability: 1/(N+2) = 1/32 ≈ 3.1% per year.
Apply it over the remaining window. With about 1 year until the deadline, the cumulative probability is 1−(1−1/32)^1 ≈ 3.1%.
Adjust for evidence not captured in the base calculation. There are more satellites now than in 1995. There have been suspected cyber attacks that might have temporarily disabled satellites. Tail risks like Russian escalation or Taiwan conflicts could change the calculus. Nudge the estimate as appropriate.
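The whole calculation fits in a few lines; the judgment is in choosing the starting year and the final nudges:

```python
def laplace_forecast(years_observed, years_remaining, events_observed=0):
    """Laplace's rule of succession applied per year, then compounded over
    the remaining window. Assumes one 'trial' per year."""
    p_per_year = (events_observed + 1) / (years_observed + 2)
    p_cumulative = 1 - (1 - p_per_year) ** years_remaining
    return p_per_year, p_cumulative

# Satellite example: 30 event-free years since 1995, ~1 year to resolution:
# laplace_forecast(30, 1)  -> (0.03125, 0.03125)
```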
The key insight is that “unprecedented” doesn’t mean “impossible.” It means you should be uncertain and should flag that uncertainty clearly.
Classification in Practice
The classifier is a prompted LLM call that takes the question text and outputs one of the four categories plus a brief justification. It’s fast, reasonably accurate, and has shown real uplift in the empirical results so far. Better instructions let us get away with smaller, cheaper models without losing much performance.
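For illustration, here’s roughly what that call looks like; `call_llm` is a placeholder for whatever client the bot uses, and the prompt is a simplified stand-in for the real instructions:

```python
CATEGORIES = ["BASE_RATE", "TIME_SERIES", "CONDITIONAL_CHAIN", "NOVEL_EVENT"]

CLASSIFIER_PROMPT = """Classify the forecasting question into exactly one category:
BASE_RATE - an identifiable reference class with a historical frequency
TIME_SERIES - a quantity with historical data that can be projected forward
CONDITIONAL_CHAIN - a conjunction of conditions that must all hold
NOVEL_EVENT - an unprecedented event with no usable reference class

Question: {question}

Answer with the category name on the first line and a one-sentence justification on the second."""

def classify(question: str, call_llm) -> tuple[str, str]:
    """Return (category, justification), defaulting to BASE_RATE on ambiguity."""
    reply = call_llm(CLASSIFIER_PROMPT.format(question=question))
    first, _, rest = reply.strip().partition("\n")
    category = first.strip().upper()
    return (category if category in CATEGORIES else "BASE_RATE", rest.strip())
```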
The tricky cases are questions that could plausibly fit multiple categories.
“Will the US recognize Taiwan before 2029?”
This could be:
BASE_RATE: How often do countries make major diplomatic recognition changes?
CONDITIONAL_CHAIN: Requires US decision + willingness to accept China’s response + no reversal
NOVEL_EVENT: The US-Taiwan-China situation is historically unique
In ambiguous cases, the classifier picks the method most likely to produce a calibrated estimate. Usually that means defaulting to base rate with adjustments, since that’s the most robust approach when you’re uncertain about structure.
Why Split It Out
Having the methods separated lets me track what works and what doesn’t. Each question in my dataset records which classification it received, how the method-specific logic processed it, and what probability came out. As questions resolve, I’ll be able to answer:
Which classifications are most accurate?
Are certain methods systematically overconfident or underconfident?
Do some question types benefit more from structured reasoning than others?
Where should I invest effort in improving the pipeline?
This is the whole point of running at scale. Instead of debating whether conditional decomposition “should” work in theory, I can measure whether it actually works in practice. The tournament generates the feedback loop.
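Once questions resolve, the per-method analysis is straightforward. A sketch, assuming each record is a (classification, forecast probability, outcome) tuple:

```python
from collections import defaultdict

def brier_by_method(records):
    """Mean Brier score per classification; lower is better."""
    totals = defaultdict(lambda: [0.0, 0])
    for method, p, resolved_yes in records:
        totals[method][0] += (p - float(resolved_yes)) ** 2
        totals[method][1] += 1
    return {m: s / n for m, (s, n) in totals.items()}

# e.g. brier_by_method([("BASE_RATE", 0.20, False), ("NOVEL_EVENT", 0.03, False)])
```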
[1] Don’t quote me on this; I’m just making it up as an example.



Nice framework for structuring forecasts. The conjunction deflation problem in conditional chains is something I see constantly - people will say each step is "70% likely" without realizing that five steps at 70% each gives you 17% total. Had a project manager once who kept piling "probably fine" assumptions onto a roadmap and then was shocked when nothing shipped on time.
> 1−(1−1/32)^1 ≈ 3.1%.
This is a nice technique, but I find this math is somehow missing some scale-awareness. Suppose we had taken our "period" for assessing the per-period probability to be months instead of years. Then we would have calculated a 1/(30*12 + 2) = 1/362 ≈ 0.28% per-month chance of satellite attack, and substituting values into the above equation, we would get a slightly higher chance:
1 - (1 - 1/(30*12 + 2))^12 ≈ 3.26%
If we had taken it to be a decade, then we would have calculated a 1/(3 + 2) = 0.2 per-decade chance of satellite attack, and substituting in values, we would get something lower:
1 - (1 - 1/(3 + 2))^(1/10) ≈ 2.21%
Is there some art to choosing the period over which we average? Is it a good idea to take the limit as the period gets smaller and smaller?