Resolves according to Metaculus resolution.
Metaculus high-level description:
This question will resolve as Yes if, during 2026, METR estimates that an AI model has a time horizon of ≥3 hours at 80% reliability, i.e. the AI has an estimated ≥80% reliability on tasks requiring human experts ≥3 hours to solve.
At this point, I'd bet on this resolving n/a. The 50%-reliability horizon is going to saturate the current task suite, even if the 80% measurement holds up. Will they even bother measuring?
Regardless, it's hard to forecast. Anthropic's models have been growing fast on this metric lately, but if you extrapolate from, say, o3 or GPT-5, the trend line lands a bit under 3 hours within the next 10 months.
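The extrapolation above is just a doubling-time calculation. A minimal sketch, where the starting horizon and doubling time are illustrative assumptions (not METR's published figures), shows how sensitive the forecast is to those two inputs:

```python
from math import log2

# Illustrative doubling-time extrapolation. Both parameters below are
# assumptions for the sake of the example, not METR's published numbers.
start_horizon_h = 0.5   # assumed 80%-reliability horizon today, in hours
doubling_months = 4.0   # assumed doubling time of the horizon metric
target_h = 3.0          # the question's threshold

# Months needed = doubling time * number of doublings to reach the target.
months_needed = doubling_months * log2(target_h / start_horizon_h)
print(f"~{months_needed:.1f} months to reach a {target_h}h horizon")
```

With these assumptions the threshold lands just past the 10-month window, which is the shape of the "a bit under 3h" worry: small changes to either input flip the resolution.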
The 3-hour threshold is an interesting benchmark because it sits right at the boundary between what current models can do with extended context and tool use, and what would require genuine planning capability.
Current frontier models (Claude, GPT-5) can already maintain coherent multi-step execution over 30-60 minutes with tool access. The gap to 3 hours is less about raw capability and more about error accumulation: each step in a long chain introduces compounding uncertainty. A model that is 99% correct per step is only 74% correct after 30 sequential steps.
The likely path to clearing this is not a single architectural breakthrough but better error recovery and self-correction loops. Agentic scaffolding that catches and corrects intermediate failures is already narrowing this gap. 59% by EOY 2026 seems reasonable.
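Why self-correction loops help can be seen in the same arithmetic. A minimal sketch, with purely hypothetical numbers for raw step reliability, verifier detection rate, and retry budget, shows how a detect-and-retry loop lifts effective per-step reliability and hence the whole chain:

```python
# How a detect-and-retry loop lifts effective per-step reliability.
# All parameters are illustrative assumptions, not measured values.
p_step = 0.95      # raw per-step success probability
p_detect = 0.90    # chance a verifier catches a failed step
max_retries = 2    # retry budget per step

def effective_reliability(p: float, detect: float, retries: int) -> float:
    """Success probability of one step when failures are retried,
    but only if the verifier actually detects them."""
    fail_and_caught = (1 - p) * detect  # failed, and eligible for retry
    success = p
    for _ in range(retries):
        success += fail_and_caught * p          # retry succeeds
        fail_and_caught *= (1 - p) * detect     # retry fails and is caught again
    return success

eff = effective_reliability(p_step, p_detect, max_retries)
print(f"raw step: {p_step}, with retries: {eff:.4f}")
print(f"30-step chain: {p_step**30:.2f} -> {eff**30:.2f}")
```

Under these assumptions, two verified retries per step take a 30-step chain from roughly one-in-five success to better than four-in-five, which is the mechanism by which scaffolding, rather than a new architecture, could close the gap.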