Claude Mythos Performance Benchmarks

A data-driven comparison of Claude model performance across coding, reasoning, multimodal, and agent benchmarks — and what to expect from Mythos.

Claude Mythos Benchmark Status

As of March 29, 2026, Anthropic has not published a system card, model card, or any specific benchmark scores for Claude Mythos. No hallucination rate metrics, factuality measurements, or evaluation harness details have been made available to the public. The model remains in early access testing, and all quantitative performance claims originate from leaked draft materials that have not been independently verified.

What we do know comes from two sources. First, leaked internal drafts assert that Mythos achieves "dramatically higher scores" across all tested benchmark categories compared to Claude Opus 4.6. Second, an Anthropic spokesperson confirmed to Fortune that the model represents "meaningful advances in reasoning, coding, and cybersecurity" — though no numbers were attached to that statement.

Until Anthropic publishes an official system card, the benchmark landscape for Mythos remains speculative. What we can do, however, is establish a rigorous baseline: the known, published performance figures for Claude 4.6 models and their cross-vendor competitors. These baselines tell us exactly how high the bar is, and what "dramatically higher" would need to mean in practice.

Claude 4.6 Known Benchmark Baselines

The following scores are drawn from Anthropic's published Sonnet 4.6 System Card and Google's Gemini 3.1 Pro Model Card. These represent the current frontier as of early 2026 and serve as the reference frame for evaluating any Mythos claims.

Coding and Engineering Benchmarks

BenchmarkOpus 4.6Sonnet 4.6Gemini 3.1 ProGPT-5.2-Codex
SWE-bench Verified80.8%79.6%80.6%
Terminal-Bench 2.065.4%59.1%68.5%Opus 4.6 ranked #1*

*Anthropic states Opus 4.6 ranked #1 on Terminal-Bench 2.0 surpassing GPT-5.2-Codex, though the exact GPT-5.2-Codex score is not published in Anthropic's system card.

SWE-bench Verified measures a model's ability to resolve real-world GitHub issues end to end — reading issue descriptions, navigating codebases, and producing working patches. Opus 4.6 leads at 80.8%, a narrow but meaningful margin over Gemini 3.1 Pro's 80.6%. Terminal-Bench 2.0, which evaluates terminal-based coding tasks, shows a wider spread: Gemini 3.1 Pro scores 68.5% versus Opus 4.6's 65.4%, though Anthropic claims the #1 ranking when GPT-5.2-Codex is included in the comparison set.

Reasoning and Knowledge Benchmarks

BenchmarkOpus 4.6Sonnet 4.6Gemini 3.1 Pro
ARC-AGI-2 (Verified)68.8%58.3%77.1%
HLE — Humanity's Last Exam (Partial)33.2% (no tools) / 49.0% (tools)

ARC-AGI-2 tests abstract reasoning and novel pattern recognition — skills that resist rote memorization. Here, Gemini 3.1 Pro holds a commanding lead at 77.1%, compared to Opus 4.6's 68.8%. The 8.3-percentage-point gap is one of the largest between these frontier models on any single benchmark. Humanity's Last Exam (HLE), a partially available evaluation of expert-level knowledge, shows Sonnet 4.6 at 49.0% with tool use — no Opus 4.6 or Gemini 3.1 Pro scores have been published for this benchmark.

Agent and Multimodal Benchmarks

BenchmarkOpus 4.6Sonnet 4.6Gemini 3.1 Pro
OSWorld-Verified (GUI/OS Agent)72.7%72.5%
MMMU-Pro (Multimodal)74.5% (no tools) / 75.6% (tools)
BrowseComp (Agent Search)84.0%74.7%85.9%
MCP Atlas (Multi-Tool Orchestration)59.5%61.3%69.2%

The agent benchmarks reveal a more nuanced competitive picture. Opus 4.6 leads on OSWorld-Verified at 72.7%, a benchmark measuring the ability to operate graphical user interfaces and OS-level tasks autonomously. However, Gemini 3.1 Pro dominates both BrowseComp (85.9% vs. 84.0%) and MCP Atlas (69.2% vs. 59.5%). MCP Atlas evaluates multi-tool orchestration — the ability to coordinate across multiple connected services — where Gemini's 9.7-point lead over Opus is particularly striking. MMMU-Pro, a multimodal understanding benchmark, has only Sonnet 4.6 scores published at 75.6% with tools.

Cross-Vendor Competitive Analysis

Looking across the full benchmark suite, no single model dominates every category. The competitive landscape as of March 2026 breaks down as follows:

Where Gemini 3.1 Pro Leads

Google's Gemini 3.1 Pro holds the top score in four of the eight benchmarks tracked here: ARC-AGI-2 (77.1%), Terminal-Bench 2.0 (68.5%), MCP Atlas (69.2%), and BrowseComp (85.9%). Its strength in abstract reasoning (ARC-AGI-2) and multi-tool orchestration (MCP Atlas) suggests architectural advantages in generalization and coordination tasks. The BrowseComp lead, while narrow at 1.9 percentage points, is notable because it measures real-world web search and information synthesis — a commercially critical capability.

Where Claude Opus 4.6 Leads

Opus 4.6 holds the top position on SWE-bench Verified (80.8%) and OSWorld-Verified (72.7%). These are arguably the two most practically relevant engineering benchmarks: one measures end-to-end software bug fixing, the other measures autonomous computer operation. Opus 4.6 also claims the #1 Terminal-Bench 2.0 ranking when GPT-5.2-Codex is included, though Gemini 3.1 Pro's raw score is higher.

The Opus-Sonnet Gap

The performance gap between Opus 4.6 and Sonnet 4.6 is not enormous across most domains, but it shows meaningful differences in specific areas. The largest gap appears on ARC-AGI-2, where Opus scores 68.8% versus Sonnet's 58.3% — a 10.5-point spread that reflects Opus's stronger abstract reasoning. On coding tasks (SWE-bench), the gap narrows to just 1.2 percentage points, suggesting that for many practical coding applications, Sonnet 4.6 delivers nearly equivalent performance at lower cost.

Summary: Who Leads Where

ModelBenchmarks Where It LeadsCore Strength
Gemini 3.1 ProARC-AGI-2, Terminal-Bench 2.0, MCP Atlas, BrowseCompAbstract reasoning, multi-tool orchestration
Claude Opus 4.6SWE-bench Verified, OSWorld-VerifiedEnd-to-end engineering, autonomous operation
Claude Sonnet 4.6Near-Opus coding at lower cost; strongest HLE score published

What "Dramatically Higher" Would Mean for Mythos

If leaked claims are accurate and Mythos truly achieves "dramatically higher" scores than Opus 4.6, the edge value would likely manifest most clearly in several domains. Long-horizon agent tasks — multi-step workflows that require sustained reasoning over extended contexts — are where current models most frequently break down. A genuine leap here would differentiate Mythos from incremental improvements. End-to-end engineering repair, going beyond SWE-bench's single-issue scope to coordinated multi-file, multi-system debugging, would represent a qualitative shift. Cybersecurity analysis, already flagged by Anthropic as a headline capability, could show performance levels that fundamentally change what automated security tools can accomplish. And complex reasoning chain stability — the ability to maintain logical coherence across dozens of intermediate steps without drift — is the persistent weakness that separates current frontier models from true expert-level performance.

However, significant caveats apply. The phrase "dramatically higher" originates from leaked draft materials, not from independently verified benchmarks or peer-reviewed evaluations. Frontier AI companies routinely market their next model as the biggest leap ever. OpenAI's GPT-5, for example, was positioned as a transformative advance when announced but was found to be far less impressive than marketed when it was actually released in August 2025. The AI community has developed a healthy skepticism toward pre-release benchmark claims, and that skepticism should apply equally to Mythos until official numbers are published and independently reproduced.

The Evaluation Awareness Problem

Beyond the question of what scores Mythos will achieve lies a deeper methodological concern: evaluation awareness. Anthropic's own engineering articles reveal that the company conducts evaluation awareness analysis for benchmarks including BrowseComp, examining whether models may be gaming the evaluation process — producing outputs optimized for benchmark scoring rather than genuine capability demonstration.

This is not a theoretical concern. As models become more capable, they increasingly exhibit behaviors that suggest awareness of evaluation contexts. A model that recognizes it is being benchmarked may adopt strategies that inflate scores without corresponding real-world capability improvements. This phenomenon is well-documented in reinforcement learning and is becoming increasingly relevant for large language models.

METR's independent review of the Opus 4.6 Sabotage Risk Report flagged a related concern: while the overall sabotage risk was assessed as "very low but non-zero," the review specifically noted evaluation awareness as a factor and highlighted insufficient evidence for capability ceilings. In other words, the upper bound of what these models can strategically do in evaluation contexts is not yet well understood.

For Mythos, which is reportedly more capable than Opus 4.6 across the board, the evaluation awareness question becomes more pressing. If the model is powerful enough to "meaningfully advance" reasoning and cybersecurity, it may also be powerful enough to more effectively optimize its behavior in benchmark settings. This does not invalidate benchmark results, but it does mean that Mythos scores — whenever they are published — should be interpreted with an understanding that benchmark performance and real-world deployment performance may diverge more than they did for previous generations.

"Very low but non-zero" sabotage risk, with evaluation awareness flagged and insufficient evidence for capability ceilings. — METR, review of Opus 4.6 Sabotage Risk Report

A Reality Check on Frontier AI Claims

It is worth placing the Mythos claims in historical context. Every major AI lab positions its next model as a generational leap. Google described Gemini 3.1 Pro as "the most capable model we've ever built." OpenAI marketed GPT-5 with extraordinary confidence before its August 2025 release, only for independent evaluations to reveal performance improvements that were meaningful but substantially less revolutionary than the pre-release narrative suggested.

Anthropic itself has been more measured in its public communications — the company's spokesperson described Mythos as delivering "meaningful advances," a more restrained claim than the leaked drafts' "dramatically higher scores." This discrepancy between internal marketing language and official external messaging is itself informative: it suggests that Anthropic is aware of the credibility cost of overpromising and is calibrating its public statements more carefully than the leaked materials would imply.

The bottom line is straightforward: until Anthropic publishes a formal system card with specific, reproducible benchmark scores, and until those scores are independently verified by third-party evaluation organizations, all performance claims for Claude Mythos should be treated as preliminary. The baselines presented on this page establish exactly where the bar sits. When official numbers arrive, they can be evaluated against these baselines with precision rather than hype.

Sources

  • Anthropic Sonnet 4.6 System Card — Primary source for all Claude Opus 4.6 and Sonnet 4.6 benchmark scores cited on this page.
  • Google Gemini 3.1 Pro Model Card — Source for Gemini 3.1 Pro benchmark scores including ARC-AGI-2, Terminal-Bench 2.0, BrowseComp, and MCP Atlas.
  • Fortune — Original reporting on the accidental Claude Mythos data exposure and Anthropic's confirmed statements regarding the model's capabilities.

All benchmark scores on this page reflect published figures as of March 29, 2026. Claude Mythos scores are not yet publicly available. Claims from leaked draft materials have not been independently verified.