
At a Glance

Lower per-token prices can be misleading: invisible internal "thinking" tokens often dominate bills and can make a cheaper-listed model cost more in practice.

What They Found

Marked differences in internal reasoning length, what the authors call "thinking tokens," drive large gaps between listed API prices and actual bills. Across 8 leading reasoning models and 9 tasks, cheaper list prices led to higher real costs in many pairwise comparisons. Removing thinking-token costs largely restores the price-to-cost ranking, showing thinking tokens are the main culprit. Predicting per-query cost ahead of time is hard because thinking-token usage varies widely, even across repeated runs of the same query.

By the Numbers

1. 21.8% of model-pair comparisons show a price reversal, where the model with the lower listed price ends up costing more.
2. Removing thinking-token costs cuts ranking reversals by 70% and raises Kendall’s τ between price and cost from 0.563 to 0.873.
3. Within-query variance creates an irreducible noise floor: repeated runs of the same query show max/min thinking-token ratios up to 9.7×, with per-query coefficients of variation measurable across models.
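Kendall’s τ, the statistic the study uses to compare the price ranking against the cost ranking, counts concordant versus discordant model pairs; the discordant pairs are exactly the price reversals. A minimal tau-a sketch — the prices and costs below are hypothetical placeholders, not the study’s data:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical list prices ($/1M tokens) vs. measured workload costs ($).
list_price  = [3.5, 15.75, 1.0, 8.0]
actual_cost = [643, 527, 300, 410]

# tau = 1.0 would mean the price ranking perfectly predicts the cost ranking.
print(round(kendall_tau(list_price, actual_cost), 3))  # -> 0.333
```

A τ well below 1.0, as in the paper’s 0.563, means a substantial fraction of model pairs are ranked differently by list price than by actual cost.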

What This Means

Engineers and product teams selecting models for cost-sensitive workloads should run representative cost audits instead of relying on list prices. Cloud procurement and platform teams should push for per-request cost breakdowns from vendors to avoid surprise bills. Model providers should consider exposing expected thinking-token overhead or offering dedicated cost-estimation APIs.
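Such a cost audit can be as simple as replaying a representative workload and totaling billed tokens per request. A sketch under assumed conventions — the field names, rates, and log are hypothetical, real usage logs and rate cards differ by provider, and thinking tokens are typically billed at the output-token rate even though they are invisible in the reply:

```python
def billed_cost(requests, rates):
    """Total billed dollars for a list of per-request token counts.
    Thinking tokens are charged at the output rate (common provider practice)."""
    total = 0.0
    for r in requests:
        total += r["prompt_tokens"] / 1e6 * rates["input_per_mtok"]
        total += (r["thinking_tokens"] + r["output_tokens"]) / 1e6 * rates["output_per_mtok"]
    return total

# Replayed workload: two requests with their (hypothetical) billed token counts.
log = [
    {"prompt_tokens": 1200, "thinking_tokens": 11000, "output_tokens": 400},
    {"prompt_tokens": 900,  "thinking_tokens": 562,   "output_tokens": 350},
]
rates = {"input_per_mtok": 1.25, "output_per_mtok": 10.0}

visible = sum(r["prompt_tokens"] + r["output_tokens"] for r in log)  # ignores thinking
print(f"billed: ${billed_cost(log, rates):.4f} over {visible} visible tokens")
```

Comparing the billed total against a naive estimate from visible tokens alone makes the thinking-token overhead explicit for each candidate model.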

Key Figures

Figure 1: The phenomenon of mismatch between AI model pricing and actual costs. (a) On the same user workloads, AI models with lower listed prices may incur much higher expenses than those with higher prices. For example, Gemini 3 Flash’s list price ($3.5/1M tokens) is 78% cheaper than GPT-5.2’s ($15.75), but its actual cost ($643) is 22% higher than GPT-5.2’s ($527). (b) This dramatically changes the cost ranking and poses a pressing challenge to cost-sensitive users. For example, one might choose GPT-5 Mini over Claude Haiku 4.5 due to its lower listed price, only to find later that it is 43% more expensive on her workload.
Figure 2: The ranking inversion phenomenon. Overall, we observe that the listed price rankings systematically mismatch the actual costs. In addition, the actual cost rankings vary substantially across different tasks. This suggests that standard assessment according to a fixed listed API pricing is misleading.
Figure 3: Cost and token consumption breakdown by token type. Thinking tokens dominate both token volume and total cost for most models, establishing them as the primary candidate for explaining pricing reversal.
Figure 4: Case study: on the same AIME problem, GPT-5.2 uses 562 thinking tokens while Gemini 3 Flash uses over 11,000, leading to 2.5× higher actual cost despite lower API pricing. The mechanism of reversal is the enormous cross-model variance in thinking-token consumption.


Yes, But...

The study covers 8 current reasoning-focused models and 9 diverse benchmarks, so the results are strong for reasoning-heavy tasks but may not generalize to every model or workload. Actual billing details and model tuning options vary by provider and may change over time, affecting the observed reversals. Per-query cost prediction remains difficult because part of the token-usage variance is intrinsic and not predictable from the query alone.

Methodology & More

Researchers measured listed API prices against actual billed costs across 8 prominent reasoning language models and 9 task suites (math contests, science QA, visual puzzles, coding, etc.). They decomposed cost into prompt tokens (what you send), thinking tokens (internal deliberation produced during inference but still billed), and generation tokens (the visible reply). For most models and tasks, thinking tokens made up the bulk of output tokens and cost; differences in thinking-token volume explain most misrankings where the cheaper-listed model actually costs more. An ablation that removes thinking-token charges dramatically reduces ranking reversals (70% fewer) and improves the correlation between list price and real cost.

The team also tried simple prediction baselines to forecast per-query cost (a per-model mean, prompt-length regression, and embedding-based nearest neighbors). These show modest gains on low-variance models but perform poorly on high-variance models, because thinking-token usage fluctuates widely: even repeated runs of the same query can differ by nearly 10×.

The practical takeaways: run workload-specific cost audits for reasoning-heavy applications, press vendors for per-request cost visibility or estimation APIs, and treat per-query cost prediction as an open technical challenge.
Credibility Assessment

The author list includes well-known researchers (e.g., Matei Zaharia, Ion Stoica) and multiple authors with moderate h-indices; although the work is an arXiv preprint, the authors’ reputations raise its credibility to an established level.