The Big Picture
Large language models can predict which of two data-analysis solutions will perform better from reports and code alone, with modest accuracy and well-calibrated confidence: enough to safely filter candidates and cut execution time drastically.
The Evidence
AI models prompted to reason over data reports and code reach meaningful predictive power: the best model hit about 61.5% accuracy at picking the superior solution from a pair. Predictions improve when models see richer, verbalized data summaries rather than just code or raw statistics, and the models report well-calibrated confidence that correlates with correctness. Using these predictions as a filter inside an agent loop (Predict-then-Verify) expands search and reduces physical execution time, producing both faster runs and slightly better final solutions.
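To make the Predict-then-Verify idea concrete, here is a minimal sketch of a confidence-gated filter. The `Candidate` fields, the `llm_judge` callable, and the 0.7 threshold are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of a Predict-then-Verify gate, under stated assumptions:
# `llm_judge` is a hypothetical callable wrapping a reasoning LLM that
# returns ("a" or "b", confidence in [0, 1]) for a candidate pair.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Candidate:
    code: str          # proposed solution code
    data_report: str   # verbalized data summary shown to the judge

def predict_then_verify(
    a: Candidate,
    b: Candidate,
    llm_judge: Callable[[Candidate, Candidate], Tuple[str, float]],
    conf_threshold: float = 0.7,
) -> str:
    """Return "a", "b", or "execute_both".

    A high-confidence prediction prunes one candidate before any code runs;
    a low-confidence one falls back to executing both, keeping the filter safe.
    """
    winner, confidence = llm_judge(a, b)
    if confidence >= conf_threshold:
        return winner        # trust the implicit world model, skip one run
    return "execute_both"    # verify physically when the model is unsure
```

Because low-confidence pairs still run both candidates, the gate trades execution time for accuracy only where the model's calibration supports it.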
Data Highlights
1. 18,438 pairwise comparisons comprise the evaluation corpus used to train and test prediction ability.
2. The top reasoning model achieved 61.5% accuracy, versus 50.0% for random guessing and 50.8% for a complexity-based heuristic.
3. Integrating prediction into an agent (ForeAgent) expanded the set of explored candidates by 3.2× and ran about 6× faster while delivering a +6% performance gain over the baseline.
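For concreteness, the headline metric is just the fraction of pairs where the model's pick matches the winner under real execution. A toy computation, with hypothetical record fields rather than the released schema:

```python
# Toy illustration of the pairwise-preference metric; record fields
# ("predicted", "actual") are hypothetical, not the released schema.
def pairwise_accuracy(records: list[dict]) -> float:
    """Fraction of pairs where the predicted winner matches execution."""
    return sum(r["predicted"] == r["actual"] for r in records) / len(records)

records = [
    {"predicted": "a", "actual": "a"},
    {"predicted": "b", "actual": "a"},
    {"predicted": "b", "actual": "b"},
]
print(f"accuracy = {pairwise_accuracy(records):.1%}")  # 66.7% on this toy set
# Random guessing sits at 50.0%; the paper's best model reaches about 61.5%
# over the full corpus of 18,438 verified pairs.
```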
What This Means
Engineers building autonomous agents and ML automation pipelines can use prediction-based filters to prune expensive trial runs and explore more options within the same time budget. Technical leaders running large-scale model search or hyperparameter sweeps will benefit from reduced compute costs and faster iteration. Researchers working on agent evaluation and reward-model training can use the released corpus to bootstrap models that estimate execution outcomes without running code.
Key Figures

Figure 1: From Execution to Inference. Traditional ML agents improve through costly execution and external feedback, incurring substantial latency. Our work investigates whether superior data-grounded solutions can be identified before execution by leveraging “Implicit Execution Priors”.

Figure 2: Overview of the Framework. (a) Task Definition: The Data-centric Solution Preference task predicts solution superiority and confidence via latent reasoning. (b-c) Data Curation: We collect and filter real-world agent trajectories to construct the Preference Corpus. (d) Augmentation: Inputs are augmented with Verified Data Reports via a “Profile-Verify-Verbalize” pipeline. (e) ForeAgent Application: The model serves as a filter within the Predict-then-Verify loop, predicting preference before physical execution to prune candidates.

Figure 3: Comprehensive Analysis of World Model Mechanisms and Capabilities. (a) Impact of Data Representation: Predictive success stems from semantic data understanding rather than complexity heuristics. (b) Domain Sensitivity: The superiority of verbal reports remains consistent across domains. (c) Scaling Laws: Accuracy decouples from pure parameter scaling. (d) Inference Dynamics: Active reasoning outperforms direct answering with robust stability across temperatures. (e) Calibration Analysis: Self-reported confidence strictly correlates with accuracy. (f) Complexity Discrimination: Accuracy scales with the complexity gap.

Figure 4: Agent Performance Analysis. (a) Task-wise Beat Ratio: ForeAgent achieves an average +6% improvement over the AIDE baseline. (b) Temporal Efficiency: The agent converges to peak performance using only 1/6 of the execution time, achieving an average 6× speedup. (c) Search Breadth: By offloading evaluation to the “Implicit World Model”, ForeAgent explores 3.2× more nodes on average compared to the baseline, significantly expanding the search space within the same time budget.
Considerations
Prediction accuracy is useful but far from perfect (~61%), so these models are best used as front-line filters rather than replacements for real execution. The dataset is dominated by common tasks like classification and regression, so predictive performance may drop on niche scientific or low-data domains. Results depend on the models and reasoning prompts tested; different models, data distributions, or prompt styles may change outcomes.
Methodology & More
The work frames a new task, Data-centric Solution Preference, in which a model predicts which of two candidate solutions will perform better given the task description, a data analysis report, and the code. To study this, the authors compiled 18,438 verified pairwise comparisons from real agent trajectories. They evaluated modern large models with reasoning prompts and varied the input modality (code only, raw data, numerical stats, and verbal reports) to measure how much semantic context helps prediction. Richer, verbalized data reports boost predictive accuracy (reaching ~61.3% in some settings), indicating the models rely on semantic reasoning rather than shallow heuristics like code complexity, and their confidence estimates are well calibrated enough to act as reliable gates. Putting the predictor inside an agent, ForeAgent, whose Predict-then-Verify loop filters out low-probability candidates before executing them, expanded search breadth by ~3.2× and cut real execution time by roughly 6×, while producing a modest +6% improvement over a baseline agent. The dataset and verification traces are released to help train reward-style models and speed up agent rollouts, but practitioners should treat prediction as a cost-saving filter, not a final arbiter, especially in niche tasks.
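Since verbalized data reports drive most of the accuracy gain, here is a hedged sketch of what a "Profile-Verify-Verbalize"-style report builder might look like for a pandas DataFrame. The function name and the specific statistics are illustrative assumptions; the paper's verified pipeline is more involved:

```python
# Hedged sketch of a "Profile-Verify-Verbalize"-style report builder for a
# pandas DataFrame. The function name and the specific statistics are
# illustrative assumptions; the paper's verified pipeline is more involved.
import pandas as pd

def profile_verify_verbalize(df: pd.DataFrame) -> str:
    """Profile a table, keep only recomputable stats, and verbalize them."""
    lines = [f"The dataset has {len(df)} rows and {df.shape[1]} columns."]
    for col in df.columns:
        s = df[col]
        missing = s.isna().mean()
        if pd.api.types.is_numeric_dtype(s):
            # "Verify" step (simplified): these stats come straight from the
            # data, so any verbalized claim can be checked by recomputing it.
            desc = (f"numeric, mean {s.mean():.3g}, std {s.std():.3g}, "
                    f"range [{s.min():.3g}, {s.max():.3g}]")
        else:
            desc = f"categorical with {s.nunique()} distinct values"
        lines.append(f"Column '{col}' is {desc}; {missing:.0%} of values are missing.")
    return "\n".join(lines)
```

The resulting text, rather than raw tables or code alone, is what the preference model conditions on when ranking candidate solutions.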
Credibility Assessment:
The author list includes Huajun Chen (h-index ~31), a well-established researcher, which lends the paper stronger credibility despite the arXiv venue.