
The Big Picture

Small, locally runnable models fine-tuned on a 1,100-example benchmark translate natural-language requirements into formal strategic temporal specifications as accurately as few-shot calls to much larger cloud models, when judged by the evaluator best aligned with human experts, while keeping full control and transparency.

Key Findings

A hand-crafted dataset of 1,100 examples captures common linguistic ambiguities that break naive translations and lets models learn to produce well-formed strategic temporal rules. Small open-source models (3–7B parameters) fine-tuned on that data match the semantic accuracy of strong few-shot cloud models under the judge best aligned with human experts. Exact string match underestimates quality: many correct translations are paraphrases, and automatic judges vary in how they accept those paraphrases, with the best open judge tracking humans most closely.

Key Data

1. Dataset: 1,100 expert-authored examples covering five ambiguity families; temporal-operator counts are roughly X: 590, G: 590, F: 598, U: 586.
2. Judge-recovery gap: the LLM judge recovered up to 46% additional correct outputs beyond exact match for proprietary few-shot runs (gpt-4.1); for fine-tuned open-weight models the judge-recovered share was 23–29%.
3. Evaluation scale: 218 held-out test items and 4,279 judged predictions across six LLM judges plus a human audit; fine-tuned configurations were repeated over three seeds for stability.
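
The judge-recovery gap can be illustrated with a small sketch. It assumes the share is measured as a fraction of all test items, which the summary does not state explicitly; the `Judged` record, the `decompose` helper, and the toy data are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    exact_match: bool    # normalized prediction equals a gold formula
    judge_accepts: bool  # LLM judge deems the prediction semantically equivalent


def decompose(items):
    """Split semantic accuracy into an exact-match floor and a judge-recovered part.

    Assumption: both parts are fractions of the total number of test items.
    """
    n = len(items)
    if n == 0:
        raise ValueError("no judged items")
    exact = sum(1 for it in items if it.exact_match)
    # Count only predictions the judge accepts that exact match missed.
    recovered = sum(1 for it in items if not it.exact_match and it.judge_accepts)
    return {
        "exact_floor": exact / n,
        "judge_recovered": recovered / n,
        "semantic_accuracy": (exact + recovered) / n,
    }


# Toy data (made up): 5 exact matches, 3 judge-accepted paraphrases, 2 wrong.
toy = (
    [Judged(True, True)] * 5
    + [Judged(False, True)] * 3
    + [Judged(False, False)] * 2
)
print(decompose(toy))
```

A system whose outputs follow the reference surface form closely has a high exact-match floor and a small recovered share; a system that paraphrases more depends more heavily on the judge.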

What This Means

Engineers building multi-agent systems and verification pipelines can use the dataset and framework to turn natural-language requirements into formal specs that feed model checkers. Technical leads and privacy-sensitive teams benefit from locally deployable, fine-tuned models that match cloud baselines while avoiding API exposure and giving better transparency and control.

Key Figures

Figure 12: Natural-language to ATL* translation surfaced inside the genVITAMIN interface.
Figure 13: Accuracy decomposition (Llama-3.3-70B judge). Semantic accuracy under the headline judge split into the deterministic exact-match floor (blue) and the additional fraction recovered by the LLM judge (orange), for each headline system (proprietary few-shot baselines; open-weight fine-tuned few-shot). The judge-recovered share (annotated) is largest for the proprietary baselines (up to 46% for gpt-4.1 few-shot) and smaller for the fine-tuned open-weight systems (23–29%), whose outputs more often match the reference surface form. Exact match alone would substantially understate the quality of every system.
Figure 14: Semantic accuracy (under the headline Llama-3.3-70B judge) versus mean inference latency. Open-weight models are circles, proprietary API baselines diamonds; the dashed line is the Pareto frontier, which under the human-aligned judge is shared: the few-shot proprietary baselines (gpt-5.4, gpt-4.1) occupy the high-accuracy end and the fast fine-tuned open-weight systems (phi3, qwen-3b) the low-latency end, while the most accurate open-weight system (qwen-coder-7b) sits just off the frontier, dominated by gpt-5.4. API latency is network-timed and only indicative across the API/local boundary; accuracy is comparable throughout.
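
For readers unfamiliar with the term, the Pareto frontier in the accuracy-versus-latency plot is the set of systems that no other system beats on both axes at once. A minimal sketch follows; the system names and numbers are placeholders invented for illustration, not the paper's measurements.

```python
def pareto_frontier(points):
    """Return the names of non-dominated systems, sorted by latency.

    points maps a system name to (latency_seconds, accuracy).
    Lower latency and higher accuracy are both better. A system is dominated
    if another is at least as good on both axes and strictly better on one.
    """
    frontier = []
    for name, (lat, acc) in points.items():
        dominated = any(
            (l2 <= lat and a2 >= acc) and (l2 < lat or a2 > acc)
            for other, (l2, a2) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


# Placeholder systems (hypothetical values).
systems = {
    "sys_a": (0.2, 0.70),  # fastest, least accurate
    "sys_b": (0.5, 0.80),
    "sys_c": (1.0, 0.78),  # slower and less accurate than sys_b: dominated
    "sys_d": (1.5, 0.90),  # slowest, most accurate
}
print(pareto_frontier(systems))
```

In this toy setup `sys_c` plays the role the caption assigns to a system that sits just off the frontier: it is reasonably accurate, but another system is both faster and more accurate.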

Yes, But...

The benchmark is curated by experts and English-only, so performance on real-world, messy industrial requirements or other languages is not guaranteed. Evaluation depends on LLM-based judges whose agreement with humans is imperfect; reported accuracy should be read as calibrated estimates rather than absolute truth. The work targets a specific family of temporal strategic logic, so requirements needing richer notions (like knowledge, beliefs, or more expressive strategy logics) are out of scope.

Deep Dive

A 1,100-instance, expert-validated dataset was created to teach models how to translate natural-language strategic requirements into precise time-based rules for groups of agents. Examples explicitly cover five tricky ambiguity types (right dislocation, left dislocation, verb-phrase ellipsis, right-node raising, and quantifier scope ambiguity) and an approximately balanced set of temporal operators, so models learn robust mappings rather than surface templates. The dataset entries pair plain-English requirements with one or more canonical formal formulas; when two interpretations are valid (e.g., quantifier-scope ambiguity), both are stored as gold outputs.

The open-source framework runs local open-weight models and cloud APIs under the same interface, applies a strict normalization-based exact-match check first, and then uses multiple LLMs as judges to recover meaning-level equivalence when surface forms differ. Fine-tuning small models (3–7B) on the in-domain data and repeating training across seeds produced semantic accuracy comparable to few-shot prompting of much larger proprietary models, under the judge that best matched human experts.

Practical takeaways: exact string match alone understates real quality because many correct formalizations are paraphrases; locally fine-tuned models give deployment and privacy advantages; and automatic judges need their own validation because stronger generator models, when used as judges, tended to over-reject faithful paraphrases.
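
The two-stage check described above (normalized exact match first, LLM judge second, with several gold formulas allowed per item) can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's actual code: the normalization rules, the `evaluate` function, the ASCII formula syntax, and the lookup-table `stub_judge` standing in for a real LLM call are all hypothetical.

```python
import re


def normalize(formula: str) -> str:
    """Hypothetical normalization: unify a few operator spellings, drop whitespace."""
    aliases = {"&&": "&", "||": "|", "⟨⟨": "<<", "⟩⟩": ">>"}
    for old, new in aliases.items():
        formula = formula.replace(old, new)
    return re.sub(r"\s+", "", formula)


def evaluate(prediction, golds, judge):
    """Return 'exact', 'judge', or 'wrong' for one prediction.

    golds holds one or more accepted formulas (ambiguous requirements may have
    two); judge is any callable (prediction, gold) -> bool.
    """
    pred = normalize(prediction)
    # Stage 1: deterministic exact match against every gold formula.
    if any(pred == normalize(g) for g in golds):
        return "exact"
    # Stage 2: fall back to the judge only when surface forms differ.
    if any(judge(prediction, g) for g in golds):
        return "judge"
    return "wrong"


def stub_judge(prediction, gold):
    """Stand-in for an LLM judge: a lookup table of known-equivalent pairs."""
    known = {("<<a>>!Fcrash", "<<a>>G!crash")}  # G !p is equivalent to !F p
    return (normalize(prediction), normalize(gold)) in known


golds = ["<<a>> G !crash"]
print(evaluate("<<a>>  G !crash", golds, stub_judge))  # whitespace variant
print(evaluate("<<a>> !F crash", golds, stub_judge))   # paraphrase
print(evaluate("<<a>> F crash", golds, stub_judge))    # different meaning
```

Running the deterministic stage first keeps judge calls, and judge error, confined to the cases where the surface forms actually differ.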
Credibility Assessment:

ArXiv preprint, no affiliations listed, and authors have low h-indexes (<10), indicating limited established reputation.