
The Big Picture

Combine supervised fine-tuning with a critic-free reinforcement method and renderer-in-loop inference to get much better, more reliable programmatic animations. Code fixes help, but renderer feedback improves visual quality the most.

Key Findings

Supervised fine-tuning mainly teaches the model the Manim API and improves code-level scores, while a critic-free reinforcement method that uses rendered-video feedback raises visual quality and render success. Adding an iterative renderer-in-loop correction step, optionally with API documentation in the loop, produces the largest gains at inference time. Code-only metrics do not reliably predict visual output, so evaluating both code and rendered video is essential.
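The reinforcement step described above optimizes a single reward mixing code-level and visual similarity. A minimal sketch of that idea, with stand-in metrics and assumed weights (the paper's actual metrics, e.g. CodeBERT/BLEU-style scores and frame-level similarity, and its weighting are not specified here):

```python
# Hypothetical sketch of a mixed code+visual reward. The stand-in metrics,
# function names, and weights are assumptions for illustration only.

def code_similarity(generated: str, reference: str) -> float:
    """Stand-in for a code-level metric (simple token-set overlap here)."""
    gen, ref = set(generated.split()), set(reference.split())
    return len(gen & ref) / max(len(gen | ref), 1)

def visual_similarity(frames_a, frames_b) -> float:
    """Stand-in for a rendered-frame metric (mean per-pixel agreement)."""
    matches = sum(a == b for fa, fb in zip(frames_a, frames_b)
                  for a, b in zip(fa, fb))
    total = sum(len(fa) for fa in frames_a)
    return matches / max(total, 1)

def mixed_reward(gen_code, ref_code, gen_frames, ref_frames,
                 w_code=0.3, w_visual=0.7, render_ok=True):
    """Single scalar reward; a failed render zeroes the visual term."""
    r_code = code_similarity(gen_code, ref_code)
    r_visual = visual_similarity(gen_frames, ref_frames) if render_ok else 0.0
    return w_code * r_code + w_visual * r_visual
```

Because the visual term requires an executed render, the model is pushed toward code that both compiles and matches the target video, not just code that looks textually similar.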

Key Data

1. Top overall result: 85.7% visual similarity and 94% render success rate, achieved by a 30B model fine-tuned with reinforcement learning and three renderer-in-loop + documentation cycles; this beat GPT-4.1 by +3.8 and +2 percentage points respectively.
2. SeedCoder 8B fine-tuned with the reinforcement method reached a 72% render success rate and was one of the best visual performers, outperforming an 80B baseline under vanilla inference.
3. Across 17 models, supervised fine-tuning favored better code scores while reinforcement fine-tuning favored visual scores; under vanilla inference, 8 models performed best after reinforcement fine-tuning and 9 after supervised fine-tuning.

What This Means

This matters to engineers building automated educational video or animation pipelines who need precise, reproducible visuals rather than noisy diffusion video models, and to ML engineers and product leads deciding how to fine-tune compact language models for code-to-video tasks. Follow a two-stage strategy (fine-tune for syntax, then visually ground with renderer feedback) and add renderer-based correction at inference for the biggest payoff.

Key Figures

Figure 1: Overview of the ManimTrainer pipeline. The pipeline uses a quantised base LLM and visually grounds it for Manim code generation, using both visual and text reward signals.
Figure 2: Inference Pipeline of the ManimAgent.
Figure 3: Behaviour of CodeBERTBLEU score against Visual Similarity during the training cycles.
Figure 4: Scaling Behaviour of the Visual Similarity in the fine-tuned LLMs.


Yes, But...

Results are reported on a Manim-focused benchmark and on sub-30B open models; behavior may differ at much larger model scales or with other animation libraries. Renderer-in-loop inference adds compute and latency, since the model must compile and render to obtain feedback. The RITL-DOC approach used a rule-based API retriever; a learned retriever could change the results, especially for smaller models working with longer contexts.
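The rule-based retriever mentioned above can be pictured as a simple lookup keyed on identifiers found in the renderer's error text. The doc snippets, matching rule, and function names below are illustrative assumptions, not the paper's actual retriever:

```python
import re

# Hedged sketch of a rule-based API-doc retriever: extract identifiers from
# the renderer's error message and look them up in a small doc index. The
# index contents and the regex rule are assumptions for illustration.

API_DOCS = {
    "Circle": "Circle(radius=1.0, ...): a circular Mobject.",
    "FadeIn": "FadeIn(mobject): animation that fades a Mobject in.",
}

def retrieve_docs(error_message: str) -> list:
    """Return doc snippets for every known identifier found in the error."""
    mentioned = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", error_message))
    return [API_DOCS[name] for name in sorted(mentioned) if name in API_DOCS]
```

A learned retriever would replace the exact-match rule with semantic lookup, which is why the authors flag it as a variable that could shift results.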

Deep Dive

Combine supervised fine-tuning (to teach Manim vocabulary and syntax) with a critic-free reinforcement learning method that optimizes a single reward mixing code-level similarity and visual similarity between rendered videos. Models were trained and evaluated on a Manim animation benchmark across 17 open-source models from 0.5B to 30B parameters. Because the reinforcement step uses executed renders to compute visual rewards, the model learns to produce code that not only compiles but also matches the target visuals.

At inference time, run iterative renderer-in-loop correction: generate code, try to render it, collect renderer/compile failures or visual differences, then ask the model to fix the code. Adding relevant Manim API documentation into the correction loop (RITL-DOC) further improves fixes for API or usage errors.

Empirically, supervised fine-tuning raised code-similarity metrics while the reinforcement cycle raised visual similarity and render success. Iterative renderer-based correction gave the largest single improvement, and combining all steps produced the top scores (85.7% visual similarity, 94% render success). Importantly, code-only metrics are a weak proxy for final visual quality, so include rendered outputs in evaluation, and consider learned retrievers or multimodal critics for future improvements.
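The generate-render-fix loop can be sketched as follows, with the model and renderer stubbed out. `render`, `generate_fix`, and the retry budget are illustrative assumptions about the pipeline's shape, not its actual API:

```python
# Minimal sketch of renderer-in-loop correction. The renderer and the LLM
# repair call are stubs; a real pipeline would invoke Manim and a model.

def render(code: str):
    """Stub renderer: 'compiles' iff the code contains no 'BUG' marker."""
    if "BUG" in code:
        return False, "NameError: name 'BUG' is not defined"
    return True, None

def generate_fix(code: str, error: str, docs) -> str:
    """Stub for the LLM repair call; here it just swaps out the bad token."""
    return code.replace("BUG", "Circle()")

def renderer_in_loop(code: str, docs=(), max_rounds: int = 3):
    """Generate -> render -> on failure, feed the error (plus any retrieved
    API docs, as in RITL-DOC) back to the model and retry."""
    for _ in range(max_rounds):
        ok, error = render(code)
        if ok:
            return code, True
        code = generate_fix(code, error, list(docs))
    return code, render(code)[0]
```

The loop terminates either on a successful render or after the retry budget is exhausted, which is where the extra compute and latency noted earlier come from.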
Credibility Assessment:

No author affiliations are listed, author h-indices are low (≤7), and the work is an arXiv preprint with no citation signal: limited information and emerging credibility.