Key Takeaway
Separate the 'what' from the 'how': having a language model list the concepts first, then using a layout engine to render them, produces large numbers of accurate, diverse diagrams cheaply.
ON THIS PAGE
Key Findings
An agent that first extracts domain knowledge as a set of conceptual 'ideas' and then hands those ideas to a dedicated diagram renderer can produce textbook-level diagrams at scale. Using this two-stage approach, the system generated 10,693 unique conceptual programs and 106,930 rendered diagrams (10 variations each) while keeping costs very low. The produced images are more faithful to the intended concepts than those from direct image generation or direct code synthesis, and the pipeline yields aligned image–caption pairs for dataset creation and evaluation.
Data Highlights
10,693 unique conceptual programs produced, rendered into 106,930 diagrams (10 variations per program).
About 1.55 billion model tokens consumed (≈1,470M input + 46.6M output) at a reported cost under $400 using GPT-4o-mini.
Compared to baselines: a TikZ code model compiled correctly for only 3 of 5 test prompts, while direct image diffusion models produced visually messy or concept-missing outputs.
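The reported cost can be sanity-checked with back-of-the-envelope arithmetic. The per-million-token prices below are an assumption (GPT-4o-mini list pricing at the time of writing); the source only states "under $400":

```python
# Back-of-the-envelope check of the reported token counts and cost.
# The prices are an ASSUMPTION (GPT-4o-mini list pricing), not from the source.
INPUT_TOKENS_M = 1470.0    # ~1,470M input tokens
OUTPUT_TOKENS_M = 46.6     # ~46.6M output tokens
PRICE_IN_PER_M = 0.15      # USD per 1M input tokens (assumed)
PRICE_OUT_PER_M = 0.60     # USD per 1M output tokens (assumed)

input_cost = INPUT_TOKENS_M * PRICE_IN_PER_M      # 220.50
output_cost = OUTPUT_TOKENS_M * PRICE_OUT_PER_M   # 27.96
total = input_cost + output_cost                  # 248.46
per_diagram = total / 106_930                     # amortized over all renders

print(f"estimated total: ${total:.2f}, per diagram: ${per_diagram:.4f}")
```

Under these assumed prices the estimate lands around $250, consistent with the reported "under $400", and works out to a fraction of a cent per rendered diagram.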
Why It Matters
Engineers building multimodal AI systems or agents that need reliable visual outputs can use this pattern to generate high-quality, concept-aligned diagrams. Teams creating training or evaluation datasets (educational content, scientific visuals, diagram benchmarks) can cheaply synthesize large, diverse, and captioned diagram corpora. Product and research leads can adopt the two-stage idea-then-render workflow to improve fidelity over single-step image or code generation.
Key Figures

Figure 1: The Feynman Agent

Figure 2: Idea Step: In the first step, Feynman enumerates the knowledge given a specific domain.

Figure 3: Iterate Step: At each step, Feynman attempts to write a Penrose program to create a diagram. The generated program is compiled into images and sent to a panel of visual judges (MLLMs) for critical feedback. We term this algorithm Iterative Visual-Refine (Algorithm 1).

Figure 4: Examples of conceptual diagrams and their Substance notations: a graph whose node connections form a cube (left) and the Lewis structure of the formaldehyde molecule, CH₂O (right).
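The "what vs. how" separation means a conceptual program like the cube graph in Figure 4 is nothing more than typed entities plus relations; all layout is left to the renderer. A minimal data-only sketch in Python (illustrative; not actual Penrose Substance syntax):

```python
# Data-only view of a conceptual program: entities and relations, no
# coordinates or styling. (Illustrative; NOT real Substance syntax.)
# Cube corners are numbered 0..7 as 3-bit coordinates; two corners share
# an edge exactly when their binary labels differ in one bit.
cube_graph = {
    "entities": {f"v{i}": "Vertex" for i in range(8)},
    "relations": [
        ("Edge", f"v{a}", f"v{b}")
        for a in range(8) for b in range(a + 1, 8)
        if bin(a ^ b).count("1") == 1   # Hamming distance 1 => cube edge
    ],
}
print(len(cube_graph["relations"]), "edges")
```

A renderer is then free to place the eight vertices however its layout optimization prefers, which is what makes ten distinct visual variations per program possible.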
Considerations
The approach relies on the language model producing correct domain knowledge; errors in the 'ideas' step will propagate into the visuals. The renderer targets conceptual, vector-style diagrams and may not suit photorealistic or highly detailed raster imagery. Results and human readability were tested mainly on scientific and math-style diagrams; some domains showed weaker scaling when domain knowledge is sparse, requiring more curation. For potential pitfalls, consider Context Drift.
Full Analysis
Feynman uses a four-step pipeline — idea, plan, iterate, render — that intentionally separates knowledge elicitation from visual production. First, a language model enumerates the relevant concepts and relationships for a target diagram (the "Substance"). Those abstract concepts are translated into a structured program that encodes entities and relations but not low-level drawing details. The system then uses an optimization-driven diagram renderer that maps each concept to geometric shapes and layout constraints, producing multiple visual variations while preserving semantics. To improve quality, the pipeline iteratively refines generated programs using visual feedback from multimodal models acting as judges. The team produced 10,693 Substance programs and rendered 10 variations each (106,930 images), paired with captions and question–answer items to form a benchmark called Diagramma. The workflow proved far more reliable than asking a diffusion model to draw diagrams or forcing a language model to output low-level drawing code. Economically, the approach scaled to over a billion tokens at a reported cost under $400, making it practical for synthetic dataset creation. Limitations include dependency on the language model's factual accuracy, the renderer's scope (conceptual/vector diagrams), and the need for domain-specific checks in knowledge-sparse areas.
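The iterate step described above can be sketched as a simple refine loop. The sketch below is illustrative, not the authors' code: `write_program`, `compile_diagram`, and `judge` are hypothetical stand-ins for the LLM call, the Penrose-style compiler, and the MLLM judging panel.

```python
# Illustrative sketch of the Iterative Visual-Refine loop.
# All three callables are hypothetical stand-ins, NOT the paper's actual API:
#   write_program(idea, feedback) -> str          LLM drafts/revises a program
#   compile_diagram(program)     -> bytes | None  renderer; None on failure
#   judge(idea, image)           -> (score, str)  judge panel's score + critique

def visual_refine(idea, write_program, compile_diagram, judge,
                  max_rounds=3, accept_score=0.8):
    feedback = None
    best = None
    for _ in range(max_rounds):
        program = write_program(idea, feedback)
        image = compile_diagram(program)
        if image is None:                  # compilation failed: feed error back
            feedback = "program did not compile"
            continue
        score, feedback = judge(idea, image)
        if best is None or score > best[0]:
            best = (score, program, image)
        if score >= accept_score:          # judges satisfied; stop early
            break
    return best                            # highest-scoring (score, program, image)
```

The key design choice mirrored here is that the judges only see rendered images, never the program text, so feedback targets visual fidelity rather than code style.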
Credibility Assessment:
The authors are affiliated with Carnegie Mellon (a top university) and include recognized researchers; however, the venue is arXiv rather than a top conference, so the rating is high but not top.