Key Takeaway

An agent-driven workflow can convert complete C interpreters into safe Rust that passes their test suites, requires only a few brief human decisions, and eliminates many memory-safety bugs.

Key Findings

An automated system called Reboot translated six real-world C interpreters (6k–23k lines) into safe Rust that contains no unsafe blocks and passes 100% of the provided test suites. When checked against independent, unseen validation tests, the translations still passed at rates between 62% and 92%, depending on the project. The system combines a multi-agent setup that validates results continuously with a "feature reduction" strategy that simplifies the interpreter into runnable milestones, so agents can safely restructure code to fit Rust's ownership rules. Translations took 28–90 hours and required only 1–11 brief human interventions.

Data Highlights

1. All six translated interpreters passed 100% of their provided test suites after translation.
2. Pass rates on independent (unseen) validation tests ranged from 62% to 92% across benchmarks.
3. Each translation required 28–90 hours, cost $460–$1,780, and needed 1–11 human interventions (about 5 minutes each).

What This Means

Engineers and technical leaders planning migrations of security-critical runtimes or embedded script engines, especially teams maintaining interpreters written in C, can use this approach to reduce memory-safety risk with little manual effort. Researchers and tool builders working on agent-to-agent protocols or code transformation should care because the work shows how multi-agent checks and feature-based decomposition improve end-to-end correctness.

Key Figures

Figure 1. The workflow of the translation process using Reboot.
Figure 2. An example of changes in source code as well as the test suite across feature levels during translation.
Figure 3. Comparison of the regex compilation state in C (struct cstate) and Rust (CompileState). Corresponding fields are connected by lines. ➀ Fields for the output program and arena allocator are absent in Rust, replaced by local variables and individual heap allocations. ➁ Fields for setjmp/longjmp error handling are absent in Rust, replaced by Result-based error propagation.
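The two changes described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the translated code from the paper: the fields of `CompileState`, the `CompileError` variants, the `MAX_GROUPS` limit, and the toy "program" format are all assumptions made for illustration. It only shows the general shape of replacing a setjmp/longjmp escape with `Result` and `?`, and of returning the output program as a local, heap-allocated value instead of keeping it in the state struct.

```rust
/// Hypothetical error type; the paper's actual error type is not shown in the source.
#[derive(Debug, PartialEq)]
enum CompileError {
    UnbalancedParen,
    TooManyGroups,
}

/// Illustrative limit, chosen only to keep the example small.
const MAX_GROUPS: usize = 2;

/// Heavily reduced, hypothetical stand-in for the Rust compile state.
/// Note what is absent: no jump buffer and no output-program field.
struct CompileState<'a> {
    pattern: &'a [u8],
    pos: usize,
    groups: usize,
}

impl<'a> CompileState<'a> {
    fn new(pattern: &'a str) -> Self {
        CompileState { pattern: pattern.as_bytes(), pos: 0, groups: 0 }
    }

    /// Scans the pattern, counting groups and checking parenthesis balance.
    /// Where C code might longjmp out on error, this returns `Err`.
    fn check_groups(&mut self) -> Result<usize, CompileError> {
        let mut depth = 0usize;
        while self.pos < self.pattern.len() {
            match self.pattern[self.pos] {
                b'(' => {
                    depth += 1;
                    self.groups += 1;
                    if self.groups > MAX_GROUPS {
                        return Err(CompileError::TooManyGroups);
                    }
                }
                b')' => {
                    // A closing paren with nothing open is an error.
                    depth = depth.checked_sub(1).ok_or(CompileError::UnbalancedParen)?;
                }
                _ => {}
            }
            self.pos += 1;
        }
        if depth != 0 {
            return Err(CompileError::UnbalancedParen);
        }
        Ok(self.groups)
    }
}

/// The output program is a local `Vec` (its own heap allocation) handed back
/// to the caller, rather than a field of the state backed by an arena.
fn compile(pattern: &str) -> Result<Vec<u8>, CompileError> {
    let mut state = CompileState::new(pattern);
    let groups = state.check_groups()?; // `?` propagates the error to the caller
    let mut program = Vec::with_capacity(pattern.len() + 1);
    program.push(groups as u8); // toy header: the group count
    program.extend_from_slice(pattern.as_bytes());
    Ok(program)
}
```

Calling `compile("(a")` in this sketch yields `Err(CompileError::UnbalancedParen)` through an ordinary return path, so every caller has to handle the failure explicitly instead of relying on a non-local jump.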
Figure 7. Three-level implementation architecture of Reboot. The L3 Phase Controller is long-lived and manages the overall three-phase process. L2 components (Branch Controller and Manager Agent) are created for each feature level. L1 Worker Agents persist for the duration of that feature level to handle multiple tasks, and get restarted when faulty. When the Manager escalates an issue, it reaches the User Delegator Agent at L3, which auto-resolves common patterns and forwards only unresolved cases to the human user.
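The escalation path in the Figure 7 caption, where the User Delegator Agent auto-resolves common patterns and forwards only the rest to a human, could look roughly like the sketch below. The source does not say which patterns count as "common", so the `Escalation` variants, the canned resolutions, and the function names are all hypothetical.

```rust
/// Hypothetical kinds of issue a Manager Agent might escalate.
/// The real categories are not listed in the source.
#[derive(Debug, Clone, PartialEq)]
enum Escalation {
    /// Assumed example of a recurring, mechanically fixable pattern.
    KnownPattern { name: String },
    /// Assumed example of a question that needs human judgment.
    DesignChoice { question: String },
}

/// Outcome of passing an escalation through the delegator.
#[derive(Debug, PartialEq)]
enum Resolution {
    AutoResolved(String),
    ForwardedToHuman(String),
}

/// Resolves known patterns automatically and forwards everything else.
fn delegate(escalation: &Escalation) -> Resolution {
    match escalation {
        Escalation::KnownPattern { name } => {
            Resolution::AutoResolved(format!("applied canned fix for {}", name))
        }
        Escalation::DesignChoice { question } => {
            Resolution::ForwardedToHuman(question.clone())
        }
    }
}

/// Counts how many escalations would actually reach the human user.
fn human_interventions(escalations: &[Escalation]) -> usize {
    escalations
        .iter()
        .filter(|e| matches!(delegate(e), Resolution::ForwardedToHuman(_)))
        .count()
}
```

Under these assumptions, the number of human interventions is the count of escalations the delegator cannot match to a known pattern, which is one way a long run could end up needing only a handful of them.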

Yes, But...

Results depend heavily on the provided test suites as the definition of correctness, so untested behavior may still differ semantically from the original. The approach currently targets standalone interpreters and fully safe Rust (no unsafe code), so components that must use low-level unsafe idioms, such as JITs, are out of scope. Outcomes are nondeterministic because of the underlying language model: runs can differ in cost, time, and the exact code produced, and some security fixes may reflect LLM training data rather than pure translation logic.

Full Analysis

Reboot translates C interpreter programs to safe Rust by combining two ideas: multi-agent orchestration and feature reduction. Feature reduction progressively strips cross-cutting language features (for example, regular expressions or exception handling) to produce a sequence of runnable, simpler interpreter versions. The system translates the simplest version first and then incrementally restores features, validating the program at each milestone against an adapted test suite. The multi-agent side runs translator, validator, reviewer, and cleanup agents inside isolated containers, with a manager enforcing a finite-state workflow, automated recovery from common failures, and occasional brief escalations to a human for design choices.

Evaluated on six open-source interpreters (6k–23k lines), Reboot produced safe Rust translations that passed all provided tests and achieved 62%–92% pass rates on independently created validation tests. The process eliminated many memory-related vulnerabilities in a security case study and produced median runtime slowdowns of about 1.28x–1.51x. An ablation showed that feature reduction improved unseen-test pass rates by 6%–20% compared with multi-agent orchestration alone. Translation runs cost roughly $460–$1,780 and took 28–90 wall-clock hours, with only a handful of short human interventions per project. The approach is promising for reducing memory-safety risk when migrating interpreters, though it relies on test coverage and on the behavior of current large language models.
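The feature-reduction idea described above can be sketched as a small scheduling loop. This is a simplified illustration under stated assumptions, not Reboot's implementation: the `Feature` variants, the `TestCase` shape, and the rule that a test is kept when all of its required features are enabled are hypothetical, and the actual translation step is abstracted into a caller-supplied `passes` check.

```rust
/// Hypothetical cross-cutting interpreter features that can be stripped.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Feature {
    Regex,
    Exceptions,
}

/// Hypothetical test case annotated with the features it exercises.
struct TestCase {
    name: &'static str,
    requires: Vec<Feature>,
}

/// Adapts the suite to a feature level: keep only tests whose required
/// features are all enabled at that level.
fn adapt_suite<'a>(suite: &'a [TestCase], enabled: &[Feature]) -> Vec<&'a TestCase> {
    suite
        .iter()
        .filter(|t| t.requires.iter().all(|f| enabled.contains(f)))
        .collect()
}

/// Builds feature levels from the simplest (nothing enabled) to the full
/// interpreter, restoring one feature per level.
fn milestones(restore_order: &[Feature]) -> Vec<Vec<Feature>> {
    (0..=restore_order.len())
        .map(|n| restore_order[..n].to_vec())
        .collect()
}

/// Walks the levels in order and validates each one against its adapted
/// suite. Returns the number of validated levels, or the first failure.
fn translate_incrementally<F>(
    suite: &[TestCase],
    restore_order: &[Feature],
    mut passes: F,
) -> Result<usize, String>
where
    F: FnMut(&TestCase, &[Feature]) -> bool,
{
    let mut validated = 0;
    for enabled in milestones(restore_order) {
        for test in adapt_suite(suite, &enabled[..]) {
            if !passes(test, &enabled[..]) {
                return Err(format!("{} failed at level {:?}", test.name, enabled));
            }
        }
        validated += 1;
    }
    Ok(validated)
}
```

The point of the structure is that a failure is pinned to the level where a single feature was restored, which keeps the search space for the agents small compared with validating the full interpreter in one step.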
Credibility Assessment:

Authors include well-known researchers (Daniel Kroening, Prateek Saxena) indicating top-tier credibility.