At a Glance
Helium pre-warms and globally reuses model prompt state across agent workflows to eliminate redundant model work, speeding complex multi-step agent runs by up to 1.56× while producing identical outputs.
What They Found
Treat agent workflows as data pipelines and expose prompt structure to the serving layer so the system can reuse shared parts of prompts and intermediate results. A proactive key-value cache (pre-warmed for static prompt prefixes), a cost-aware optimizer, and a templated radix tree that captures prompt structure together let the system avoid repeated LLM computation across operators, queries, and batches. That combination yields measurable end-to-end speedups while preserving exact semantics. The approach works best for on-prem deployments where repeated or batched agent workloads expose sharing opportunities. For inspiration on structuring complex reasoning within prompts, consider the Tree of Thoughts Pattern.
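The core intuition can be illustrated with a minimal sketch: when several operator prompts share a static prefix (system instructions plus few-shot examples), the serving layer can prefill that prefix once and reuse the resulting model state for every call. The `shared_prefix` helper below is illustrative, not Helium's actual API.

```python
def shared_prefix(prompts):
    """Longest common prefix across a batch of prompt strings."""
    if not prompts:
        return ""
    # The lexicographic min and max bound the common prefix of the whole batch.
    lo, hi = min(prompts), max(prompts)
    i = 0
    while i < len(lo) and lo[i] == hi[i]:
        i += 1
    return lo[:i]

prompts = [
    "You are a financial analyst. Example: ... Task: summarize Q1.",
    "You are a financial analyst. Example: ... Task: summarize Q2.",
]
prefix = shared_prefix(prompts)
# The shared prefix would be prefilled (pre-warmed) once; only the short
# differing suffixes need fresh prefill work on each call.
```

In Helium's setting the prefixes in question are long (over 200 tokens in the evaluated examples), so avoiding their repeated prefill is where the savings come from.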
Key Data
1. Up to 1.56× end-to-end speedup on primitive workflows compared with state-of-the-art baselines.
2. Up to 1.34× speedup on a complex financial-analysis (trading) workflow that mixes parallel, debate, and map-reduce patterns.
3. Proactive caching targets static prompt prefixes (>200 tokens in the evaluated examples), enabling reuse of prefilled model state across batches.
Implications
ML infrastructure engineers and platform teams running multi-step agent applications will see lower GPU work and faster runs by adding workflow-aware caching and scheduling. Product and engineering leads building agent orchestration can use these ideas to reduce cost and tail latency without changing model outputs. Researchers working on agent reliability and multi-agent orchestration can adopt the optimizer and prompt-structure ideas to improve throughput for evaluation and continuous testing. See how governance and evaluation can be strengthened with Evaluation-Driven Development (EDDOps).
Key Figures

Figure 1. Three disparities between traditional SQL pipelines and agentic workflows with LLMs as operators.

Figure 2. Each representative agentic workflow demonstrates a primitive pattern in agent interactions.

Figure 3. Overview of Helium's architecture.

Figure 4. A workflow DAG (top) and the corresponding templated radix tree with cache-aware schedule (bottom).
Yes, But...
Helium assumes on-prem or managed GPU deployments and that workflows use the same base model; it does not target remote API-based LLMs. Cache benefits depend on repetition: highly unique or extremely dynamic prompts offer less reuse. Caching is restricted to deterministic operators (e.g., greedy decoding) and is subject to GPU memory limits and eviction behavior, so performance varies with cache capacity and workload mix. Insights from Consensus-Based Decision Pattern can inform how to reason about caching under conflicting operator requirements.
Methodology & More
Helium models multi-step agent workflows as directed graphs where each node is an operator that may invoke the model. Instead of treating every model call as an independent black box, it applies classic query-optimization ideas: remove dead branches, merge identical subgraphs, and replace computations with cache lookups when inputs are identical. During compilation Helium builds a templated radix tree that captures shared prompt prefixes (both static text and placeholders filled by other operators). That tree guides a cache-aware scheduler which assigns calls to workers and orders execution to maximize reuse of prefilled model state.
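A templated radix tree of this kind can be sketched as a trie over prompt segments (static text and placeholders). A depth-first walk then yields a cache-aware order: calls under the same subtree run consecutively, so their shared prefix stays warm between them. All names here are illustrative assumptions, not Helium's actual data structures.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # prompt segment -> child node
        self.calls = []      # ids of LLM calls whose prompt ends here

def insert(root, segments, call_id):
    """Insert one call's prompt, expressed as a list of segments."""
    node = root
    for seg in segments:
        node = node.children.setdefault(seg, TrieNode())
    node.calls.append(call_id)

def schedule(node, order=None):
    """Depth-first walk: calls sharing a prefix subtree are scheduled
    back-to-back so the prefilled state for that prefix is reused."""
    if order is None:
        order = []
    order.extend(node.calls)
    for child in node.children.values():
        schedule(child, order)
    return order

root = TrieNode()
insert(root, ["<sys>", "<examples>", "analyze AAPL"], "call_a")
insert(root, ["<sys>", "<examples>", "analyze MSFT"], "call_b")
insert(root, ["<sys>", "debate round 1"], "call_c")
print(schedule(root))  # -> ['call_a', 'call_b', 'call_c']
```

Helium's actual scheduler additionally weighs token work and precedence delays with a cost model; this sketch only shows the prefix-grouping idea.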
At runtime Helium proactively pre-warms a global key-value cache for static prompt prefixes and maintains a prompt-output cache for deterministic operators. The optimizer rewrites the logical plan to insert cache fetches where possible, turning expensive LLM operators into cheap data retrievals. For scheduling it uses a cost model based on token work and precedence delays; the exact scheduling problem is hard, so Helium uses a greedy, cache-aware heuristic over the operator-level tree. Evaluations show up to 1.56× speedups on simple patterns and 1.34× on a complex trading workflow, with negligible planning overhead and identical outputs. Practical trade-offs include on-prem deployment, reliance on workload repeatability, and limited applicability to non-deterministic decoding or remote API models. See how these ideas align with LLM-as-Judge Pattern and consider how an Agent Registry Pattern could help manage operator-level caching and scheduling across teams.
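The prompt-output cache for deterministic operators can be sketched as a simple memo table: under greedy decoding, an identical (model, prompt) pair always yields the same output, so a completed call can be replaced by a lookup. `run_llm` below is a hypothetical stand-in for the real model invocation, not Helium's API.

```python
import hashlib

class PromptOutputCache:
    def __init__(self):
        self._store = {}

    def _key(self, model, prompt):
        # Hash model id and prompt together so different models never collide.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_run(self, model, prompt, run_llm):
        key = self._key(model, prompt)
        if key not in self._store:          # miss: pay for the LLM call once
            self._store[key] = run_llm(prompt)
        return self._store[key]             # hit: cheap data retrieval

# Demo with a fake deterministic "model" that counts invocations.
calls = {"n": 0}
def fake_llm(prompt):
    calls["n"] += 1
    return prompt.upper()

cache = PromptOutputCache()
cache.get_or_run("m1", "summarize q1", fake_llm)
cache.get_or_run("m1", "summarize q1", fake_llm)  # served from cache
assert calls["n"] == 1  # the model ran only once
```

This is also why the paper restricts caching to deterministic decoding: with sampling, two identical prompts may legitimately produce different outputs, and a lookup would silently change semantics.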
Credibility Assessment:
arXiv preprint; no affiliations provided, and the authors have very low h-indices. Insufficient reputation or venue signals; likely an emerging group.