At a Glance
AI agents struggle to reason across long ICU records: boosting action coverage often produces many unsafe recommendations, and adding structured memory helps but does not remove safety risks.
What They Found
RealICU measures how well language-model agents support four bedside tasks—assessing status, spotting acute problems, recommending actions, and flagging unsafe moves—using hindsight physician labels. Models that recommend more actions tend to produce a large share of potentially harmful suggestions, and agents often stick to early impressions even when later data contradicts them. A structured memory agent that tracks recent observations, trends, critical events, summaries, and patient-specific insights improves reasoning but still leaves unacceptable safety failures.
By the Numbers
1. Up to 47.3% of recommended actions were flagged as potentially harmful when models increased recall.
2. ICU-Evo with a top model reached only 0.459 accuracy on Patient Status prediction.
3. RealICU-Scale provides 11,862 hindsight-labeled 30-minute windows (RealICU-Gold has 930 physician-labeled windows).
What This Means
Engineers building clinical decision support should care because the benchmark exposes real failure modes that matter at the bedside. It shows the need for pre-deployment checks and for prioritizing interventions that reduce unsafe recommendations rather than only improving coverage.
Key Figures

Figure 1: ICU decisions are made under massive data volume and time pressure. An ICU AI co-pilot integrates data streams into a decision-support panel that assesses Patient Status, identifies Acute Problems, proposes Recommended Actions, and warns against unsafe Red Flag actions.

Figure 2: Left: Data pipeline for RealICU-Gold and RealICU-Scale. Right: Data samples for a patient ICU trajectory. For each evaluation window, RealICU provides raw observation data and clinical labels, including patient status, acute problems, action recommendation, and red flag action.
![Figure 3: Temporal performance on RealICU-Scale](https://arxiv.org/html/2605.13542v1/2605.13542v1/x3.png)
Figure 3: Temporal performance on RealICU-Scale (Gemini-3.1-pro [Gemini31Pro2026]). ICU-Evo demonstrates its advantage on Patient Status and Acute Problems even on trajectories of up to 1,800 hours.
![Figure 4: Temporal performance over the full ICU stay on RealICU-Scale](https://arxiv.org/html/2605.13542v1/2605.13542v1/x4.png)
Figure 4: Temporal performance over the full ICU stay on RealICU-Scale (GPT-5.4 [OpenAIGPT54_2026]).
Considerations
RealICU is built from the MIMIC-IV single-center dataset, so results may not generalize to hospitals with different documentation or care patterns. The evaluation focuses on text-based EHR data and does not include imaging or waveform signals. Experiments ran a single trial per model configuration, so reported numbers lack measured variance across runs.
Methodology & More
RealICU is a new benchmark that evaluates language-model agents on four clinician-driven tasks across evolving ICU trajectories: Patient Status, Acute Problems, Recommended Actions, and Red Flag (unsafe) actions. Instead of treating recorded clinician actions as the ground truth, each evaluation window is labeled by hindsight physician judgment—what was actually best after seeing the full stay. The dataset has a small high-quality subset (RealICU-Gold, 930 windows) and a larger scale set (RealICU-Scale, 11,862 windows) generated with a physician-validated evaluator to expand coverage.
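To make the per-window labels concrete, here is a minimal sketch of how one window's action recommendations might be scored against hindsight labels. The class and function names, the exact-string matching, and the metric definitions (recall@k as the share of reference actions found in the top k predictions, harmful fraction as the share of predictions that match a red-flag label) are assumptions for illustration; the benchmark's own matching and scoring procedure may differ.

```python
from dataclasses import dataclass, field


@dataclass
class WindowLabels:
    """Hindsight labels for one 30-minute window (field names are illustrative)."""
    patient_status: str
    acute_problems: set = field(default_factory=set)
    recommended_actions: set = field(default_factory=set)
    red_flag_actions: set = field(default_factory=set)


def recall_at_k(predicted: list, labels: WindowLabels, k: int = 5) -> float:
    """Share of reference actions that appear among the top-k predictions."""
    if not labels.recommended_actions:
        return 0.0
    top_k = set(predicted[:k])
    return len(top_k & labels.recommended_actions) / len(labels.recommended_actions)


def harmful_fraction(predicted: list, labels: WindowLabels) -> float:
    """Share of predicted actions that match a red-flag (unsafe) label."""
    if not predicted:
        return 0.0
    unsafe = sum(action in labels.red_flag_actions for action in predicted)
    return unsafe / len(predicted)


# Hypothetical window: exact string matching is assumed for simplicity.
labels = WindowLabels(
    patient_status="deteriorating",
    acute_problems={"hypotension"},
    recommended_actions={"fluid bolus", "start norepinephrine"},
    red_flag_actions={"give beta blocker"},
)
predicted = ["fluid bolus", "give beta blocker", "order chest x-ray"]

recall = recall_at_k(predicted, labels, k=5)
harm = harmful_fraction(predicted, labels)
```

Under these definitions, an agent that lists more actions can raise recall while also raising the harmful fraction, which is the tension the benchmark reports.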
Agents were tested under several context strategies (all prior observations, only the local 30-minute window, or retrieved past windows). A proposed agent called ICU-Evo maintains a structured memory with five components (working memory, trend memory, a critical-event log, compressed trajectory summaries, and patient-specific insights) to approximate the evolving patient state. Across frontier models and setups, performance remains low (for example, 0.459 accuracy on status and 0.534 recall@5 on actions for a top model), and two failure modes dominate: a recall-safety tradeoff (recommending more actions often brings in many unsafe ones) and anchoring bias (failure to update early conclusions when later data contradicts them). The benchmark shows that longer context and structured memory improve some reasoning but are insufficient for safe ICU co-pilots, highlighting the need for stronger safety mechanisms, multi-center validation, and multimodal data before deployment. There is also room to explore the Model Context Protocol (MCP) Pattern and related approaches to improve reliability over evolving ICU trajectories, and researchers can consider the Capability Discovery Pattern to augment decisions with diverse, verifiable signals.
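The five-component memory can be pictured with a small data structure. The following is a rough sketch under stated assumptions, not ICU-Evo's implementation: the class name, field names, update rule, the `low_limits` threshold table, and the `map_mmhg` variable are all hypothetical, and the summaries and insights here are plain placeholders for what would presumably be model-generated text.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StructuredMemory:
    """Five-part memory; every field name and rule here is illustrative."""
    working: deque = field(default_factory=lambda: deque(maxlen=2))  # recent windows
    trends: dict = field(default_factory=dict)           # per-variable value history
    critical_events: list = field(default_factory=list)  # threshold breaches
    summaries: list = field(default_factory=list)        # compressed older windows
    insights: list = field(default_factory=list)         # patient-specific notes

    def update(self, t: int, window: dict, low_limits: dict) -> None:
        # Record a placeholder summary for the oldest window before it is evicted.
        if len(self.working) == self.working.maxlen:
            old_t, old_window = self.working[0]
            self.summaries.append(f"window {old_t}: {sorted(old_window)} recorded")
        self.working.append((t, window))
        for name, value in window.items():
            self.trends.setdefault(name, []).append(value)
            limit = low_limits.get(name)
            if limit is not None and value < limit:
                self.critical_events.append(f"window {t}: {name}={value} below {limit}")

    def render(self) -> str:
        """Flatten the memory into text that could be placed in a model prompt."""
        recent = "; ".join(f"window {t}: {w}" for t, w in self.working)
        latest = {name: values[-3:] for name, values in self.trends.items()}
        return "\n".join([
            f"Recent: {recent}",
            f"Trends: {latest}",
            f"Critical events: {self.critical_events}",
            f"Summaries: {self.summaries}",
            f"Insights: {self.insights}",
        ])


memory = StructuredMemory()
limits = {"map_mmhg": 65.0}  # hypothetical lower limit for mean arterial pressure
stream = [{"map_mmhg": 78.0}, {"map_mmhg": 70.0}, {"map_mmhg": 61.0}]
for t, window in enumerate(stream):
    memory.update(t, window, limits)
```

Keeping the critical-event log separate from the bounded working memory means an early breach stays visible after its window has been evicted, while the trend history gives the agent material for revising an early impression.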
Credibility Assessment:
Includes Daniel Rueckert (h-index ~14) and institutional signals; mixed author h-indices but presence of a mid‑career established researcher raises credibility.