
Key Takeaway

AD-CARE is an AI agent that combines whatever patient data is available (imaging, tests, genetics) to produce guideline-aligned Alzheimer’s assessments that are more consistent across centers and help clinicians diagnose faster.

Core Insights

AD-CARE adapts to incomplete, real-world clinical data by planning which specialized tools to run and then assembling a guideline-concordant report. Evaluated on 10,303 cases from six cohorts, it achieves stable diagnostic performance across diverse sites and reduces variability across racial and age subgroups. In a reader study, clinicians who used AD-CARE made more accurate diagnoses and took less time per case. The system also works with a range of language model backbones, including smaller, lower-cost models, keeping performance gains while reducing deployment cost.

Data Highlights

1. Evaluated on 10,303 clinical cases pooled from six cohorts (four public, two in-house).
2. Benchmarked with eight different language-model backbones; AD-CARE consistently improved accuracy over raw model outputs for every backbone tested.
3. Reader study showed clinician performance gains and shorter per-case read times when clinicians used AD-CARE (statistically significant improvements reported in the study).

What This Means

Clinical AI engineers and product leads who need a practical way to combine imaging, cognitive tests, and genetics into a single, explainable diagnostic workflow should pay attention: AD-CARE shows a path to robust, deployable decision support. Neurologists, radiologists, and health system leaders evaluating tools to reduce diagnostic variability and speed case review can use AD-CARE to augment accuracy and efficiency in memory clinics.

Key Figures

Figure 1: AD-CARE Agent framework and overall strategy. (a) AD-CARE for AD diagnosis was developed using multi-modal data, including individual-level demographics, imaging, neurological tests, genetic information, functional evaluation, and biospecimen results. Multi-modal data are processed by AD-CARE's three components (reasoning engine, outcome aggregator, and specialized executors), generating multi-modal outputs (diagnosis result, confidence, diagnosis report, and visualization results). (b) Agent workflow: given a user query, the framework performs reasoning in four stages: (i) observation, (ii) thought, (iii) action, and (iv) aggregation. (c) Validation on six diverse populations (n=10,303), including four public datasets and two in-house cohorts: We first evaluated AD-CARE against baseline methods using four metrics. We then assessed fairness with respect to race and age. Next, we conducted a reader study with agent augmentation. Finally, we benchmarked AD-CARE using eight representative LLM backbones.
Figure 3: Fairness analysis of AD-CARE and baseline methods. (a) Racial subgroups (Asian, Black, White). (b) Age subgroups (<65, 65–74, 75–84, ≥85). Bars show subgroup performance on four metrics. Lines (right axis) show fairness dispersion (standard deviation and max–min gap across subgroups). AD-CARE delivers both high diagnostic performance and lower variability across demographic groups compared with baseline methods, indicating improved robustness and fairness across race and age.
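The two dispersion measures named in the Figure 3 caption (standard deviation and max–min gap across subgroups) can be illustrated with a minimal Python sketch. The function name, the use of the population standard deviation, and the subgroup scores below are assumptions made for illustration; the source does not state which standard deviation variant was used, and the numbers are not results from the study.

```python
from statistics import pstdev


def fairness_dispersion(subgroup_scores):
    """Return (standard deviation, max-min gap) of one metric across subgroups.

    Assumption: population standard deviation; the study may use another variant.
    """
    values = list(subgroup_scores.values())
    return pstdev(values), max(values) - min(values)


# Hypothetical per-subgroup accuracies, invented for illustration only.
example_scores = {"<65": 0.80, "65-74": 0.84, "75-84": 0.82, ">=85": 0.78}

std_dev, gap = fairness_dispersion(example_scores)
print(f"std={std_dev:.4f}, max-min gap={gap:.4f}")
```

Lower values on both measures mean the metric is more uniform across subgroups, which is the sense in which the caption describes lower variability.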
Figure 4: AD-CARE assistance improves clinicians’ diagnostic accuracy and efficiency. (a) Diagnostic performance of neurologists and radiologists with and without agent assistance, stratified by seniority level. Points denote mean performance estimates for doctor-only reads and doctor-plus-agent reads, and error bars indicate 95% confidence intervals obtained by bootstrap resampling. Metrics include accuracy, F1 score, sensitivity, and specificity. Across both specialties and experience levels, access to the agent yields consistent gains in all metrics. (b) Effect of AD-CARE assistance on per-case reading time. Violin plots depict the distribution of decision times for unaided clinicians (blue) and agent-assisted clinicians (orange) overall and within each subgroup and site (SYSUH neurologists, XWH radiologists). Boxes summarize median and mean times, and inset annotations report mean ± 95% CI and the corresponding efficiency gain (ratio of unaided to assisted time). AD-CARE assistance substantially shortens reading time for both neurologists and radiologists while preserving or improving diagnostic performance.
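To make the two quantities in the Figure 4 caption concrete, the sketch below computes a bootstrap 95% confidence interval for accuracy and the efficiency gain as the ratio of unaided to assisted reading time. It assumes a percentile bootstrap over cases, which the source does not specify; the function names, resample count, seed, and all data values are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample cases with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def efficiency_gain(unaided_times, assisted_times):
    """Ratio of mean unaided time to mean assisted time (>1 means faster with help)."""
    return mean(unaided_times) / mean(assisted_times)


# Hypothetical per-case outcomes: 1 = correct diagnosis, 0 = incorrect.
correct = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
# Hypothetical per-case reading times in seconds.
unaided = [120.0, 150.0, 90.0, 140.0]
assisted = [60.0, 75.0, 45.0, 70.0]

low, high = bootstrap_ci(correct)
print(f"accuracy={mean(correct):.2f}, 95% CI=({low:.2f}, {high:.2f})")
print(f"efficiency gain={efficiency_gain(unaided, assisted):.2f}x")
```

With so few hypothetical cases the interval is wide; the point of the sketch is the procedure, not the numbers.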
Figure 5: Benchmark comparison of AD-CARE with raw LLM backbones and cost–accuracy trade-off analysis. (a) Accuracy of standalone LLM backbones versus the corresponding LLM-powered AD-CARE system. Numbers indicate the absolute accuracy gain achieved by AD-CARE over the raw LLM output for each backbone. Across all eight models, AD-CARE consistently improves diagnostic accuracy, demonstrating the effectiveness of the framework. (b) AD-CARE accuracy versus overall inference cost for each instantiated backbone. Point colors denote the LLM provider. The dashed line denotes the Pareto frontier of non-dominated backbones, for which no alternative achieves higher accuracy at lower cost. Bubble diameter is proportional to the relative improvement ratio.
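The Pareto frontier described in the Figure 5 caption can be sketched as follows. This is a minimal illustration that uses the standard dominance definition (no worse on both axes and strictly better on at least one), which is slightly broader than the caption's wording; the backbone names, costs, and accuracies are hypothetical and are not taken from the paper.

```python
def pareto_frontier(points):
    """Return names of non-dominated (cost, accuracy) points, ordered by cost.

    A point is dominated if another point has cost <= and accuracy >=,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for name, (cost, acc) in points.items():
        dominated = any(
            other_cost <= cost
            and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for other, (other_cost, other_acc) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


# Hypothetical backbones as (relative inference cost, accuracy).
backbones = {
    "model_a": (1.0, 0.78),
    "model_b": (2.0, 0.83),
    "model_c": (2.5, 0.81),  # dominated by model_b: costs more, less accurate
    "model_d": (6.0, 0.86),
}

print(pareto_frontier(backbones))  # ['model_a', 'model_b', 'model_d']
```

Backbones on the frontier are the candidates worth comparing when choosing between deployment cost and accuracy; a dominated backbone is never the better choice on these two axes alone.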


Considerations

All evaluations were retrospective and used existing clinical labels from each cohort, so prospective trials are needed to confirm real-world clinical impact and safety. The in-house data are not public, which limits full external replication of the exact results. As with any language-model-driven system, careful validation of tool outputs, monitoring for incorrect reasoning, and clinical governance are required before deployment.

Methodology & More

AD-CARE is an agent-style system built around a language model that mimics a specialist’s diagnostic workflow: it inspects which data are available, plans which domain-specific tools to run (for example, MRI atrophy measures, biomarker checks, or genetic risk scoring), executes those tools, and aggregates the results into a structured, guideline-aligned diagnostic report. The key design choice is modality-agnosticism: the agent does not assume a fixed panel of inputs and adapts dynamically when certain tests are missing, which matches real-world clinic conditions where advanced imaging or biomarkers may be unavailable.

The system was evaluated on 10,303 cases from six cohorts spanning public and in-house datasets, with explicit fairness checks across race and age groups, a reader study in which clinicians reviewed cases with and without agent assistance, and benchmarks across eight language-model backbones to probe cost-versus-accuracy trade-offs. Results show that AD-CARE improves cross-center generalization, reduces performance variability across demographic subgroups, speeds up clinicians’ case review while improving diagnostic metrics, and retains gains even when paired with smaller, lower-cost model backbones.

The authors emphasize that AD-CARE offers transparent, guideline-grounded reasoning and intermediate outputs that clinicians can inspect, features intended to support safer integration into clinical workflows, while noting the need for prospective validation and governance around deployment.
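The inspect, plan, execute, and aggregate workflow described above can be illustrated with a small Python sketch. This is not the authors' implementation: the tool functions, field names, and case data are hypothetical, and the planning step is a rule-based stand-in for what the source describes as language-model reasoning. Its purpose is only to show how an agent can skip missing modalities rather than fail on them.

```python
from typing import Any, Callable, Dict, List

Case = Dict[str, Any]


# Hypothetical executors; real tools would wrap imaging or lab pipelines.
def mri_atrophy(data: Dict[str, Any]) -> str:
    return f"MRI atrophy measure: {data['hippocampal_volume_ml']} mL"


def biomarker_check(data: Dict[str, Any]) -> str:
    return f"Biomarker check: amyloid status {data['amyloid_status']}"


def genetic_risk(data: Dict[str, Any]) -> str:
    return f"Genetic risk: APOE genotype {data['apoe_genotype']}"


# Hypothetical registry mapping each modality to its specialized tool.
TOOLS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "mri": mri_atrophy,
    "biomarkers": biomarker_check,
    "genetics": genetic_risk,
}


def observe(case: Case) -> List[str]:
    """Observation: list the modalities actually present for this case."""
    return [m for m in TOOLS if case.get(m) is not None]


def plan(available: List[str]) -> List[str]:
    """Thought: choose tools to run (rule-based stand-in for LLM planning)."""
    return list(available)


def act(case: Case, planned: List[str]) -> Dict[str, str]:
    """Action: execute each planned tool on its modality's data."""
    return {m: TOOLS[m](case[m]) for m in planned}


def aggregate(findings: Dict[str, str]) -> Dict[str, Any]:
    """Aggregation: collect findings and flag what could not be assessed."""
    return {
        "findings": list(findings.values()),
        "missing_modalities": [m for m in TOOLS if m not in findings],
    }


def run_agent(case: Case) -> Dict[str, Any]:
    return aggregate(act(case, plan(observe(case))))


# Hypothetical case with no biomarker data available.
case = {
    "mri": {"hippocampal_volume_ml": 3.1},
    "biomarkers": None,
    "genetics": {"apoe_genotype": "e3/e4"},
}
report = run_agent(case)
print(report)
```

Recording the missing modalities in the report, instead of silently ignoring them, is one way to keep the intermediate output inspectable by a clinician.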
Credibility Assessment:

Multiple authors but no affiliations listed and generally low h-indices; arXiv preprint with no citations — signals point to emerging/limited information.