In Brief
Giving indoor robots a shared world model fed by home sensors (like cameras) cuts wasted searching, speeds up multi-robot tasks, and lowers the planning cost of using large language models.
The Evidence
A new benchmark and framework called IndoorR2X enforces realistic limited vision for each robot and adds simulated home sensors to build a single shared semantic world state. A central planner (a large language model) uses that fused state to make parallel plans for multiple robots, and an orchestrator executes and replans as needed. Compared with robots working alone or sharing maps only among themselves, adding ambient sensor feeds [/patterns/capability-discovery-pattern] reduced unnecessary exploration, shortened paths and action counts, and cut planner token usage, while remaining robust to missing sensors but vulnerable to incorrect semantic reports. Coordination quality improves with stronger planners, but coordination overhead rises as teams grow larger.
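The fused world state described above can be sketched as a simple last-writer-wins store that merges robot camera and ambient sensor observations. This is a minimal illustration, not the paper's implementation; the class and field names (`ObjectState`, `WorldModel`, `ingest`) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ObjectState:
    name: str
    room: str
    container: Optional[str]  # nested containment, e.g. a mug inside a cabinet
    timestamp: float          # when the observation was made
    source: str               # "robot_cam", "cctv", etc.

@dataclass
class WorldModel:
    objects: dict = field(default_factory=dict)

    def ingest(self, obs: ObjectState) -> bool:
        """Keep only the freshest report per object; return True if updated."""
        prev = self.objects.get(obs.name)
        if prev is None or obs.timestamp > prev.timestamp:
            self.objects[obs.name] = obs
            return True
        return False

wm = WorldModel()
wm.ingest(ObjectState("mug", "kitchen", "cabinet", 10.0, "robot_cam"))
wm.ingest(ObjectState("mug", "living_room", None, 42.0, "cctv"))  # fresher CCTV report wins
print(wm.objects["mug"].room)  # -> living_room
```

The point of the sketch: an ambient camera can move an object's believed location without any robot revisiting the room, which is where the reduced exploration comes from.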
Data Highlights
85 multi-room environments used for evaluation, mixing 10 curated homes and 75 modular apartments
About 50% of each environment’s area was randomly covered by simulated CCTV feeds that populate the shared world model
Teams of 2–6 robots were tested; success rate stayed stable for team sizes up to 5, while total travel and coordination overhead increased with more robots
Why It Matters
This matters to engineers building indoor robot fleets and technical leads evaluating multi-robot deployments, because the work shows how adding inexpensive ambient sensing can cut physical work and reduce how much planner compute you need. Researchers designing multi-agent planners and system architects for smart buildings can use the benchmark and framework to test how sensor placement and signal reliability affect team behavior. [/patterns/evaluation-driven-development-pattern]
Key Figures

Figure 1: Motivation for IndoorR2X. Augmenting robot perception with global IoT context via LLMs for efficient coordination.

Figure 2: Our IndoorR2X framework. CCTV observations and other IoT device signals are collected to augment the world model beyond the perception range of the robots’ ego cameras. These heterogeneous observations are synchronized through a coordination hub, where an LLM-based online planner generates parallel actions for each robot and executes them to perform their respective tasks. As an example scenario, robots are assigned to perform household tasks in the morning. After potential overnight changes to object locations or device statuses (e.g., TVs), robots first update their indoor world model by leveraging the “X” observations.

Figure 3: Scalability analysis. Success rate (left) and efficiency metrics (center/right) as a function of team size (N = 2 to 6). While success remains stable up to N = 5, the coordination overhead (total distance traveled) increases with fleet size.

Figure 5: Qualitative demonstration of IndoorR2X (simulation environment). Three robots and IoT sensors coordinate to efficiently dispose of perishables, power down devices, and consolidate items in the family room.
Yes, But...
Results come mostly from simulated indoor environments (AI2-THOR) with initial real-world trials, so real deployments may expose more noise and integration issues. The system tolerates missing sensor data but is sensitive to incorrect semantic information, so verification and trust signals for IoT inputs are essential. Coordination overhead and planner cost grow with team size, so benefits may diminish without careful orchestration and stronger planners for larger fleets. [/patterns/capability-attestation-pattern]
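Since the system is tolerant of missing reports but sensitive to wrong ones, a natural mitigation is to gate IoT semantic reports before they enter the world model. The sketch below is a hypothetical verification rule, not part of IndoorR2X; the function name, trust scores, and threshold are all assumptions.

```python
def accept_report(report_state, robot_observation, sensor_trust, threshold=0.7):
    """Accept an IoT semantic report only if a robot corroborates it,
    or (absent any robot view) the sensor's trust score clears a threshold."""
    if robot_observation is not None:
        return report_state == robot_observation  # direct verification wins
    return sensor_trust >= threshold              # otherwise fall back on trust

# A camera claiming the TV is on, contradicted by a robot's own view:
print(accept_report("tv_on", "tv_off", sensor_trust=0.9))  # -> False
# No robot nearby: accept the report from a sufficiently trusted sensor.
print(accept_report("tv_on", None, sensor_trust=0.9))      # -> True
```

A rule like this trades a little coverage (rejecting some true reports) for protection against the incorrect-semantics failure mode the experiments flagged.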
Deep Dive
IndoorR2X builds a realistic testbed for indoor multi-robot coordination by enforcing partial observability: each robot only knows what it has seen. To expand shared knowledge without making any single robot omniscient, static home sensors (simulated CCTV and other device signals) are fed through vision-language processing into a coordination hub [/patterns/supervisor-pattern]. That hub maintains a global semantic state (robots, objects, rooms, timestamps and nested containment relationships) and exposes it to an online planner based on a large language model. The planner outputs a dependency-style, parallelizable plan; an orchestrator issues actions to robots, watches outcomes, updates the world model, and triggers replanning when needed. In controlled experiments across 85 environments, the team compared three modes: isolated robots, robots sharing maps with each other, and robots plus ambient sensors (robot-to-everything). Adding the ambient sensors consistently reduced redundant exploration, lowered path lengths and number of actions, and decreased the token cost the language model used for planning. The gains depend on planner capability (better planners make better coordination choices) and on team size: success stayed high up to five robots but total travel and coordination overhead grew with more agents. The setup handled missing IoT reports well, but incorrect semantic reports (wrong object states) caused more serious failures, highlighting the need for sensor verification and trust signals in production systems.
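The plan/execute/replan loop described above can be illustrated with a dependency-style plan: steps whose dependencies are satisfied can be dispatched to robots in parallel. The plan format (dicts with `id`/`robot`/`deps`) and function names below are illustrative assumptions, not the paper's actual schema.

```python
def runnable_steps(plan, done):
    """Steps whose dependencies are all satisfied can run in parallel."""
    return [s for s in plan
            if s["id"] not in done and all(d in done for d in s["deps"])]

plan = [
    {"id": "find_keys", "robot": "r1", "deps": []},
    {"id": "open_door", "robot": "r2", "deps": []},
    {"id": "fetch_bag", "robot": "r1", "deps": ["find_keys"]},
    {"id": "load_car",  "robot": "r2", "deps": ["open_door", "fetch_bag"]},
]

done = set()
while len(done) < len(plan):
    batch = runnable_steps(plan, done)
    if not batch:        # no step can run: deadlock or failure, trigger replanning
        break
    for step in batch:   # the orchestrator would dispatch these concurrently
        done.add(step["id"])
    # in the real loop, action outcomes update the world model here,
    # and failed steps trigger the LLM planner to replan
print(sorted(done))  # -> ['fetch_bag', 'find_keys', 'load_car', 'open_door']
```

The first iteration dispatches `find_keys` and `open_door` to both robots at once, which is exactly the parallelism the dependency structure buys over a sequential plan.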
Credibility Assessment:
Multiple authors, but with low h-indexes (<10), no listed affiliations, and an arXiv-only venue: a modest team size with limited reputation signals.