The Big Picture
A store-level AI that adjusts a single optimizer weight in real time boosted batching and cut courier active time without hurting customer delivery quality.
The Evidence
A lightweight control layer that chooses a small set of objective-weight multipliers per store can steer a complex dispatch optimizer to better trade off speed and batching. Trained offline from delayed marketplace outcomes and deployed at scale, the policy increased batching and reduced excess courier time while preserving customer-facing delivery quality. The constrained action interface kept existing optimizer safeguards and made production deployment reliable and scalable.
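As a rough illustration of the constrained action interface, the sketch below picks one multiplier from a small discrete set and applies it to a baseline speed weight. The multiplier values, the function names, and the greedy selection rule are assumptions made for this example; the source does not list the actual action set or the serving code.

```python
from typing import Sequence

# Hypothetical action set: the source only says "a small set" of multipliers.
MULTIPLIERS = (0.5, 0.75, 1.0, 1.25, 1.5)


def select_multiplier(
    q_values: Sequence[float],
    multipliers: Sequence[float] = MULTIPLIERS,
) -> float:
    """Greedy choice: return the multiplier with the highest estimated value."""
    if len(q_values) != len(multipliers):
        raise ValueError("one value estimate per multiplier is required")
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return multipliers[best]


def effective_speed_weight(base_weight: float, multiplier: float) -> float:
    """Scale the optimizer's baseline speed weight (assumed multiplicative)."""
    return base_weight * multiplier


# Example: value estimates for one store at one decision point (made-up numbers).
chosen = select_multiplier([0.1, 0.4, 0.3, 0.2, 0.0])
weight = effective_speed_weight(2.0, chosen)
```

Because the policy only returns a scalar multiplier, the optimizer's own feasibility checks and safeguards stay in the loop unchanged.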
Data Highlights
1. Deployed at scale: hundreds of millions of policy inferences per day, served at a ~20-second cadence.
2. Field experiment: about 4,000 geographic regions were randomized with 2-hour switchback intervals over a two-week period.
3. Reward reweighting with α = 0.9 kept the learned policy from collapsing to extreme behaviors, and conservative offline regularization reduced overestimation of unsupported actions during training.
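The switchback design in the second highlight can be sketched as a deterministic assignment of each (region, 2-hour window) pair to an arm. The hashing scheme, the salt, and the function names here are assumptions for illustration; the source states only the region count, the interval length, and the duration.

```python
import hashlib
from datetime import datetime, timezone

WINDOW_SECONDS = 2 * 60 * 60  # 2-hour switchback interval, as stated in the source


def window_index(ts: datetime) -> int:
    """Index of the 2-hour window containing a timezone-aware timestamp."""
    return int(ts.timestamp()) // WINDOW_SECONDS


def assign_arm(region_id: str, ts: datetime, salt: str = "example-salt") -> str:
    """Hash (salt, region, window) to an arm so assignment is reproducible.

    The hash-based split is an assumed mechanism, not the authors' method.
    """
    key = f"{salt}:{region_id}:{window_index(ts)}".encode()
    digest = hashlib.sha256(key).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"


t1 = datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 1, 1, 55, tzinfo=timezone.utc)  # same window as t1
t3 = datetime(2024, 1, 1, 2, 5, tzinfo=timezone.utc)   # next window
```

Within a window a region always sees the same arm, and the arm can change at each window boundary, which is what lets the same region serve as its own control over time.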
What This Means
Engineers building dispatch, routing, or marketplace systems who want adaptive controls without replacing core optimizers should study this approach. Product and ops leaders at delivery marketplaces can use store-level objective adaptation to improve courier efficiency while maintaining customer quality.
Key Figures

Fig 1: Agentic objective-weight adaptation loop for production dispatch. During online serving, a policy agent observes local states and selects an objective-weight multiplier that parameterizes the assignment optimizer. The optimizer remains responsible for feasible courier-order assignment decisions. During offline learning, logged runs are joined with delayed marketplace outcomes to construct transition data tuples for policy training and deployment.

Fig 2: Logistics timing components used for delayed reward attribution. ASAP measures customer-facing delivery duration, and XCAT measures excess courier active time (CAT) beyond the direct route. The RL-selected ASAP-weight multiplier steers the optimizer’s speed-efficiency tradeoff: higher weights favor faster delivery completion, while lower weights make batching-compatible assignments more attractive. Observed ASAP and XCAT outcomes are joined back to dispatch runs for reward construction.
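To make the two timing components concrete, the sketch below computes XCAT from courier active time and direct-route time, then combines ASAP and XCAT into a scalar reward. The source does not give the reward formula: the convex combination, the choice to put α on the ASAP term, and the floor at zero are all assumptions made only for illustration.

```python
def xcat(courier_active_time: float, direct_route_time: float) -> float:
    """Excess courier active time beyond the direct route.

    Flooring at zero is an assumption; the source only defines XCAT in words.
    """
    return max(0.0, courier_active_time - direct_route_time)


def reward(asap_minutes: float, xcat_minutes: float, alpha: float = 0.9) -> float:
    """Hypothetical reward: negative weighted cost of ASAP and XCAT.

    alpha = 0.9 is the value reported in the source, but how alpha enters
    the reward is assumed here, not taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return -(alpha * asap_minutes + (1.0 - alpha) * xcat_minutes)


# Made-up example: 30-minute delivery, 25 minutes of CAT on a 15-minute direct route.
example_xcat = xcat(25.0, 15.0)
example_reward = reward(30.0, example_xcat)
```

Under this assumed form, a larger α weights customer-facing speed more heavily, while a smaller α rewards assignments that cut excess courier time.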

Fig 3: Empirical action distributions from DoorDash production logs during a Friday dinner peak, showing state-dependent ASAP-weight multipliers across backlog, supply pressure, and CWT.

Fig 4: Offline training MSE loss across epochs. The DQN baseline without CQL regularization minimizes MSE faster, while OWA-RL maintains a higher loss due to the conservative penalty used to reduce unsupported-action overestimation.
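The gap between the two curves in Fig 4 follows from how a conservative penalty is added on top of the usual temporal-difference error. The sketch below uses the standard discrete-action CQL regularizer (log-sum-exp of Q over all actions minus Q of the logged action); whether OWA-RL uses exactly this variant, and the `cql_weight` parameter, are assumptions.

```python
import math
from typing import Sequence


def logsumexp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def td_mse(q_rows, actions, targets) -> float:
    """Mean squared error between Q(s, a_logged) and the TD target."""
    errs = [(q[a] - y) ** 2 for q, a, y in zip(q_rows, actions, targets)]
    return sum(errs) / len(errs)


def cql_penalty(q_rows, actions) -> float:
    """Mean of logsumexp_a Q(s, a) - Q(s, a_logged); pushes down unlogged actions."""
    gaps = [logsumexp(q) - q[a] for q, a in zip(q_rows, actions)]
    return sum(gaps) / len(gaps)


def conservative_loss(q_rows, actions, targets, cql_weight: float = 1.0) -> float:
    """TD error plus weighted conservative penalty (cql_weight is hypothetical)."""
    return td_mse(q_rows, actions, targets) + cql_weight * cql_penalty(q_rows, actions)


# Made-up batch: two states, three actions each.
q_batch = [[1.0, 2.0, 0.0], [0.5, 0.5, 0.5]]
logged_actions = [1, 0]
td_targets = [2.0, 0.5]
```

Since the log-sum-exp over several actions always exceeds any single Q-value, the penalty is strictly positive, so the conservative loss sits above the plain MSE even when the TD error is zero.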
Yes, But...
Rewards are constructed from delayed, region-level outcomes, which makes credit assignment for single store actions noisy. The policy only controls a low-dimensional multiplier rather than direct assignments, so some opportunities that require richer actions remain unreachable. Offline training reliability depends on marketplace stability; distribution shifts after deployment require careful monitoring and rollback plans.
Methodology & More
A two-layer dispatch design put a lightweight learning agent in front of the production assignment optimizer. Each store-level agent observes local state and picks one of a few discrete multipliers that scale the optimizer’s delivery-speed objective; the optimizer still produces feasible courier-order assignments. Logged dispatch runs were joined with delayed marketplace metrics (such as customer delivery time and excess courier active time) to form training data. Offline training used conservative regularization to avoid overestimating actions not well supported by logs, and a reward-reweighting diagnostic helped pick α = 0.9 to keep the learned policy from collapsing to extreme behaviors.

A large randomized switchback experiment (roughly 4,000 regions toggled every two hours across two weeks) compared the adaptive policy to the static baseline. Results in production showed higher batching and reduced courier-side time cost while keeping customer-facing delivery quality and cancellations stable. Because the agent only manipulates a single, low-dimensional interface, it preserved existing operational safeguards and served reliably (hundreds of millions of inferences daily at a ~20-second cadence). The approach trades direct control for safer, faster deployment; future work should add distribution-shift detection, richer decision layers, and interpretability tools to track interactions across many agents.
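The step of joining logged dispatch runs with delayed marketplace metrics can be sketched as a keyed lookup that emits (state, action, reward, next state) tuples. All field names, the `run_id` join key, and the rule of skipping runs without an observed outcome are assumptions for this example; the source does not describe the logging schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LoggedRun:
    """One dispatch run as logged at serving time (hypothetical schema)."""
    run_id: str
    store_id: str
    state: Tuple[float, ...]
    action: int  # index of the multiplier chosen for this run
    next_state: Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    """Training tuple for offline RL."""
    state: Tuple[float, ...]
    action: int
    reward: float
    next_state: Tuple[float, ...]


def build_transitions(
    runs: Iterable[LoggedRun],
    reward_by_run: Dict[str, float],
) -> List[Transition]:
    """Attach delayed rewards to logged runs, dropping runs with no outcome yet."""
    transitions = []
    for run in runs:
        r = reward_by_run.get(run.run_id)
        if r is None:
            continue  # delayed outcome not observed; assumed to be excluded
        transitions.append(Transition(run.state, run.action, r, run.next_state))
    return transitions


# Made-up data: two runs, only one of which has a joined outcome.
runs = [
    LoggedRun("r1", "store-a", (0.2, 1.0), 1, (0.3, 0.9)),
    LoggedRun("r2", "store-a", (0.3, 0.9), 2, (0.1, 1.1)),
]
dataset = build_transitions(runs, {"r1": -28.0})
```

Keeping the join separate from serving means the online path only logs states and actions, and rewards can arrive whenever the delayed metrics are finalized.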
Credibility Assessment:
Authors have low h-indices and there are no clear high-profile affiliations or top-venue publication — emerging-level credibility.