A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shifts at deployment. Distributionally Robust Reinforcement Learning (DRRL) optimizes worst-case performance within an uncertainty set of radius ε, but fixing ε exposes a sharp trade-off: small values yield strong nominal performance with weak robustness, while large values destabilize training and produce overly conservative policies.
DR-SPCRL removes this trade-off by treating ε as a continuous curriculum. The robustness budget is adapted online from the agent's own per-state response to perturbation, balancing nominal return against robustness throughout training. Across continuous-control benchmarks, DR-SPCRL improves episodic return under perturbation by an average of 24.1% over fixed or heuristic schedules.
Drag the slider to expand the robustness budget ε. Every world inside the blue ball is a perturbation the policy must survive. Hover over any dot to watch the nominal Hopper try, and fail, in that world.
Each dot is one perturbed world (action, observation, or environment noise; the type is labeled on hover). Color shows the nominal policy's return there: red is failing, green is near-nominal. As ε grows, the ball covers more failing worlds; that is the cost of using a non-robust policy.
Fixing ε forces a sharp trade-off. A small ε preserves nominal performance, but the policy collapses under shift; a large ε destabilizes training. Scroll to watch each strategy draw in.
A continuous curriculum over ε avoids both failure modes: it starts near nominal and only opens the budget as the agent can handle it.
Instead of fixing ε or annealing it linearly, DR-SPCRL learns it from the agent's own dual variables β, the Lagrange multipliers of the robust Bellman backup.
Let $P_0(\cdot \mid s,a)$ be the nominal transition kernel and let $\mathcal{B}_\varepsilon(P_0)$ be the KL ambiguity ball of radius $\varepsilon$:
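Assuming the per-pair (rectangular) form suggested by the one-step constraint $D_{\mathrm{KL}}(P\,\Vert\,P_0)\le\varepsilon$ used in the backup, the ball can be written as follows. The "for all $(s,a)$" quantifier is an assumption rather than something stated on this page.

```latex
\mathcal{B}_\varepsilon(P_0)
  = \Big\{ P \;:\;
      D_{\mathrm{KL}}\big(P(\cdot \mid s,a)\,\Vert\,P_0(\cdot \mid s,a)\big) \le \varepsilon
      \quad \forall (s,a) \Big\}
```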
DRRL trains a policy that performs well under the worst transition kernel in this ball:
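In a standard discounted formulation, this objective can be sketched as below. The reward $r$, the discount factor $\gamma$, and the shorthand $J(\pi,P)$ are assumed notation, not symbols defined on this page.

```latex
\max_{\pi}\;\min_{P \in \mathcal{B}_\varepsilon(P_0)}\; J(\pi, P),
\qquad
J(\pi, P) := \mathbb{E}_{\pi, P}\Big[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\Big]
```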
$\varepsilon$ controls the strength of the adversary. Larger $\varepsilon$ admits more distorted dynamics, yielding more conservative but more robust policies; smaller $\varepsilon$ recovers the nominal objective.
Fix a state-action pair $(s,a)$ and a value function $V$ produced by the current policy. The one-step DR Bellman backup faces the inner problem $\min_{P}\;\mathbb{E}_{s'\sim P}\!\big[V(s')\big]$ subject to $D_{\mathrm{KL}}(P\,\Vert\,P_0)\le\varepsilon$. Forming the Lagrangian and applying the Donsker–Varadhan / Legendre identity for KL-constrained expectations gives the closed form
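For reference, the standard KL dual of this inner problem, written for the fixed pair $(s,a)$ with $P_0$ short for $P_0(\cdot \mid s,a)$, is:

```latex
\min_{P:\,D_{\mathrm{KL}}(P\Vert P_0)\le\varepsilon}\;
  \mathbb{E}_{s'\sim P}\big[V(s')\big]
  \;=\;
  \max_{\beta \ge 0}\;
  \Big\{ -\beta \log \mathbb{E}_{s'\sim P_0}\big[e^{-V(s')/\beta}\big] - \beta\,\varepsilon \Big\}
```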
The dual variable $\beta(s,a)\ge 0$ is the Lagrange multiplier on the KL constraint and acts as a state-dependent temperature: it turns the worst-case min over $s'$ into a soft-min (log-sum-exp) over next-state values. The optimal $\beta^\star(s,a)$ satisfies the stationarity condition
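One way to write this condition is obtained by differentiating the standard KL dual objective $-\beta\log\mathbb{E}_{s'\sim P_0}[e^{-V(s')/\beta}]-\beta\varepsilon$ with respect to $\beta$ and assuming $\beta^\star>0$. It uses the exponentially tilted kernel $P_\beta$, a symbol introduced here for illustration:

```latex
P_{\beta}(s') \;\propto\; P_0(s')\, e^{-V(s')/\beta},
\qquad
D_{\mathrm{KL}}\big(P_{\beta^\star} \,\Vert\, P_0\big) = \varepsilon
```

In words, at the optimum the worst-case kernel sits exactly on the boundary of the KL ball.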
Interpretation. When the value $V$ is roughly flat across next states, the soft-min is near the mean: a small $\beta$ saturates the KL ball without much value loss, so $\beta^\star$ is small and the policy is robust here. When $V$ has a sharp bad-state outlier, the adversary concentrates mass there, the soft-min collapses toward that outlier, and $\beta^\star$ grows, signalling that the policy is fragile at $(s,a)$.
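The contrast between a flat and an outlier-heavy value function can be checked numerically on a toy example. The sketch below assumes a finite set of next states and finds $\beta^\star$ by bisection on the boundary condition $D_{\mathrm{KL}}(P_\beta\,\Vert\,P_0)=\varepsilon$, where $P_\beta(s')\propto P_0(s')e^{-V(s')/\beta}$ is the tilted kernel from the standard KL dual. The function names (`tilted`, `kl`, `beta_star`, `robust_value`) and all numbers are hypothetical; this is not the authors' implementation.

```python
import math

def tilted(p0, v, beta):
    """Tilted kernel P_beta(s') proportional to p0(s') * exp(-v(s') / beta)."""
    v_min = min(v)  # shift by the minimum so every exponent is <= 0
    w = [p * math.exp(-(x - v_min) / beta) for p, x in zip(p0, v)]
    z = sum(w)
    return [w_i / z for w_i in w]

def kl(q, p):
    """KL divergence D_KL(q || p) for discrete distributions."""
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0.0)

def beta_star(p0, v, eps, lo=1e-6, hi=1e6, iters=200):
    """Bisection in log-space for the beta with KL(P_beta || p0) = eps.

    KL(P_beta || p0) decreases as beta grows, so a KL above eps means the
    temperature is too low. If V is exactly flat, the KL is zero everywhere
    and the search returns the lower bracket `lo`.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if kl(tilted(p0, v, mid), p0) > eps:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def robust_value(p0, v, eps):
    """Worst-case expected value over the KL ball, plus the beta that attains it."""
    b = beta_star(p0, v, eps)
    q = tilted(p0, v, b)
    return sum(q_i * x for q_i, x in zip(q, v)), b

# Hypothetical toy data: four equally likely next states.
p0 = [0.25, 0.25, 0.25, 0.25]
v_flat = [1.0, 1.01, 0.99, 1.0]      # nearly flat values
v_outlier = [1.0, 1.0, 1.0, -9.0]    # one sharp bad-state outlier
eps = 0.05

rv_flat, b_flat = robust_value(p0, v_flat, eps)
rv_out, b_out = robust_value(p0, v_outlier, eps)
print(f"flat:    beta*={b_flat:.4f}, robust value={rv_flat:.4f}")
print(f"outlier: beta*={b_out:.4f}, robust value={rv_out:.4f}")
```

On this toy data the outlier case yields the larger $\beta^\star$, in line with the interpretation above.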
Plugging the dual solution back into the outer max gives the surrogate $L(\pi,\varepsilon)$ used by SPCRL. After adding a self-paced regularizer that pulls $\varepsilon$ toward a target budget $\varepsilon_{\mathrm{budget}}$ and differentiating with respect to $\varepsilon$, the $\varepsilon$-gradient evaluates to
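As a sketch, suppose the surrogate is minimized and has the form $L(\pi,\varepsilon) = -\mathbb{E}_{s\sim d^\pi}[\text{robust value at } s] + \tfrac{\alpha_{\mathrm{SPCL}}}{2}(\varepsilon-\varepsilon_{\mathrm{budget}})^2$. This quadratic form is an assumption chosen to match the spring-like behaviour described later, not a formula taken from this page. Because the standard KL dual objective depends on $\varepsilon$ only through the term $-\beta\varepsilon$, the envelope theorem then gives:

```latex
\frac{\partial L}{\partial \varepsilon}
  = \mathbb{E}_{s\sim d^{\pi}}\big[\beta^\star(s,\pi(s))\big]
    + \alpha_{\mathrm{SPCL}}\,\big(\varepsilon - \varepsilon_{\mathrm{budget}}\big)
```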
where $d^\pi$ is the state visitation under the current policy. A gradient descent step on $L$ in $\varepsilon$ is therefore
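With step size $\eta$, and taking the gradient (as an assumption) to be the mean dual variable plus a quadratic pull of strength $\alpha_{\mathrm{SPCL}}$ toward $\varepsilon_{\mathrm{budget}}$, the step can be sketched as:

```latex
\varepsilon_{k+1}
  = \varepsilon_k
    - \eta\,\Big(
        \mathbb{E}_{s\sim d^{\pi_k}}\big[\beta^\star(s,\pi_k(s))\big]
        + \alpha_{\mathrm{SPCL}}\,\big(\varepsilon_k - \varepsilon_{\mathrm{budget}}\big)
      \Big)
```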
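The two terms have opposite signs: while $\varepsilon$ is below $\varepsilon_{\mathrm{budget}}$, the fragility term driven by $\beta^\star$ holds $\varepsilon$ back, and the self-paced term pulls it up toward the budget.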
Take the small-$\eta$ continuum limit of the update rule. With $\bar\beta^\star(t):=\mathbb{E}_{s\sim d^{\pi_t}}[\beta^\star(s,\pi_t(s))]$, the curriculum obeys a first-order linear ODE driven by the policy's own fragility:
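Under the assumed update $\varepsilon_{k+1} = \varepsilon_k - \eta\,\big(\bar\beta^\star + \alpha_{\mathrm{SPCL}}(\varepsilon_k - \varepsilon_{\mathrm{budget}})\big)$, with time measured in units of $\eta$ and any constant rescaling of the fragility term omitted, the limiting ODE would read:

```latex
\frac{d\varepsilon}{dt}
  = \alpha_{\mathrm{SPCL}}\,\big(\varepsilon_{\mathrm{budget}} - \varepsilon(t)\big)
    - \bar\beta^\star(t)
```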
Solving by integrating factor gives a closed-form scaling law with three terms, all stated directly in $\bar\beta^\star$:
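Assuming the ODE takes the form $\dot\varepsilon = \alpha_{\mathrm{SPCL}}(\varepsilon_{\mathrm{budget}} - \varepsilon) - \bar\beta^\star(t)$ with initial value $\varepsilon(0)=\varepsilon_0$ (a symbol introduced here), multiplying by $e^{\alpha_{\mathrm{SPCL}} t}$ and integrating yields a target term, a decaying transient, and a low-pass term:

```latex
\varepsilon(t)
  = \varepsilon_{\mathrm{budget}}
  + \big(\varepsilon_0 - \varepsilon_{\mathrm{budget}}\big)\, e^{-\alpha_{\mathrm{SPCL}}\, t}
  - \int_0^{t} e^{-\alpha_{\mathrm{SPCL}}\,(t-\tau)}\, \bar\beta^\star(\tau)\, d\tau
```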
The third term is the only one that depends on the agent. It is a causal exponential low-pass of the fragility signal with time constant $1/\alpha_{\mathrm{SPCL}}$: recent values of $\bar\beta^\star$ count more, distant past values fade. While $\bar\beta^\star$ stays high, this term keeps rising and the spring lags — ε climbs slowly. As the policy hardens and $\bar\beta^\star$ shrinks, the low-pass empties out and the spring dominates, pulling ε toward $\varepsilon_{\mathrm{budget}}$ at rate $\alpha_{\mathrm{SPCL}}$. The two regimes you see in the live curve above — slow warm-up then fast climb — come directly from this decomposition; no assumption on the shape of $\bar\beta^\star$ is needed.
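The decomposition can be illustrated with a small simulation. The sketch below assumes the update $\varepsilon \leftarrow \varepsilon + \eta\,[\alpha_{\mathrm{SPCL}}(\varepsilon_{\mathrm{budget}}-\varepsilon) - \bar\beta^\star(t)]$ and a toy fragility signal $\bar\beta^\star(t) = b_0 e^{-t/\tau}$ that decays as the policy hardens; both the update form and the signal shape are assumptions, and every name and number (`simulate_budget`, `closed_form`, `b0`, `tau`, and so on) is hypothetical. For this particular signal the low-pass integral has an elementary closed form, which the code compares against the iterated update.

```python
import math

def simulate_budget(beta_bar, eps0, eps_budget, alpha, eta, steps):
    """Iterate eps <- eps + eta * (alpha * (eps_budget - eps) - beta_bar(t))."""
    eps, path = eps0, [eps0]
    for k in range(steps):
        t = k * eta  # continuous time associated with step k
        eps += eta * (alpha * (eps_budget - eps) - beta_bar(t))
        path.append(eps)
    return path

def closed_form(t, eps0, eps_budget, alpha, b0, tau):
    """Solution of the linear ODE for beta_bar(t) = b0 * exp(-t / tau).

    Requires alpha != 1 / tau so the low-pass integral has this simple form.
    """
    transient = (eps0 - eps_budget) * math.exp(-alpha * t)
    low_pass = b0 * (math.exp(-t / tau) - math.exp(-alpha * t)) / (alpha - 1.0 / tau)
    return eps_budget + transient - low_pass

# Hypothetical settings, chosen only for illustration.
alpha, eps_budget, eps0 = 0.5, 1.0, 0.0
b0, tau = 0.4, 4.0
eta, steps = 0.001, 20000  # simulates up to t = 20

path = simulate_budget(lambda t: b0 * math.exp(-t / tau),
                       eps0, eps_budget, alpha, eta, steps)
final_exact = closed_form(steps * eta, eps0, eps_budget, alpha, b0, tau)
print(f"simulated eps(T) = {path[-1]:.4f}, closed form = {final_exact:.4f}")
```

With these toy settings the budget rises slowly at first, while the fragility signal is large, and then approaches $\varepsilon_{\mathrm{budget}}$ from below as the signal decays.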
Trajectories of the robustness budget ε(t) (blue, linear axis) and the dual variable β(t) (orange, log axis) across all 12 (algorithm, env) cells. ε draws in first, then β layers on top. The self-paced rule discovers a non-trivial schedule for each cell.
Episodic return under perturbation. Interactive: choose a base algorithm, environment, and perturbation type.
Episodic return during training. All 7 methods draw simultaneously across the 3 × 4 (algorithm × environment) grid. DR-SPCRL (teal, dashed) matches or beats every fixed-budget and heuristic-schedule baseline.
@article{satheesh2025distributionally,
title={Distributionally Robust Self Paced Curriculum Reinforcement Learning},
author={Satheesh, Anirudh and Powell, Keenan and Aggarwal, Vaneet},
journal={arXiv preprint arXiv:2511.05694},
year={2025}
}