Distributionally Robust
Self-Paced Curriculum RL

We treat the robustness budget ε as a continuous curriculum that adapts to the agent, yielding a +24.1% robustness gain over fixed or heuristic schedules.

Abstract

A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shifts at deployment. Distributionally Robust Reinforcement Learning (DRRL) optimizes worst-case performance within an uncertainty set of radius ε, but fixing ε exposes a sharp trade-off: small values yield strong nominal performance with weak robustness, while large values destabilize training and produce overly conservative policies.

DR-SPCRL removes this trade-off by treating ε as a continuous curriculum. The robustness budget is adapted online from the agent's own per-state response to perturbation, balancing nominal return against robustness throughout training. Across continuous-control benchmarks, DR-SPCRL improves episodic return under perturbation by an average of 24.1% over fixed or heuristic schedules.

Why Distributional Robustness?

Drag the slider to expand the robustness budget ε. Every world inside the blue ball is a perturbation the policy must survive. Hover any dot to watch the nominal Hopper try, and fail, in that world.


Each dot is one perturbed world (action, observation, or environment noise; the type is labeled on hover). Color shows the nominal policy's return there: red is failing, green is near-nominal. As ε grows, the ball covers more failing worlds: that is the cost of using a non-robust policy.

Why a Curriculum?

Fixing ε forces a sharp trade-off. A small ε preserves nominal performance, but the policy collapses under shift; a large ε destabilizes training. Scroll to watch each strategy draw in.

A continuous curriculum over ε avoids both failure modes: it starts near nominal and only opens the budget as the agent can handle it.

The Self-Paced Curriculum

Instead of fixing ε or annealing it linearly, DR-SPCRL learns it from the agent's own dual variables β, which are derived in the steps that follow.

1. The distributionally robust objective

Let $P_0(\cdot \mid s,a)$ be the nominal transition kernel and let $\mathcal{B}_\varepsilon(P_0)$ be the KL ambiguity ball of radius $\varepsilon$:

$$\mathcal{B}_\varepsilon(P_0)\;=\;\big\{\,P:\;D_{\mathrm{KL}}\!\big(P\,\Vert\,P_0\big)\le\varepsilon\,\big\}.$$

DRRL trains a policy that performs well under the worst transition kernel in this ball:

$$J_\varepsilon(\pi)\;=\;\min_{P\in\mathcal{B}_\varepsilon(P_0)}\;\mathbb{E}_{\,\tau\sim(\pi,P)}\!\left[\sum_{t=0}^{\infty}\gamma^t\,r(s_t,a_t)\right], \qquad \pi^\star \in \arg\max_\pi J_\varepsilon(\pi).$$

$\varepsilon$ controls the strength of the adversary. Larger $\varepsilon$ admits more distorted dynamics, yielding more conservative but more robust policies; smaller $\varepsilon$ recovers the nominal objective.

2. Dualising the inner minimum: where β comes from

Fix a state-action pair $(s,a)$ and a value function $V$ produced by the current policy. The one-step DR Bellman backup faces the inner problem $\min_{P}\;\mathbb{E}_{s'\sim P}\!\big[V(s')\big]$ subject to $D_{\mathrm{KL}}(P\,\Vert\,P_0)\le\varepsilon$. Forming the Lagrangian and applying the Donsker–Varadhan / Legendre identity for KL-constrained expectations gives the closed form

$$\min_{P\in\mathcal{B}_\varepsilon(P_0)}\!\mathbb{E}_{s'\sim P}\!\big[V(s')\big] \;=\;\max_{\beta\ge 0}\;\Big\{\, {-}\beta\,\log\,\mathbb{E}_{s'\sim P_0}\!\big[\exp(-V(s')/\beta)\big]\;-\;\beta\,\varepsilon \,\Big\}.$$

The dual variable $\beta(s,a)\ge 0$ is the Lagrange multiplier on the KL constraint and acts as a state-dependent temperature: it turns the worst-case min over $s'$ into a soft-min (log-sum-exp) over next-state values. The optimal $\beta^\star(s,a)$ satisfies the stationarity condition
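The dual form above can be evaluated numerically for a small discrete next-state distribution. The sketch below is illustrative only: the function names `dual_objective` and `robust_backup`, the search range for $\beta$, and the example numbers are assumptions, not part of DR-SPCRL. It assumes a strictly positive nominal distribution and uses a ternary search, which is valid because the dual is concave (hence unimodal) in $\beta$.

```python
import math

def dual_objective(beta, values, p0, eps):
    """g(beta) = -beta * log E_{P0}[exp(-V/beta)] - beta * eps.

    The minimum value is factored out of the exponent so that every
    exponential argument is <= 0 and cannot overflow.
    """
    v_min = min(values)
    z = sum(p * math.exp(-(v - v_min) / beta) for p, v in zip(p0, values))
    return v_min - beta * math.log(z) - beta * eps

def robust_backup(values, p0, eps, log_lo=-6.0, log_hi=6.0, iters=200):
    """Maximize the dual over beta in [10**log_lo, 10**log_hi].

    Ternary search on log10(beta); the search range is an assumption
    chosen for this toy example.
    """
    for _ in range(iters):
        m1 = log_lo + (log_hi - log_lo) / 3.0
        m2 = log_hi - (log_hi - log_lo) / 3.0
        if dual_objective(10.0 ** m1, values, p0, eps) < dual_objective(10.0 ** m2, values, p0, eps):
            log_lo = m1
        else:
            log_hi = m2
    beta = 10.0 ** (0.5 * (log_lo + log_hi))
    return dual_objective(beta, values, p0, eps), beta

if __name__ == "__main__":
    # Hypothetical next-state values and nominal probabilities.
    values = [1.0, 0.9, 1.1, -2.0]
    p0 = [0.4, 0.3, 0.25, 0.05]
    nominal = sum(p * v for p, v in zip(p0, values))
    for eps in (0.0, 0.1, 0.5):
        v_rob, beta = robust_backup(values, p0, eps)
        print(f"eps={eps:.1f}  nominal={nominal:.3f}  robust={v_rob:.3f}  beta*={beta:.3g}")
```

For $\varepsilon = 0$ the search runs to the upper end of the $\beta$ range and the backup approaches the nominal expectation; as $\varepsilon$ grows, the robust value falls toward the worst next-state value.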

$$\varepsilon \;=\; -\log\,\mathbb{E}_{s'\sim P_0}\!\big[\exp(-V(s')/\beta^\star)\big] \;-\;\frac{\mathbb{E}_{s'\sim P_0^\star}\!\big[V(s')\big]}{\beta^\star} \;=\; D_{\mathrm{KL}}\!\big(P_0^\star\,\Vert\,P_0\big), \qquad P_0^\star \propto P_0\,\exp(-V/\beta^\star).$$

Interpretation. When the value $V$ is roughly flat across next states, the soft-min stays near the mean: a small $\beta$ saturates the KL ball without much value loss, so $\beta^\star$ is small and the policy is robust at $(s,a)$. When $V$ has a sharp bad-state outlier, the adversary concentrates mass there, the soft-min collapses toward that outlier, and $\beta^\star$ grows: the policy is fragile at $(s,a)$.
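The flat-versus-outlier contrast can be checked on a toy example. The sketch below finds $\beta^\star$ by bisection on the condition that the exponentially tilted distribution $P_0\exp(-V/\beta)$, once normalized, sits exactly on the boundary of the KL ball. The helper names, the search range, and the numbers are hypothetical, and the sketch assumes $V$ is not constant and that $\varepsilon$ is small enough for the constraint to be active.

```python
import math

def tilted(values, p0, beta):
    """Worst-case distribution P ∝ P0 * exp(-V / beta), computed stably."""
    v_min = min(values)
    w = [p * math.exp(-(v - v_min) / beta) for p, v in zip(p0, values)]
    z = sum(w)
    return [x / z for x in w]

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def beta_star(values, p0, eps, log_lo=-6.0, log_hi=6.0, iters=100):
    """Bisection on log10(beta): KL(tilted || P0) decreases as beta grows."""
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if kl(tilted(values, p0, 10.0 ** mid), p0) > eps:
            log_lo = mid  # tilt too strong: beta must be larger
        else:
            log_hi = mid  # tilt too weak: beta must be smaller
    return 10.0 ** (0.5 * (log_lo + log_hi))

if __name__ == "__main__":
    p0 = [0.25, 0.25, 0.25, 0.25]
    flat = [1.0, 1.01, 0.99, 1.0]       # nearly constant next-state values
    outlier = [1.0, 1.01, 0.99, -2.0]   # one sharp bad state
    eps = 0.05
    print("beta* (flat)   :", beta_star(flat, p0, eps))
    print("beta* (outlier):", beta_star(outlier, p0, eps))
```

Under these assumptions the outlier case returns a much larger $\beta^\star$ than the flat case, matching the reading of $\beta^\star$ as a fragility signal.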

Simulated $\beta^\star(s,a)$ along a rollout. Tall red bars are fragile states (sharp worst-case); short bars are robust states.

3. Self-paced update: differentiating the dual in $\varepsilon$

Plugging the dual solution back into the outer max gives the surrogate $L(\pi,\varepsilon)$ used by DR-SPCRL. Adding a quadratic self-paced regularizer $\tfrac{\alpha_{\mathrm{SPCL}}}{2}(\varepsilon-\varepsilon_{\mathrm{budget}})^2$ that pulls $\varepsilon$ toward a target budget $\varepsilon_{\mathrm{budget}}$, the $\varepsilon$-gradient evaluates to

$$\partial_\varepsilon L \;=\; -\;\mathbb{E}_{s\sim d^\pi}\!\big[\beta^\star(s,\pi(s))\big] \;+\; \alpha_{\mathrm{SPCL}}\,(\varepsilon-\varepsilon_{\mathrm{budget}}),$$

where $d^\pi$ is the state visitation under the current policy. A gradient descent step on $L$ in $\varepsilon$ is therefore

$$\boxed{\;\; \varepsilon \;\leftarrow\; \varepsilon \;+\;\eta\,\Big( \mathbb{E}_{s\sim d^\pi}\!\big[\beta^\star(s,\pi(s))\big] \;-\;\alpha_{\mathrm{SPCL}}\,(\varepsilon-\varepsilon_{\mathrm{budget}}) \Big).\;\;}$$

The two terms have opposite signs:

  • $+\mathbb{E}[\beta^\star]$: small when the policy is robust everywhere, large when many visited states are fragile. Because $\beta^\star\ge 0$, this term never shrinks the ball; it sets how strongly the agent's own dual variables push $\varepsilon$ upward.
  • $-\alpha_{\mathrm{SPCL}}(\varepsilon-\varepsilon_{\mathrm{budget}})$: a linear spring centred at $\varepsilon_{\mathrm{budget}}$ that guarantees the schedule eventually reaches the target even if $\mathbb{E}[\beta^\star]$ stays modest.
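
The boxed update is a one-line recursion once an estimate of $\mathbb{E}[\beta^\star]$ is available. The sketch below is a minimal illustration, not the authors' implementation: the function names and all numeric settings are assumptions, and the mean dual variable is passed in as a plain number rather than estimated from rollouts.

```python
def update_epsilon(eps, mean_beta, eta, alpha, eps_budget):
    """One step of eps <- eps + eta * (mean_beta - alpha * (eps - eps_budget))."""
    return eps + eta * (mean_beta - alpha * (eps - eps_budget))

def run_schedule(mean_betas, eps0, eta, alpha, eps_budget):
    """Apply the update once per entry of mean_betas and record the trajectory."""
    eps, history = eps0, [eps0]
    for mean_beta in mean_betas:
        eps = update_epsilon(eps, mean_beta, eta, alpha, eps_budget)
        history.append(eps)
    return history

if __name__ == "__main__":
    # Hypothetical settings: a constant mean dual variable for 5000 updates.
    history = run_schedule([0.2] * 5000, eps0=0.0, eta=0.01, alpha=0.5, eps_budget=1.0)
    print("final eps:", history[-1])
```

With a constant signal $\bar\beta$ and $\eta\,\alpha_{\mathrm{SPCL}} < 1$, the recursion contracts to the fixed point $\varepsilon_{\mathrm{budget}} + \bar\beta/\alpha_{\mathrm{SPCL}}$, so the schedule settles exactly at $\varepsilon_{\mathrm{budget}}$ only once the mean dual variable has decayed to zero.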

A scaling law for the curriculum

Take the small-$\eta$ continuum limit of the update rule. With $\bar\beta^\star(t):=\mathbb{E}_{s\sim d^{\pi_t}}[\beta^\star(s,\pi_t(s))]$, the curriculum obeys a first-order linear ODE driven by the policy's own fragility:

$$\frac{d\varepsilon}{dt} \;=\; \bar\beta^\star(t) \;-\;\alpha_{\mathrm{SPCL}}\,(\varepsilon - \varepsilon_{\mathrm{budget}}).$$

Solving with an integrating factor gives a closed-form scaling law with three terms, in which the agent enters only through $\bar\beta^\star$:

$$\boxed{\;\; \varepsilon(t) \;=\; \underbrace{e^{-\alpha_{\mathrm{SPCL}} t}\,\varepsilon(0)}_{\text{initial-state decay}} \;+\;\underbrace{\varepsilon_{\mathrm{budget}}\big(1 - e^{-\alpha_{\mathrm{SPCL}} t}\big)}_{\text{spring relaxation to target}} \;+\;\underbrace{\int_{0}^{t} e^{-\alpha_{\mathrm{SPCL}}(t-s)}\,\bar\beta^\star(s)\,ds}_{\text{exponential low-pass of fragility}}. \;\;}$$

The third term is the only one that depends on the agent. It is a causal exponential low-pass of the fragility signal with time constant $1/\alpha_{\mathrm{SPCL}}$: recent values of $\bar\beta^\star$ count more, distant past values fade. While $\bar\beta^\star$ stays high, this term stays charged and holds ε away from the spring's rest point $\varepsilon_{\mathrm{budget}}$. As the policy hardens and $\bar\beta^\star$ shrinks, the low-pass empties out and the spring dominates, pulling ε toward $\varepsilon_{\mathrm{budget}}$ at rate $\alpha_{\mathrm{SPCL}}$. The two regimes visible in the live curve (slow warm-up, then fast climb) come directly from this decomposition; no assumption on the shape of $\bar\beta^\star$ is needed.
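
The closed form can be checked against the discrete update rule for a signal whose low-pass integral is known. The sketch below assumes a synthetic $\bar\beta^\star(t) = b_0 e^{-kt}$ with $k \ne \alpha_{\mathrm{SPCL}}$, for which the third term equals $b_0\,(e^{-kt} - e^{-\alpha_{\mathrm{SPCL}} t})/(\alpha_{\mathrm{SPCL}} - k)$, and treats each update as advancing time by $\eta$. The signal shape, function names, and parameter values are all illustrative assumptions.

```python
import math

def eps_closed_form(t, eps0, eps_budget, alpha, b0, k):
    """Three-term solution for beta_bar(t) = b0 * exp(-k * t), with k != alpha."""
    decay = math.exp(-alpha * t)
    lowpass = b0 * (math.exp(-k * t) - decay) / (alpha - k)
    return decay * eps0 + eps_budget * (1.0 - decay) + lowpass

def eps_discrete(t, eps0, eps_budget, alpha, b0, k, eta=1e-4):
    """Iterate the update rule; step n corresponds to continuous time n * eta."""
    eps = eps0
    for n in range(int(round(t / eta))):
        beta_bar = b0 * math.exp(-k * n * eta)
        eps += eta * (beta_bar - alpha * (eps - eps_budget))
    return eps

if __name__ == "__main__":
    # Hypothetical parameters for the comparison.
    params = dict(eps0=0.0, eps_budget=1.0, alpha=0.8, b0=0.5, k=0.3)
    for t in (1.0, 5.0, 20.0):
        exact = eps_closed_form(t, **params)
        approx = eps_discrete(t, **params)
        print(f"t={t:5.1f}  closed form={exact:.4f}  update rule={approx:.4f}")
```

The two columns should agree up to a discretization error that shrinks with $\eta$, which is the sense in which the ODE is the small-$\eta$ limit of the update rule.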

Simulated $\varepsilon(t)$ trajectory under DR-SPCRL with the chosen $\alpha_{\mathrm{SPCL}}$ and $\eta$, integrating the update rule above against a synthetic $\mathbb{E}[\beta^\star]$ that decays as the policy hardens. Drag the sliders to retune.

Trajectories of the robustness budget ε(t) (blue, linear axis) and the dual variable β(t) (orange, log axis) across all 12 (algorithm, env) cells. ε draws in first, then β layers on top. The self-paced rule discovers a non-trivial schedule for each cell.

Robustness Results

Episodic return under perturbation. Interactive: choose a base algorithm, environment, and perturbation type.


Training Returns

Episodic return during training. All 7 methods draw simultaneously across the 3 × 4 (algorithm × environment) grid. DR-SPCRL (teal, dashed) matches or beats every fixed-budget and heuristic-schedule baseline.

BibTeX

@article{satheesh2025distributionally,
  title={Distributionally Robust Self Paced Curriculum Reinforcement Learning},
  author={Satheesh, Anirudh and Powell, Keenan and Aggarwal, Vaneet},
  journal={arXiv preprint arXiv:2511.05694},
  year={2025}
}