Ensuring the safety of reinforcement learning (RL) policies in high-stakes environments requires more than formal verification: it also demands interpretability and targeted falsification, the deliberate search for counterexamples that expose potential failures before deployment. We present AEGIS-RL (Abstract, Explainable Graphs for Integrated Safety in RL), a hybrid framework that unifies (1) explainable RL, (2) probabilistic model checking, and (3) risk-guided falsification, and augments them with (4) a lightweight runtime safety shield that switches to a fallback policy when estimated risk exceeds a threshold. AEGIS-RL first builds a directed, semantically meaningful graph from offline trajectories, blending local and global explanations to make policy behavior both transparent and amenable to verification. This abstract graph is passed to a probabilistic model checker (e.g., Storm) to verify temporal safety specifications; when violations exist, the checker returns interpretable counterexample traces that pinpoint how the policy fails. When specifications appear satisfied, AEGIS-RL estimates residual risk during checking and uses it to steer falsification toward high-risk, under-explored states, broadening coverage beyond the offline data. Across safety-critical benchmarks, including two MuJoCo tasks and a medical insulin-dosing scenario, AEGIS-RL uncovers significantly more violations than uncertainty- and fuzzing-based baselines and yields a broader, more novel set of failure trajectories. The resulting explanations and counterexamples provide actionable guidance for understanding, debugging, and repairing unsafe policies, while the shield enables runtime mitigation without retraining.
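To make the runtime safety shield concrete, the following is a minimal sketch of the switch-to-fallback logic described above. The names `policy`, `fallback_policy`, `risk_fn`, and `threshold` are illustrative assumptions, not part of AEGIS-RL's actual interface; the sketch only shows the general pattern of deferring to a fallback policy whenever the estimated risk of the proposed action exceeds a threshold.

```python
from typing import Any, Callable

Observation = Any
Action = Any


class SafetyShield:
    """Illustrative runtime shield: defer to a fallback policy when estimated risk is too high.

    This is a hypothetical sketch, not the AEGIS-RL implementation.
    """

    def __init__(
        self,
        policy: Callable[[Observation], Action],
        fallback_policy: Callable[[Observation], Action],
        risk_fn: Callable[[Observation, Action], float],  # assumed risk estimator
        threshold: float,
    ) -> None:
        self.policy = policy
        self.fallback_policy = fallback_policy
        self.risk_fn = risk_fn
        self.threshold = threshold

    def act(self, obs: Observation) -> Action:
        # Query the learned policy for its proposed action.
        proposed = self.policy(obs)
        # If the estimated risk of executing that action exceeds the threshold,
        # switch to the fallback policy for this step (runtime mitigation, no retraining).
        if self.risk_fn(obs, proposed) > self.threshold:
            return self.fallback_policy(obs)
        return proposed
```

Because the shield wraps the learned policy at execution time, it can be applied to an already-trained agent, which is what allows mitigation without retraining.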