Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback

2606.24622

Authors

Andreas Chouliaras, Luke Connolly, Dimitris Chatzpoulos

Abstract

Training safe Reinforcement Learning (RL) systems is inherently challenging, with no guarantee of avoiding unwanted behaviors. The most effective defenses against such behaviors are (i) transparency through explainability and (ii) alignment via human feedback.

While both show promising results, no publicly available framework currently combines them. To address this, we introduce Themis, an XAI-enabled testing and evaluation framework for Reinforcement Learning from Human Feedback.

Themis supports over 200 widely used environments and is easily configurable for experiments in RL, transparency, and alignment. Our results show that Themis can train reward models that match or outperform the environment's true reward signal using human preferences.

We also provide a cloud-based platform for collecting human feedback and managing experiments. It is user-friendly, auto-scalable, and supports large participant groups across multiple experiments without extra development overhead.

Tests show Themis can support one thousand users in back-to-back experiments on a modest commercial machine.
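
The abstract does not say how Themis turns human preferences into a reward model. As a rough illustration only, the sketch below uses the Bradley-Terry formulation that is common in preference-based RL: an annotator compares two trajectory segments, and the model is trained so that the preferred segment receives the higher predicted return. The linear reward, the function names, and the hyperparameters are all assumptions made for this example and are not taken from Themis.

```python
import numpy as np


def segment_return(weights, segment):
    """Predicted return of a segment: sum of linear per-step rewards w . s_t."""
    return float(np.sum(segment @ weights))


def preference_probability(weights, seg_a, seg_b):
    """Bradley-Terry probability that segment A is preferred over segment B."""
    diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
    return 1.0 / (1.0 + np.exp(-diff))


def train_reward_model(pairs, labels, dim, lr=0.1, epochs=200):
    """Fit linear reward weights by gradient descent on the cross-entropy loss.

    pairs  : list of (seg_a, seg_b), each an array of shape (T, dim)
    labels : 1.0 if the annotator preferred seg_a, 0.0 if they preferred seg_b
    (Hypothetical interface, chosen for illustration.)
    """
    weights = np.zeros(dim)
    for _ in range(epochs):
        grad = np.zeros(dim)
        for (seg_a, seg_b), label in zip(pairs, labels):
            p = preference_probability(weights, seg_a, seg_b)
            # Gradient of the cross-entropy loss with respect to the weights:
            # (p - label) * (summed features of A - summed features of B)
            grad += (p - label) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
        weights -= lr * grad / len(pairs)
    return weights
```

To try it, one could generate random segment pairs, label each pair with a known "true" weight vector standing in for the human annotator, and check how often the learned weights rank the pairs the same way. A practical system would typically replace the linear model with a neural network and feed the learned reward to an RL algorithm.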
