Fair and Calibrated Toxicity Detection with Robust Training and Abstention

Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter.

We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs ($n = 1000$).

We report four findings. (1) Calibration disparity is a hidden fairness violation.

ERM has near-perfect aggregate calibration ($0.013$) but is significantly miscalibrated across all identity subgroups ($+0.029$ to $+0.134$). (2) Training interventions reshape rather than eliminate disparity.

Reweighted ERM improves ranking (BPSN AUC $+0.06$ to $+0.12$) but worsens the calibration-fairness gap by up to $+0.232$. Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated globally (ECE $0.118$).

(3) Post-hoc methods inherit training failure modes. Temperature scaling fails because miscalibration is non-uniform.

Confidence-based abstention works under ERM but breaks under DRO, where the risk-coverage curve rises with deferral. (4) Abstention itself is unfair.

Confidence-based deferral helps background content far more than identity-mentioning content. We argue that SRAI fairness requires a multi-axis framework: methods that differ only in aggregate ranking can differ sharply in failure modes that determine real-world harm.

Fair and Calibrated Toxicity Detection with Robust Training and Abstention

Authors

Abstract

Resources

Stay in the loop

Pages

Tools

Details