Modern AI systems are increasingly, and perhaps alarmingly, exceeding human performance in domains such as competition mathematics and coding (UK AISI, 2025). AI agents can now independently complete software engineering tasks that would require hours of complex reasoning from humans. As AI capability and agency grow, designing reliable mechanisms to align AI systems, ensuring they act consistently with human values even when unmonitored, becomes ever more urgent. Yet AI alignment remains poorly understood.