We stress-test the monitorability of chain-of-thought reasoning in language models, investigating whether reasoning models can obfuscate their reasoning processes.

Read the paper on arXiv →

Paper accepted at the NeurIPS 2025 FoRLM workshop. Findings were used by Anthropic in Claude Opus 4.6 safety evaluations and cited in OpenAI's Monitoring Monitorability paper.