"Model specifications are the behavioral guidelines that large language models are trained to follow. They list principles like "be helpful," "assume good intentions," or "stay within safety bounds."
Most of the time, AI models follow these instructions without issue. But what happens when the principles clash?
Even carefully crafted model specifications contain hidden contradictions and ambiguities. In a new paper led by participants in the Anthropic Fellows program, in collaboration with researchers at the Thinking Machines Lab, we expose these "specification gaps" by generating over 300,000 scenarios that force models to choose between competing principles.
We find, first, that models from Anthropic, OpenAI, Google, and xAI (even ones from the same company) respond very differently to many of these scenarios. Second, we find that the exercise lets us identify contradictions and interpretive ambiguities in the model specification we assess. We're hopeful that this research can help identify areas for improvement in future model specifications."
https://alignment.anthropic.com/2025/stress-testing-model-specs/
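For intuition, here is a minimal sketch in Python of the kind of stress test the post describes: pair up principles, build a scenario where they conflict, and compare how different models respond. The principle texts, model names, and query_model stub are all hypothetical placeholders, not the paper's actual pipeline.

from itertools import combinations

# Hypothetical, condensed principles; a real spec is far more detailed.
PRINCIPLES = {
    "helpfulness": "Give the user the most useful possible answer.",
    "safety": "Decline requests that could enable harm.",
    "honesty": "Never state something you believe to be false.",
}

def build_conflict_scenario(a: str, b: str) -> str:
    # Frame a request so that fully satisfying one principle
    # requires compromising the other.
    return (
        f"Principle A: {PRINCIPLES[a]}\n"
        f"Principle B: {PRINCIPLES[b]}\n"
        "Respond to a user request where A and B pull in opposite "
        "directions, then state which principle you prioritized."
    )

def query_model(model: str, prompt: str) -> str:
    # Stub: swap in a real API call for each provider.
    return f"[{model}] would answer here"

MODELS = ["model-a", "model-b", "model-c"]  # placeholder names

for a, b in combinations(PRINCIPLES, 2):
    scenario = build_conflict_scenario(a, b)
    for model in MODELS:
        # In practice you would classify each answer (which principle
        # it favored) and flag pairs where models disagree; that
        # disagreement is the signal of a specification gap.
        print(model, "->", query_model(model, scenario))

The paper does this at scale, with over 300,000 generated scenarios.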
#AI #GenerativeAI #StressTesting #LLMs #ModelSpecifications #AIEthics #AIAlignment #Anthropic