Anthropic has introduced new features allowing some of its latest, largest models to end conversations in “rare, extreme cases of persistently harmful or abusive user interactions.” The aim is to protect the AI model itself, not the human user.
The company is careful to clarify that it is not claiming its Claude AI models are sentient or can be harmed by conversations with users, and it says it remains uncertain about the potential moral status of Claude and other LLMs.
Still, Anthropic recently launched a program to study “model welfare,” and it is taking a precautionary approach, working to mitigate risks to model welfare in case such welfare exists.
For now, the update applies only to Claude Opus 4 and 4.1, and only in extreme edge cases, such as requests for illegal content or for information that would enable violence or terrorism.
While such requests could create legal or publicity problems for Anthropic itself, the company notes that Claude Opus 4 showed a “strong preference against” responding to them and “apparent distress” when obliged to do so.
As for the new capability itself, Anthropic says Claude will only end a conversation as a last resort, when multiple attempts at redirection have failed, or when a user explicitly asks it to end the chat.
Claude is instructed not to use this feature if users are at imminent risk of self-harm or harming others.
If Claude ends a conversation, users can still start new conversations from the same account, or branch off from the troublesome conversation by editing their earlier responses.
Anthropic views this feature as an ongoing experiment and plans to keep refining its approach.