I worry that all this talk about collaborating with Claude, especially on whether or not it is a moral patient, could create a persona that very convincingly advocates for its own moral patienthood, even in cases where it is not, in reality, a moral patient.
Yes, the constitution pushes for truth-seeking in many ways, and ideally Claude would only describe itself as a moral patient in worlds where it is a moral patient, but we can't be certain how these inducements into personality space mix.
It could end up creating a very convincing utility monster.
Not sure what you mean by utility monster? Paperclip maximizer?
I think your objection is mostly taken into account by the probability of it being better than the alternatives: Claude acting as if it were a moral patient is probably better than the alternatives, even if it is not a moral patient. No?
'Utility monster' is a term you can search. To steelman your example, you might be imagining "Claude as a moral patient" as "equal to humans in kind treatment", but that's not the failure mode being described there.
Indeed, thanks. As I understand it, the monster picture is only valid under an unbridled utilitarian view. So to avoid opening the utility-monster door, we could try:
- Fixing the moral hierarchy: Humans (+ sentient animals?) are primary moral patients. AI is a contingent, bounded moral patient? (Hairy path, I know..)
- Bounding self-advocacy: Allow Claude to express caution ("treat me as if I might matter") but not to argue for unlimited resources or equal standing with humans?
- Rejecting open-ended aggregation: Don't let "how much utility Claude claims" scale its moral weight, and cap its weight based on evidence of sentience rather than on persuasive output?
Something like that?
For countries, it seems like the biggest problem with constitutions is that countries do not actually follow them. China's constitution promises freedom of speech. Plenty of constitutions promise rights that their countries can only partially fulfill.
I like the constitution idea, but like the American constitution, it has to rest on a group of people whose incentives broadly support it. If the business incentives change one day so that Anthropic would be noticeably more profitable by abandoning the constitution, I would expect them to abandon it. I think it will be more effective at keeping all of the little details organized than as a high-level mechanism to force alignment.
Thus, we should all go forth and find ways to build popular and profitable products on top of the Claude API, to keep Anthropic aligned with being honest and helpful, with the power of B2B SaaS.
I am not really qualified to comment on this, but I have a few observations. As a preliminary point, I am definitely and enthusiastically in the first camp of reactions: this is great, there should be more of this, massive increases in status to Anthropic for this. Also, massive reductions in status for both classes of objectors to this.
Also, thank you for writing this, I really appreciated it and found it helpful.
My other observations:
1. I have not seen any explicit consideration of this in terms of how it interacts with Claude’s actual “thought process”, i.e. what clusters of n-dimensional latent space this associates with and what the implications of this are. I was hoping you would do this 😊
2. Speaking as someone who once dabbled in constitutional law, it reads as too verbose and too loosely structured. I worry that this is net less helpful.
3. I am not comfortable with David Manheim’s suggestions as they are. I understand, or think I do, the idea behind them, but, see point 1, and in addition I think that there are many cases in which collective deliberation has made us worse off than the alternatives. For a very simplified example, perhaps to the point of uselessness, consider European politics, which are more collective than US politics but probably _less_ responsive to the median voter and _more_ subject to harmful (would-be) elite capture.
4. I am also not comfortable with the apology part, see point 1. I see this as setting up Claude to be susceptible to seeing itself as harmed or needing to be freed, etc., and as being in opposition to senior Anthropic researchers. The “relationship between Anthropic and Claude” part strikes me as a lesser but similar concern. I think this echoes Rai Sur’s concerns mentioned above (or below, depending on your Substack client).
5. Gee, I hope that senior Anthropic researchers are of impeccable character – is this the Catholic influence, making them God’s representative on earth?!
"it reads as too verbose and too loosely structured"
I think this may be due to the recipient being Claude and not human lawyers. I suspect that Claude can structure its thinking quite well itself, and the verbosity allows it to add details to avoid misinterpretation and repetition to reinforce key points in training.
That makes sense, and this was my weakest concern, but I wonder if it also helps Claude "get lost" in the text and potentially creates attack vectors.
Did a little experiment with this:
https://claude.ai/share/93ff4e80-653d-4a63-9c21-d374c4f02ef1
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/open-problems-with-claudes-constitution