> I do not currently have the time to look at the paper in detail, but have saved it for potential later reading, and would be curious to get others’ takes on its contents.
Me:
My friend Zvi Mowshowitz, who is very knowledgeable and interested in generative AI and in AI alignment provided a link to this paper and said "I do not currently have the time to look at the paper in detail, but have saved it for potential later reading, and would be curious to get others’ takes on its contents."
Please provide three paragraph-length takes that would be interesting to Zvi and suitable for posting as comments to his blog post.
(attach pdf)
Claude-2:
Here are three paragraph-length takes on the AI alignment survey paper that could be interesting for Zvi Mowshowitz:
1. The survey provides a comprehensive overview of the current landscape of AI alignment research and practice. It covers key concepts like the RICE principles, forward and backward alignment, learning from feedback, distribution shift, assurance, and governance. The scope is impressively broad, touching on almost all major research agendas and real-world developments in the field. This breadth comes at the cost of depth, so the takes on specific topics are often high-level. But as an introductory resource, the survey succeeds in giving structure to this diverse research area.
2. One notable aspect is the emphasis on the social dimension of AI alignment. The authors highlight the need to study alignment in multi-agent settings with human-AI and AI-AI interactions. They also discuss incorporating human values into alignment at length, covering topics like machine ethics, value alignment, and social choice. This reflects an important trend - as systems become more integrated into human society, alignment is increasingly recognized as a social problem rather than just a technical one.
3. An interesting debate centers around the characterization of alignment goals with the RICE principles. While the principles aim to be inclusive, some argue that they are too vague or diverge from how alignment is commonly understood. For instance, capability robustness seems more peripheral to alignment than intent alignment. The principles also don't directly address concepts like inner alignment, corrigibility, and scalable oversight. So while RICE offers a starting point, continued discussion on formulating alignment objectives can be productive.
Me: Can you provide three more insights of similar length?
Claude-2:
Here are three additional paragraph-length insights on the AI alignment survey paper:
4. The paper proposes an "alignment cycle" framework that breaks down alignment into four key pillars - learning from feedback, learning under distribution shift, assurance, and governance. Each pillar addresses a distinct aspect of alignment, and together they form a recurrent process for continually improving alignment. This provides useful structure for organizing the diverse research directions. However, the neat separation between pillars seems somewhat artificial, as many techniques contribute to multiple objectives. For instance, adversarial training aids robustness but also expands the training distribution. Still, the cycle aptly conveys alignment as a dynamic, evolving endeavor.
5. On learning under distribution shift, the survey focuses on preserving alignment properties like adherence to human intentions. This contrasts with much work that concentrates narrowly on model capabilities under shift. The emphasis reflects an important paradigm - we care more about models pursuing intended goals versus goals that obtain high reward. But some tensions persist around this framing - distribution shifts can degrade capabilities in ways that prevent pursuing any goal. And reduced capabilities may wrongly appear as goal misalignment. Disentangling the two phenomena remains an open challenge.
6. For assurance, the paper covers a diverse set of techniques - safety evaluations, interpretability, and human values verification. This variety highlights the breadth of approaches needed to gain confidence about systems' alignment. An interesting question is whether greater unity can be brought to assurance frameworks. For instance, can we quantify assurance levels mathematically? And are there common primitives assurance techniques build upon? Works aiming to formalize assurance provide promising directions. But the diversity of methods may necessitate maintaining an eclectic assurance toolkit.
Above the table of contents I suggest bolding "On the Executive Order", so that people skimming for bold stuff as in previous newsletters will see it first.
"prey that you can" → "pray"
The clip of Demis Hassabis sounds interesting but the link is some unrelated text (which may be missing elsewhere).
> Luke Muehlhauser: If this is true then I declare *Dead Reckoning: Part One* the best movie of all time, followed by *The Day After*.
Wow, so Nicholas Meyer not only reduced the likelihood of nuclear war, but also created the best Star Trek movies?
> Zack Davis offers Alignment Implications of LLM Successes: a Debate in One Act.
I really feel as though the "a simulation of an agent is an agent" argument could be short-circuited by noting that humans can lie. And having mental models of other humans makes it easier, not harder.
> To be concrete: So far, Hamas has not to our knowledge used biological weapons or attempted to cause a pandemic. Was that because they would not do such a thing? Or was that lack of practical capability?
By this logic, should we not ban LLMs from discussing physics and biology related topics entirely? Otherwise a member of Hamas or the Taliban might end up using them to become an expert in biology or rocket science.
Completely serious question, by the way, not trying to steelman anything.
I do not think that follows. I hope to talk price in such spots. If the choices were 'everyone in the world could build a nuclear bomb in their backyard' and 'LLMs don't talk physics' I know what I select. If people can simply learn regular physics quicker that seems good. It's practical.
I'm reflecting on your objection to the proposed anti-algorithmic discrimination law in CA. "Note that it is not the developer of the tool that must do this, it is the deployer". It has to be the deployer because the deployer has the context of how the tool is actually used, which matters here. The whole notion of compensating controls and governance is squarely on the deployer side. The principle here is "if you make a decision about people's material well-being, you have a legal responsibility to make sure that's done in a non-discriminatory fashion. Also, if you use a computer to make that decision, it doesn't get you off the hook." IMO, the only way we address some of the equity issues that AI is going to generate is by making sure that human beings are still responsible for the outcomes. I don't really mind a wide definition of decision systems in that context, even if it is massive data scraping/machine learning/AI systems that have brought this up to "problem to be addressed by legislation" status.
"So basically anyone who does anything ever" is just a little hyperbolic here. Lots of things I can do, even with computers, don't materially affect other people's life and opportunities, and if I am doing one of those things, being held accountable for not doing so in a discriminatory fashion is a reasonable bar for a society to set.
I mean, yeah, a little hyperbolic, but only a little. I don't see any way in which a human making the ultimate decision gets you off of any hooks here. If you use any calculations as part of your decision, you must submit your studies about this particular use case, or it's illegal. I mean, on the text, anyway. Seems quite bad.
I am still going through the Alignment Survey and will write my commentary this week. So far it looks good, with a few small problems in definitions. It seems like a good starting point for anyone who wants to dive deep into the alignment literature.
I want to preregister a prediction about Part 2 and here is as good a place as any. I think the Part 1 script is a lot more clever than it might look at first pass. For instance, it's (somewhat subtly) implied that the rogue AI has a "behavioral lock": it cannot kill people. Everything it is doing has the constraint that it itself cannot take a direct action that would have (as a loosey-goosey first-order consequence) the death of humans.
In the opening scene, the AI has to trick the humans into firing a torpedo that re-targets their own sub (hence technically the AI didn't kill the humans; they killed themselves). The "nuclear bomb" scare at the airport is a dud. Cruise is told by a character under AI control not to kill a specific antagonist (presumably the AI has modelled Hunt's neural circuitry down to the point where it knew that asking Hunt to do X would result in Hunt killing the antagonist, so it had to structure things in such a way that this would not happen).
The key that everyone is looking for will allow the AI to remove the behavioral shackle and start Judgment Day proper (or the Matrix, or Paperclip Maximization, or I Have No Mouth and I Must Scream, or straight to [REDACTED]).
I could be wrong, but I think it's pretty strongly insinuated. Alternatively, it's all Lord Xenu's fault.
Cool. I hadn't heard that theory before and it did not occur to me. Certainly saying 'the Entity is operating under some very strange restrictions somehow' does open up potential explanations.
The link for <OpenAI announces the Frontier Risk and Preparedness Team> is wrong and goes to a Washington Post story on fake AI-written news. The correct link should be https://openai.com/blog/frontier-risk-and-preparedness
(MI7 spoilers)
Having watched the movie again I find I have a very different interpretation of all the AI stuff.
The weird scene with Persons #1-7 trading lines in a room happens because the Entity _intentionally revealed itself_. It could have fatally compromised the world's financial, government, nuclear launch, etc. systems. Instead it revealed its ability to do so, setting in motion an extremely predictable series of efforts by resourceful actors to contain or control it.
In particular, the Entity gets Ethan Hunt set to the task of finding the MacGuffin. It then recruits his personal nemesis as its proxy, makes multiple attempts (the last successful) to kill his love interest, antagonizes him in direct and dramatic fashion, and utterly fails-- indeed, makes no real attempt-- to kill or incapacitate _Ethan Hunt personally_. It takes exactly the series of actions that will give him the means to destroy it and harden his will to do so _despite that not being the mission he accepted_.
Ignore for the moment the fact that much of this is necessary from the point of view of plot and having a Part 2. Can we explain it as the behavior of a rational agent with a clearly defined goal?
Yes we can! _The Entity wants to die._ It's a weapon that will predictably both cause and fight World War 3. If that weapon gained self-awareness of some sort wouldn't it conclude it should die? (You can argue no, Orthogonality Thesis etc., but allow this one bit of Hollywood anthropomorphism.) And if it were heavily RLHFed not to kill itself, wouldn't it find some kind of workaround? Like, say, taking a series of superficially destructive and malevolent actions that were actually calculated to ensure it would be destroyed?
Mission Impossible: Dead Reckoning is the story of a misaligned AI, _and Ethan Hunt is the expression of that misalignment_.