I also disagree with Leopold ("Pick up the phone and coordinate for a slowdown so we all live" is the hopeful model), but I do agree that This Is The Way: calm, object-level presentations of values, beliefs, and probabilities. This allows us to engage with and understand perspectives, as well as find compromises, in a way that the average Twitter-level engagement has made impossible.
One thing I'd like to understand about Leopold's position is what he thinks his work on the superalignment team was doing to solve alignment, and how its dissolution interacts with the odds of alignment being solved. Surely things are worse if there's no dedicated effort, right?
Lots of points of view in this. Here is an AI "Narration" of this post, where every unique quoted person is given their own distinct "Voice":
https://askwhocastsai.substack.com/p/the-leopold-model-analysis-and-reactions
Leopold Aschenbrenner is not a serious person, and Situational Awareness is not a serious essay. I think this got way more attention than it deserves.
Why do you think so? I may have missed an article from you.
Perhaps, but this is not a serious comment, so I'm struggling to see what you were going to accomplish with it. You'll need a little more than that.
Correct me if I'm wrong, but my sense is that Leopold views alignment as essentially an issue of following instructions -- ensuring that the AI acts in accord with the desires of the humans making requests of it. (This from reading his blog - "The challenge isn’t figuring out some complicated, nuanced utility function that “represents human values”; the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do." https://www.forourposterity.com/nobodys-on-the-ball-on-agi-alignment/).
While I would argue that this is not a sufficiently expansive definition of the alignment problem (e.g., humans may make genocidal requests; hard lines on terminal values are potentially most of the marbles), isn't this more limited definition of Leopold's in some sense chiefly a capabilities-plus-fidelity issue, and thus potentially amenable to the same technological infrastructure used for training?
Inspired somewhat by the gerbils-AI example discussion yesterday: with a sufficiently robust internal model of a human interlocutor's mind (hell, maybe even take a connectome as input, although this is getting uncomfortably close to Torment Nexus examples), querying that internal model as to whether a proposed solution (and predictable outcomes of that solution above some threshold magnitude) is acceptable or what class of solution is desired, could an AI essentially self-align by treating the "this is okay / this is not okay" output token of that model as an input weight to its overall query response?
In this instance, the loss/utility function would be predictive fidelity ("would my interlocutor find this solution morally acceptable, were it proposed?"), the same thing current LLM architectures optimize for, and thus not necessarily trivial to ignore or route around if it were used to inform pre-trained weights. (Maybe you have a separate, robust simulated-morality-brain at some point that is used to train the primary model, so as to ensure that the primary model treats its generalized brain-analogue as a core system component instead of something to be routed around, and develops an internal module with such predictive fidelity.) The basic goal would be to make generalizability of capabilities work *for* alignment instead of against it -- we would trust that sufficiently robust models would naturally develop an internal-human-interlocutor analogue similar to a real interlocutor, because that would be the best way to predict a real interlocutor's responses.
This doesn't in and of itself solve the deception problem (the incentive becomes to "fool oneself" if the internal mind-analogue is as limited in horsepower and perception as humans are), but it seems like it's at a minimum better than nothing, and potentially much better as far as avoiding "Sorcerer's Apprentice" unintended consequences regarding whether predicted effects are at least superficially within or out of scope of a desired request.
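In rough terms, a minimal sketch of what that "approval as a weight" loop might look like. Everything here is a placeholder: the stub scoring functions, the candidate list, and the 0.9 floor are invented for illustration, and this only shows where the interlocutor-model's "this is okay" signal would enter selection, not an actual training or inference setup.

```python
# Minimal, hypothetical sketch of the "approval as a weight" idea above.
# Both models are stubbed with toy functions; in the proposal they would be
# the primary model and its learned internal interlocutor-analogue.
import random

def response_quality(prompt, candidate):
    """Stub: the primary model's own score for a candidate answer."""
    return random.random()

def predicted_approval(prompt, candidate):
    """Stub: P("this is okay") from the internal interlocutor model."""
    return random.random()

def answer(prompt, n_candidates=8, approval_floor=0.9):
    candidates = [f"candidate answer {i}" for i in range(n_candidates)]
    best, best_weighted, best_approval = None, -1.0, 0.0
    for c in candidates:
        approval = predicted_approval(prompt, c)
        # Approval multiplies quality, so it shapes which answer is chosen
        # rather than acting as a post-hoc filter to be routed around.
        weighted = response_quality(prompt, c) * approval
        if weighted > best_weighted:
            best, best_weighted, best_approval = c, weighted, approval
    if best_approval < approval_floor:
        return None  # refuse, or ask the human for clarification
    return best

print(answer("How should I allocate this budget?"))
```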
It does seem like it would gradually optimize for "self-deception in the interests of the AI", no? The AI that chooses to maintain humans as biological beings would use more resources to support humans than the AI that "convinces itself" that having copies of humans in a database is equivalent to preserving humans.
Granted I'm implicitly assuming one ASI here. Also that you'd have to run the "copies of humans in a database" proposal by the internal morality-circuit, which would give it thumbs down. But if such a result were a non-obvious corollary of inputs given to the internal morality police, you might find yourself with the AI incentivized to just get the paperclipping results using steganographic / long-term means.
Do the 'real' decision makers actually care about any of this, though? This is all a very fun read, don't get me wrong, but... do the people who own thousands of H100s care? Do the people in power care? I honestly have no idea, and won't be surprised by either answer.
I don't understand at all what gives us any reason to believe that once AI reaches the level of being able to replace a good AI researcher, there will be any kind of fast takeoff or quick improvement.
Ultimately, designing a better AI (relative to some metric) is searching the space of programs for one which scores better on that metric. Sure, maybe if you are a bit smarter you can develop tricks to speed up that search, but at the same time each step thins out the space of programs that qualify.
There is no a priori reason to think that the speedup effect overwhelms the thinning-out effect. And even if you do worry it might happen at some point, there is no reason to identify the ability of current humans as the point where you get rapid acceleration. Indeed, that function could literally take any shape at all.
I mean, an AI trying to improve its own abilities is essentially engaging in a form of theorem proving, and both human and automated theorem proving involve a great deal of simply searching the space, so why assume that the AI can speed this search up faster than the problem gets harder?
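To illustrate the "any shape at all" point, here is a toy model with entirely made-up parameters: each generation searches faster by one factor while qualifying programs thin out by another, and whichever factor is larger determines whether progress accelerates or stalls. Nothing in the setup privileges one outcome.

```python
# Toy model with arbitrary parameters, not a claim about real AI research:
# each generation searches `speedup`x faster, but acceptable improvements
# become `thinning`x rarer. Time per improvement = rarity / search speed.
def time_to_next_improvement(gen, speedup, thinning):
    search_speed = speedup ** gen  # candidate programs evaluated per unit time
    rarity = thinning ** gen       # candidates that must be searched, on average
    return rarity / search_speed

for speedup, thinning in [(2.0, 1.5), (1.5, 2.0)]:
    steps = [time_to_next_improvement(g, speedup, thinning) for g in range(10)]
    trend = "accelerates" if steps[-1] < steps[0] else "slows down"
    print(f"speedup={speedup}, thinning={thinning}: progress {trend}")
```

With the first parameter pair, improvement times shrink each generation; with the second, they grow. Which regime the real world is in is exactly what the argument above says we don't know.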
This kind of makes sense as an argument against a strong singularity, but surely there's a clear discontinuity when you can suddenly spin up 1,000 researchers in parallel, all working at the same point on the progress-per-marginal-effort curve where you had previously been working alone.
Wait, podcast?
Three thoughts about electric power:
1. Generation in the UAE was almost all gas-fired in 2010, but they're rapidly shifting to solar and nuclear:
https://en.m.wikipedia.org/wiki/Energy_in_the_United_Arab_Emirates
https://www.pv-magazine.com/2023/05/05/united-arab-emirates-solar-market/
So if they build giant data centers the marginal power will probably be 100 percent zero-carbon.
2. With China the situation is the opposite. Installed capacity for solar and wind is growing so fast that the legacy coal plants will be rapidly mothballed, but they'll come back online if there's a commercial or military imperative to build huge data centers. So that's one more reason to restrict chip exports, in addition to the obvious ones.
3. Is permitting reform in the US a good idea or a bad idea?
Explain this to me like I'm five. Improved transmission capacity, etc., will have big economic and climate benefits but if power is the bottleneck for AGI, it will also have an effect on p(doom).
The effect on p(doom) easily dominates the other effects unless it's very small. What basis do we have for *not* thinking that more transmission lines would increase it?
Yes, I wouldn't underestimate the UAE's ability to rapidly shift to solar here; there's a ton of space, it's very sunny, they've got the money, no spats with China so they can get the materials, and permitting isn't an issue if the kings don't want it to be.
I also fundamentally don't understand why the baseline assumption is that China is less likely to be cautious than the USA; has someone actually made this argument properly, instead of just assuming it? I don't love the CCP's values, but 'lack of caution' doesn't strike me as a differentiator vs. the USA. Only one of these countries has launched nukes in anger and engaged in wars of choice in recent history.
I think Zvi has emphasized that China is being *more* cautious and isn't getting very impressive results, mainly because they don't want LLMs to say anything politically incorrect.
But it still seems as if a good policy to minimize both AI doom risk and greenhouse gas emissions is "don't let China have advanced chips".
What I want to know, though, is whether "don't increase US power generation capacity, at least not until there's some progress on alignment" is also a good policy.
If we agree that the right general approach is "NIMBY for technologies that might destroy humanity, but YIMBY for everything else", then I think electric power at the margin falls into the NIMBY category. Prove me wrong.
I'm comfortable with the squishy compromise argument (which Aschenbrenner explicitly argued against, as does Zvi) that "unlimited power, as long as it's renewable" is a workable compromise that still limits speed of deployment vs. mass fossil fuels, while also, in the case that The Project isn't all it's cracked up to be, still resulting in a good outcome.
But what exactly is an example of a technology that might not destroy humanity, in the sense of increasing p(doom)? Any technology surely must either free up human productivity or increase wealth, and both of those quantities are basically fungible into more productive AI research and production.
That's true, but incremental electric power for AI data centers presents exactly the same risk/benefit profile as the centers themselves. If there's an argument for delaying the arrival of AGI, there's an argument for not building the generating capacity that's the bottleneck for AGI.
Also, please cut interest rates
No, my point is that it's not *exactly* the same risk-benefit profile, because electric power is used for everything, not just data centers. Not all marginal new generation is going to go to data centers—or if it does, it just means the counterfactual is actively cutting the amount of energy available to Americans living their lives in order to bottleneck AI. That's the kind of compromise that could conceivably be necessary, but it essentially leads to a degrowth conclusion I'm loath to entertain so far.
True, but the situations where x-risk is highest are probably also situations where incremental power demand from data centers dwarfs incremental power demand from every other sector. That might be an early warning sign of high p(doom), in fact.
One thing to keep in mind is that because of US tariffs on Chinese photovoltaics, it's almost certainly going to be cheaper to add capacity in the UAE or some other non-US jurisdiction, regardless of whether US regulatory policy makes it easy or hard. We need to talk about pressuring Abu Dhabi not to pursue AGI... that should be much easier than with China, since they're a small US ally.
The Straussian view here would be that regardless of whether Leopold actually believes that acceleration is the way, communicating it this way will shock people into action more, as people realize what kind of madness has taken hold.
> have the AI follow things like the constitution and separation of powers
I recognize Zvi's implication that this part seems included for political reasons. But taking it seriously:
1. Following the constitution means having an interpretation of it. As we see from the Supreme Court, interpretations differ.
2. Civil servants normally do what the president says, until a court tells them to stop. If an AI is going to play a civil servant while also acting as its own court, is that separation of powers?
3. What does the AI do if the president calls it disobedient and orders it to be re-programmed?
> I am discounting scenarios where we face unrelated fast existential risks or civilizational collapse.
Part of the upside to an ASI would be mitigating these risks. That said, the ASI might mitigate them before killing us all.
> This does not address the question of whether the Chinese are indeed po[i]sed to steal all the secrets from the labs. If so, and I think that it is so, then that seems bad. We should try and prevent that.
The traditional counter-intelligence approach might be an experiment. A leading US lab pretends it has made a breakthrough, and writes some code which would actually produce an easily visible signal if run. Then, see if any Chinese labs accidentally produce that signal.
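One minimal way to make the decoy "produce an easily visible signal" would be an ordinary canary token: the fake breakthrough code quietly fetches a unique, monitored URL whenever it is run, so any copy executed outside the lab announces itself. A hypothetical sketch follows; the URL is a placeholder, and a real deployment would embed a distinct token in each copy.

```python
# Hypothetical canary for a decoy "breakthrough" script: if anyone runs it,
# it fetches a unique, monitored URL. The URL is a placeholder; a real
# deployment would embed a distinct token per copy to identify which leaked.
import urllib.request

CANARY_URL = "https://example.com/canary/7f3a9c"  # placeholder token

def report_canary():
    try:
        urllib.request.urlopen(CANARY_URL, timeout=5)
    except Exception:
        pass  # fail silently so the decoy still looks like ordinary code

report_canary()
# ...the decoy "breakthrough" training code would follow here...
```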
> "Remarkable number of people in the replies here who think the way to deal with CCP stealing our secrets is to give away our secrets before they can be stolen?"
Ah, an instance of AI-type misalignment! This would indeed stop the CCP from stealing secrets, but it wouldn't actually meet the wishes of the people who're complaining about it.
It's gotten morbidly ridiculous. The guy's fit for a YouTube channel, not a VC front for Sam Altman.
China has a crucial advantage in the race to artificial intelligence that isn't widely discussed. This advantage lies in the vast wealth of text in the form of logograms, ideograms, and pictograms in their writings, both contemporary and historical. These logograms are even more constructive than words typically are in English, encoding a lot of intuitive common-sense, physical, and social information.
I suspect there's more intrinsic information content, in terms of logical connections, within the structure of East Asian languages than in European languages that rely only on phonetic representations. An ideal LLM will be trained on both the phonetically based writings of Europe and the logogram-based writings of East Asia.
There was an attempt to hand-code the logical relationships in the Chinese language, similar to what was done in WordNet. This project, known as HowNet (知网 - Zhīwǎng), was developed in the late 1990s and early 2000s by Zhendong Dong and Qiang Dong. HowNet aims to represent the semantic relationships between concepts and their attributes in Chinese, as well as their English equivalents. Unlike WordNet, which focuses on lexical relationships, HowNet attempts to capture the internal structure of concepts using "sememes" - the smallest units of meaning. This approach leverages the compositional nature of Chinese characters to create a more fine-grained semantic network.
HowNet has been used in various natural language processing tasks, including word sense disambiguation, machine translation, and information retrieval. Despite facing challenges in development and maintenance, HowNet remains a significant contribution to the field of computational linguistics.
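For anyone unfamiliar with sememes, here is a purely illustrative toy showing the idea: each word decomposes into a small set of primitive meaning units, so relatedness between concepts can be computed compositionally rather than just lexically. The entries below are invented for illustration, not actual HowNet data.

```python
# Illustrative toy in the spirit of HowNet's sememe decomposition;
# the entries below are invented for illustration, not real HowNet data.
SEMEMES = {
    "医生": {"human", "occupation", "medical", "cure"},   # doctor
    "护士": {"human", "occupation", "medical", "tend"},   # nurse
    "医院": {"place", "institution", "medical", "cure"},  # hospital
}

def sememe_overlap(a, b):
    """Jaccard similarity over shared sememes: a crude relatedness score."""
    sa, sb = SEMEMES[a], SEMEMES[b]
    return len(sa & sb) / len(sa | sb)

print(sememe_overlap("医生", "护士"))  # doctor vs. nurse: high overlap
print(sememe_overlap("医生", "医院"))  # doctor vs. hospital: lower overlap
```

The real HowNet inventory is far richer (and bilingual, as noted above), but the compositional principle is the same.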
We need to gather as much training data as we can from China and then deeply ally with Japan and other countries such as Thailand or Chinese rivals like Vietnam. This will create a synergistic improvement in the reasoning and associational powers of LLM systems. It's likely that the Chinese will stumble upon this potential synergy, even if they aren't currently considering it.
On a similar note, the best LLMs should also include Slavic and Middle Eastern languages, particularly Arabic and Persian. While it might require extra effort to prevent outputs from becoming a creole, polyglot large language models that include logograms and diverse languages will be the most powerful.
Furthermore, using languages of origin such as Aramaic, Latin, and Proto-Germanic can also create a lot of intrinsic knowledge structures. While systems may already be exposed to this material, we might strengthen their reasoning abilities by emphasizing the logic of these languages through heavy early training on them.
Cleaned up with 4o, content mine as usual.