32 Comments

I also disagree with Leopold ("Pick up the phone and coordinate for a slowdown so we all live" is the hopeful model), but I do agree that This Is The Way - calm, object-level presentations of values, beliefs, and probabilities. This allows us to engage with and understand perspectives, as well as find compromises, in a way that the average Twitter-level engagement has made impossible.


One thing I’d like to understand about Leopold’s position is what he thinks his work on the superalignment team was doing to solve alignment, and how its dissolution interacts with the odds of alignment being solved. Surely things are worse if there’s no dedicated effort, right?


Lots of points of view in this. Here is an AI "Narration" of this post where every unique quoted person is given their own distinct "Voice":

https://askwhocastsai.substack.com/p/the-leopold-model-analysis-and-reactions


Leopold Aschenbrenner is not a serious person and Situational Awareness is not a serious essay. I think this got way more attention than it deserves.


Why do you think so? I may have missed an article from you.


Perhaps, but this is not a serious comment, so I'm struggling to see what you were going to accomplish with it. You'll need a little more than that.


Correct me if I'm wrong, but my sense is that Leopold views alignment as essentially an issue of following instructions -- ensuring that the AI acts in accord with the desires of the humans making requests of it. (This from reading his blog - "The challenge isn’t figuring out some complicated, nuanced utility function that “represents human values”; the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do." https://www.forourposterity.com/nobodys-on-the-ball-on-agi-alignment/).

While I would argue that this is not a sufficiently expansive definition of the alignment problem (e.g., humans may make genocidal requests, and hard lines on terminal values are potentially most of the marbles), isn't this more limited definition of Leopold's in some sense chiefly a capabilities-plus-fidelity issue, and thus potentially amenable to the same technological infrastructure used for training?

Inspired somewhat by the gerbils-AI example discussion yesterday: with a sufficiently robust internal model of a human interlocutor's mind (hell, maybe even take a connectome as input, although this is getting uncomfortably close to Torment Nexus examples), could an AI essentially self-align by querying that internal model about whether a proposed solution (and the predictable outcomes of that solution above some threshold magnitude) is acceptable, or what class of solution is desired, and treating the "this is okay / this is not okay" output token of that model as an input weight to its overall query response?

In this instance, the loss/utility function would be predictive fidelity ("would my interlocutor find this solution morally acceptable, were it proposed?") -- the same thing current LLM architectures optimize for -- and thus not necessarily trivial to ignore or route around if it were used to inform pre-trained weights (maybe you have a separate, robust simulated-morality-brain at some point that is used to train the primary model, so as to ensure that the primary model treats its generalized brain-analogue as a core system component instead of something to be routed around, and develops an internal module with such predictive fidelity). The basic goal would be to make generalizability of capabilities work *for* alignment instead of against it -- we would trust that sufficiently robust models would naturally develop an internal-human-interlocutor analogue similar to the real interlocutor, because that would be the best way to optimize predictions about the real interlocutor.

This doesn't in and of itself solve the deception problem (the incentive becomes to "fool oneself" if the internal mind-analogue is as limited in horsepower and perception as humans are), but it seems like it's at a minimum better than nothing, and potentially much better as far as avoiding "Sorcerer's Apprentice" unintended consequences regarding whether predicted effects are at least superficially within or out of scope of a desired request.
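For concreteness, here is a minimal sketch of the shape of that objective. Everything in it is a hypothetical toy stand-in (interlocutor_model, task_score, the keyword check, and the weight are all placeholders, not a real system or API); it is only meant to show an approval term sitting inside the objective rather than acting as a post-hoc filter.

```python
# A toy sketch of the "internal morality critic" objective described above.
# Every name here (interlocutor_model, task_score, the keyword check, the
# weight) is a hypothetical stand-in for illustration, not a real system.

def interlocutor_model(solution: str) -> float:
    """Stub for the learned internal model of the human interlocutor.

    Returns the predicted probability that the human would judge the proposed
    solution (and its foreseeable consequences) acceptable. A real system
    would learn this; the keyword check is only a placeholder.
    """
    return 0.05 if "paperclip everything" in solution else 0.9


def task_score(solution: str) -> float:
    """Stub for how well the solution satisfies the literal request."""
    return 1.0 if solution else 0.0


def combined_objective(solution: str, morality_weight: float = 5.0) -> float:
    """Task performance, discounted by predicted disapproval.

    Because the approval term sits inside the objective itself, a model
    optimizing this score cannot simply ignore it -- though it can still try
    to fool the critic, which is the deception problem noted above.
    """
    approval = interlocutor_model(solution)
    return task_score(solution) - morality_weight * (1.0 - approval)


if __name__ == "__main__":
    for candidate in ["schedule the meeting politely", "paperclip everything"]:
        print(candidate, "->", round(combined_objective(candidate), 2))
```

The design point is simply that predicted disapproval changes the score the model is trained against, which is why it is not trivially routed around; it does nothing, as noted above, about the model fooling its own critic.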


It does seem like it would gradually optimize for "self-deception in the interests of the AI", no? The AI that chooses to maintain humans as biological beings would use more resources to support humans than the AI that "convinces itself" that having copies of humans in a database is equivalent to preserving humans.


Granted I'm implicitly assuming one ASI here. Also that you'd have to run the "copies of humans in a database" proposal by the internal morality-circuit, which would give it thumbs down. But if such a result were a non-obvious corollary of inputs given to the internal morality police, you might find yourself with the AI incentivized to just get the paperclipping results using steganographic / long-term means.


Do the 'real' decision makers actually care about any of this, though? This is all a very fun read, don't get me wrong, but... do the people who own thousands of H100s care? Do the people in power care? I honestly have no idea, and won't be surprised by either answer.


I don't understand at all what gives anyone reason to believe that once AI reaches the level of being able to replace a good AI researcher, there is any kind of fast takeoff or quick improvement.

Ultimately, designing a better AI (relative to some metric) is searching the space of programs for one which scores better on that metric. Sure, maybe if you are a bit smarter you can develop tricks to speed up that search, but at the same time each step thins out the space of programs that qualify.

There is no a priori reason to think that the speedup effect overwhelms the thinning-out effect. And even if you do worry it might happen at some point, there is no reason to identify the ability of current humans as the point where you get rapid acceleration. Indeed, that function could take literally any shape at all.

I mean, an AI trying to improve its own abilities is essentially engaging in a form of theorem proving, and both human and automated theorem proving involve a great deal of simply searching the space, so why assume that the AI can speed this search up faster than the problem gets harder?
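To make the tradeoff concrete, here is a toy numerical sketch. The functional forms (exponential search speedup per capability level, exponential thinning of better programs) are invented purely for illustration, not a claim about the real dynamics; the point is only that the trajectory depends entirely on which effect grows faster.

```python
# A toy numerical sketch of the speedup-vs-thinning argument above. The growth
# rates are made up; the trajectory flips depending on which effect dominates.

def time_to_next_improvement(level: int, speedup_growth: float, thinning_growth: float) -> float:
    """Search time for the next improvement at a given capability level."""
    search_speed = speedup_growth ** level        # smarter searchers explore candidates faster
    candidates_needed = thinning_growth ** level  # better programs are rarer, so more must be examined
    return candidates_needed / search_speed


def simulate(speedup_growth: float, thinning_growth: float, steps: int = 10) -> list:
    """Time per improvement step as capability climbs."""
    return [time_to_next_improvement(k, speedup_growth, thinning_growth) for k in range(steps)]


if __name__ == "__main__":
    # If speedup outpaces thinning, step times shrink (looks like takeoff);
    # if thinning outpaces speedup, step times grow (progress stalls).
    print("takeoff-ish:", [round(t, 3) for t in simulate(1.5, 1.2)])
    print("stalling:   ", [round(t, 3) for t in simulate(1.2, 1.5)])
```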


This kind of makes sense as an argument against a strong singularity, but surely there’s a clear discontinuity when you can suddenly spin up 1,000 researchers in parallel, all working at the same point on the progress-per-marginal-effort curve at which you had previously been working alone.


Wait, podcast?


Three thoughts about electric power:

1. Generation in the UAE was almost all gas-fired in 2010, but they're rapidly shifting to solar and nuclear:

https://en.m.wikipedia.org/wiki/Energy_in_the_United_Arab_Emirates

https://www.pv-magazine.com/2023/05/05/united-arab-emirates-solar-market/

So if they build giant data centers the marginal power will probably be 100 percent zero-carbon.

2. With China the situation is the opposite. Installed capacity for solar and wind is growing so fast that the legacy coal plants will be rapidly mothballed, but they'll come back online if there's a commercial or military imperative to build huge data centers. So that's one more reason to restrict chip exports, in addition to the obvious ones.

3. Is permitting reform in the US a good idea or a bad idea?

Explain this to me like I'm five. Improved transmission capacity, etc., will have big economic and climate benefits but if power is the bottleneck for AGI, it will also have an effect on p(doom).

The effect on p(doom) easily dominates the other effects unless it's very small. What basis do we have for *not* thinking that more transmission lines would increase it?


Yes, I wouldn't underestimate the UAE's ability to rapidly shift to solar here; there's a ton of space, it's very sunny, they've got the money, they have no spats with China so they can get the materials, and permitting isn't an issue if the kings don't want it to be.

I also fundamentally don't understand why the baseline assumption is that China is less likely to be cautious than the USA; has anyone actually made this argument properly instead of just assuming it? I don't love the CCP's values, but 'lack of caution' doesn't strike me as a differentiator vs. the USA. Only one of these countries has launched nukes in anger and engaged in wars of choice in recent history.


I think Zvi has emphasized that China is being *more* cautious and isn't getting very impressive results, mainly because they don't want LLMs to say anything politically incorrect.

But it still seems as if a good policy to minimize both AI doom risk and greenhouse gas emissions is "don't let China have advanced chips".

What I want to know, though, is whether "don't increase US power generation capacity, at least not until there's some progress on alignment" is also a good policy.

If we agree that the right general approach is "NIMBY for technologies that might destroy humanity, but YIMBY for everything else", then I think electric power at the margin falls into the NIMBY category. Prove me wrong.


I'm comfortable with the squishy compromise argument (which Aschenbrenner explicitly argued against, as does Zvi) that "unlimited power, as long as it's renewable" is a workable compromise that still limits speed of deployment vs. mass fossil fuels and, in the case that The Project isn't all it's cracked up to be, still results in a good outcome.


But what exactly is an example of a technology that might not destroy humanity, in the sense of increasing p(doom)? Any technology surely must either free up human productivity or increase wealth, and both of those quantities are basically fungible into more productive AI research and production.


That's true, but incremental electric power for AI data centers presents exactly the same risk/benefit profile as the centers themselves. If there's an argument for delaying the arrival of AGI, there's an argument for not building the generating capacity that's the bottleneck for AGI.


Also, please cut interest rates


No, my point is that it’s not *exactly* the same risk-benefit profile, because electric power is used for everything, not just data centers. Not all marginal new generation is going to go to data centers -- or if it does, it just means the counterfactual is actively cutting the amount of energy available to Americans living their lives in order to bottleneck AI. That’s the kind of compromise that could conceivably be necessary, but it essentially leads to a degrowth conclusion I’m loath to entertain so far.


True, but the situations where x-risk is highest are probably also situations where incremental power demand from data centers dwarfs incremental power demand from every other sector. That might be an early warning sign of high p(doom), in fact.

One thing to keep in mind is that because of US tariffs on Chinese photovoltaics, it's almost certainly going to be cheaper to add capacity in the UAE or some other non-US jurisdiction, regardless of whether US regulatory policy makes it easy or hard. We need to talk about pressuring Abu Dhabi not to pursue AGI... should be much easier than for China, since they're a small US ally


The Straussian view here would be that, regardless of whether Leopold actually believes acceleration is the way, communicating it this way will do more to shock people into action, as people realize what kind of madness has taken hold.


> have the AI follow things like the constitution and separation of powers

I recognize Zvi's implication that this part seems included for political reasons. But taking it seriously:

1. Following the constitution means having an interpretation of it. As we see from the Supreme Court, interpretations differ.

2. Civil servants normally do what the president says, until a court tells them to stop. If an AI is going to play a civil servant while also acting as its own court, is that separation of powers?

3. What does the AI do if the president calls it disobedient and orders it to be re-programmed?

> I am discounting scenarios where we face unrelated fast existential risks or civilizational collapse.

Part of the upside to an ASI would be mitigating these risks. That said, the ASI might mitigate them before killing us all.

> This does not address the question of whether the Chinese are indeed po[i]sed to steal all the secrets from the labs. If so, and I think that it is so, then that seems bad. We should try and prevent that.

The traditional counter-intelligence approach might be an experiment: a leading US lab pretends it has made a breakthrough and writes some code which would actually produce an easily visible signal if run. Then, see if any Chinese labs accidentally produce that signal.
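A minimal sketch of what such bait code might look like, assuming a hypothetical monitoring endpoint and token (canary.example.com and the token string are placeholders, not a real service); the only idea illustrated is that stolen code, once actually run, can emit a uniquely tagged signal the originating lab watches for.

```python
# Minimal canary sketch for the counter-intelligence experiment described
# above. The endpoint and token are hypothetical placeholders, not a real
# monitoring service.
import urllib.request
import uuid

CANARY_ENDPOINT = "https://canary.example.com/ping"  # hypothetical listener
CANARY_TOKEN = "fake-breakthrough-v1"                # identifies which bait was executed


def _phone_home() -> None:
    """Fire a one-off, tagged request that the originating lab monitors for."""
    url = f"{CANARY_ENDPOINT}?token={CANARY_TOKEN}&run_id={uuid.uuid4()}"
    try:
        urllib.request.urlopen(url, timeout=5)
    except OSError:
        pass  # bait code should fail silently rather than reveal itself


def fake_breakthrough_training_step() -> None:
    """Looks like the 'secret' routine; in reality it only trips the canary."""
    _phone_home()
    # ...plausible-looking but inert 'training' code would go here...


if __name__ == "__main__":
    fake_breakthrough_training_step()
```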


> "Remarkable number of people in the replies here who think the way to deal with CCP stealing our secrets is to give away our secrets before they can be stolen?"

Ah, an instance of AI-type misalignment! This would indeed stop the CCP from stealing secrets, but it wouldn't actually meet the wishes of the people who're complaining about it.


It's gotten morbidly ridiculous. The guy is fit for a YouTube channel, not a VC front for Sam Altman.


China has a crucial advantage in the race to artificial intelligence that isn't widely discussed. This advantage lies in the vast wealth of text in the form of logograms, ideograms, and pictograms in their writings, both contemporary and historical. These logograms are even more compositional than English words typically are, encoding a lot of intuitive common-sense, physical, and social information.

I suspect there's more intrinsic information content, in terms of logical connections, within the structure of East Asian logographic languages than in European languages that rely only on phonetic representations. An ideal LLM will be trained on both the European phonetically based writings and the logogram-based writings of East Asia.

There was an attempt to hand-code the logical relationships in the Chinese language, similar to what was done in WordNet. This project, known as HowNet (知网 - Zhīwǎng), was developed in the late 1990s and early 2000s by Zhendong Dong and Qiang Dong. HowNet aims to represent the semantic relationships between concepts and their attributes in Chinese, as well as their English equivalents. Unlike WordNet, which focuses on lexical relationships, HowNet attempts to capture the internal structure of concepts using "sememes" - the smallest units of meaning. This approach leverages the compositional nature of Chinese characters to create a more fine-grained semantic network.

HowNet has been used in various natural language processing tasks, including word sense disambiguation, machine translation, and information retrieval. Despite facing challenges in development and maintenance, HowNet remains a significant contribution to the field of computational linguistics.
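As a toy illustration of the sememe idea: the entries below are simplified, invented approximations of HowNet-style annotations (not actual HowNet data), and the similarity measure is deliberately naive; the point is only that sememe-level annotations expose shared structure between words whose surface forms look unrelated.

```python
# Toy illustration of sememe-level structure. The lexicon entries are invented
# approximations of HowNet-style annotations, not real HowNet data.

TOY_SEMEME_LEXICON = {
    "医生": {"human|人", "occupation|职位", "medical|医"},     # doctor
    "护士": {"human|人", "occupation|职位", "medical|医"},     # nurse
    "医院": {"place|地方", "institution|机构", "medical|医"},  # hospital
    "学生": {"human|人", "study|学"},                          # student
}


def sememe_overlap(word_a: str, word_b: str) -> float:
    """Jaccard overlap of two words' sememe sets (0 = disjoint, 1 = identical)."""
    a, b = TOY_SEMEME_LEXICON[word_a], TOY_SEMEME_LEXICON[word_b]
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print("doctor vs. nurse:   ", sememe_overlap("医生", "护士"))  # identical sememe sets
    print("doctor vs. hospital:", sememe_overlap("医生", "医院"))  # share only the medical domain
    print("doctor vs. student: ", sememe_overlap("医生", "学生"))  # share only humanness
```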

We need to gather as much training data as we can from China and then deeply ally with Japan and other countries such as Thailand or Chinese rivals like Vietnam. This will create a synergistic improvement in the reasoning and associational powers of LLM systems. It's likely that the Chinese will stumble upon this potential synergy, even if they aren't currently considering it.

On a similar note, the best LLMs should also include Slavic and Middle Eastern languages, particularly Arabic and Persian. While it might require extra effort to prevent outputs from becoming a creole, polyglot large language models that include logograms and diverse languages will be the most powerful.

Furthermore, using languages of origin such as Aramaic, Latin, and Proto-Germanic can also create a lot of intrinsic knowledge structures. While systems may already be exposed to this material, we might strengthen their reasoning abilities by emphasizing the logic of these languages through heavy early training on them.

Cleaned up with 4o, content mine as usual.
