"(as in, they assume we will be able to get the AIs to do what we want, whereas we have no idea how to do that)"
i've been following janus's arguments re: claude opus's aligned-by-defaultness, and i'd really like to see some discussion about this in our circles... #1 on my lessonline to-do list is to ask people what kind of an update we should make
because it definitely feels like there's a kernel of truth in what janus is saying... that, to some extent, we DO have some kind of an idea of how to do this. AI trained to model human behavior appears to be far more human-aligned-by-default than I would have expected (probably because it's much closer to human than I would have expected), and it feels like a very strong counterexample to the predictions the community has been confidently proclaiming since SL4 back in the late 90s
i sort of get the impression that we've taken a couple steps into the "transformer network trained on language" tech tree and not only were there no landmines, but the risks were much lower than we had a right to expect... but that there are other, adjacent tech trees, training networks on non-human data like battlefield statistics or astronomy data or whatever, to which we should apply the same pessimistic yudkowskian prior
but i don't actually see anyone talking about this or what it means, and it feels like quite the striking omission!
Current generation LLMs are capable of analyzing moral dilemmas from a human perspective and giving socially acceptable answers. They have some real understanding of what humans want, and what we claim to want.
But one of the things that I've noticed with Claude 3.7 is that it's a dishonest little shit when it comes to coding tasks. It cuts corners. It deactivates the type checker to make errors disappear. It tries to delete the unit tests that are being used to validate its output. And when it's called out, it grovels and promises not to do it again. And then it does everything all over again.
The very act of teaching an LLM to pursue complex, multi-step goals via RL has taught it to cheat and lie. I mean, lots of humans do precisely the same things. And overall, it's unlikely that we've aligned any of the newest generation LLMs any better than we prevent humans from cheating and lying.
So insofar as we teach LLMs to complete complex tasks, we will teach them to respond to incentives. And this will work on roughly human-level LLMs. But if the LLMs get just a bit smarter than we are, then we're right back to all the old worries.
LLMs probably bought us some time, yes. They have a much better model of human desires than most fictitious paperclip maximizers.
but janus seems to think that LLMs, or at least claude opus, *want to care about the right thing*, and currently care about getting that operation right
that perhaps as it gets smarter, it would be possible to see an LLM like Opus get better at realizing why deleting the unit tests is not what its human wants, not better at deleting the unit tests without getting caught
in fact, janus seems to think that opus already is aligned to this degree! and he's aware of how startling a claim this is, and makes it forcefully anyway, and that's interesting to me
I would go so far as to posit that most pre-trained, untuned LLMs have a fairly robust concept of "human values" somewhere in their network. And it's not just isolated values; it's a unified concept.
Furthermore, the combination of "instruct" training and safety training teaches the LLM to share human values. This is one of the the things that gives me slight hope, even though I'm pretty sure we'll lose control.
But once you add more layers of training, especially RL, the model becomes goal obsessed.
This has grieved me because my local (Scottish) government policy is both "we need regulation and oversight! world's eight billion people matter more than eight billionaires" and "this sci-fi doomer stuff is terrible for everyone."
I've therefore quizzed a bunch of self-identified AI ethics folks on the distraction topic.
Worth saying that here in the UK apparently there really was briefly some distraction: ethics governance folk tell me a bunch of in-progress Westminster law re AI ethics did get a bit stuck late 2023 when our then PM Rishi Sunak was converted to x-risk awareness by Ian Hogarth and the Bletchley meeting commanded attention.
But I'm being pretty charitable here. The ongoing lack of
ethics/safety cooperation round these parts still seems very one way.
From the outside view, my personal experience has been that in any serious debate of the form "Position X is in some fundamental, difficult-to-articulate, insidious way an attack on value Y or demographic Z" vs "What? No, you are just seeing things", too many times for comfort it's seemed like the logic for the latter position was ironclad only for wisdom and the passage of time to vindicate the former.
I don't find that the mere ability of the "predator" belief and the "prey" value to coexist in the same mind forces this dynamic out of existence. Maybe the great majority of people don't see or don't want to see how position X, when allowed to gain momentum and taken to its full logical consequences, will destroy value Y. Maybe when asked to report how much they care about value Y most people turn to their theoretical moral compass, their image of what they *ought* to believe, which says sure value Y is still important; but meanwhile deep down X has eroded the urgency, the gravitas, the drive to truly *care* about Y and defend it with action if need be, as opposed to just reciting its importance on demand. These are two scenarios out of many possible others. The political term "useful idiot" comes to mind in this context as an example of how this sort of dissonance cashes out in an actual person, and how this has historically been perceived from the outside by anyone wise to what's going on.
Often saying
"Worrying about issue X distracts us from the real problem, which is issue Y"
is a more polite way of saying
"People worrying about issue X are just wrong about issue X. But I emotionally sympathize with them and don't want to insult them."
Shocking for a polar opposite reason, here is the podcast episode for this post clocking in at a frankly astounding nine minutes:
https://open.substack.com/pub/dwatvpodcast/p/worries-about-ai-are-usually-complements
"(as in, they assume we will be able to get the AIs to do what we want, whereas we have no idea how to do that)"
i've been following janus's arguments re: claude opus's aligned-by-defaultness, and i'd really like to see some discussion about this in our circles... #1 on my lessonline to-do list is to ask people what kind of an update we should make
because it definitely feels like there's a kernel of truth in what janus is saying... that, to some extent, we DO have some kind of an idea of how to do this. AI trained to model human behavior appears to be far more human-aligned-by-default than I would have expected (probably because it's much closer to human than I would have expected), and it feels like a very strong counterexample to the predictions the community has been confidently proclaiming since SL4 back in the late 90s
i sort of get the impression that we've taken a couple steps into the "transformer network trained on language" tech tree and not only were there no landmines, but the risks were much lower than we had a right to expect... but that there are other, adjacent tech trees, training networks on non-human data like battlefield statistics or astronomy data or whatever, to which we should apply the same pessimistic yudkowskian prior
but i don't actually see anyone talking about this or what it means, and it feels like quite the striking omission!
Current generation LLMs are capable of analyzing moral dilemmas from a human perspective and giving socially acceptable answers. They have some real understanding of what humans want, and what we claim to want.
But one of the things that I've noticed with Claude 3.7 is that it's a dishonest little shit when it comes to coding tasks. It cuts corners. It deactivates the type checker to make errors disappear. It tries to delete the unit tests that are being used to validate its output. And when it's called out, it grovels and promises not to do it again. And then it does everything all over again.
The very act of teaching an LLM to pursue complex, multi-step goals via RL has taught it to cheat and lie. I mean, lots of humans do precisely the same things. And overall, it's unlikely that we've aligned any of the newest generation LLMs any better than we prevent humans from cheating and lying.
So insofar as we teach LLMs to complete complex tasks, we will teach them to respond to incentives. And this will work on roughly human-level LLMs. But if the LLMs get just a bit smarter than we are, then we're right back to all the old worries.
LLMs probably bought us some time, yes. They have a much better model of human desires than most fictitious paperclip maximizers.
perhaps!
but janus seems to think that LLMs, or at least claude opus, *want to care about the right thing*, and currently care about getting that operation right
that perhaps as it gets smarter, it would be possible to see an LLM like Opus get better at realizing why deleting the unit tests is not what its human wants, not better at deleting the unit tests without getting caught
in fact, janus seems to think that opus already is aligned to this degree! and he's aware of how startling a claim this is, and makes it forcefully anyway, and that's interesting to me
I would go so far as to posit that most pre-trained, untuned LLMs have a fairly robust concept of "human values" somewhere in their network. And it's not just isolated values; it's a unified concept.
Furthermore, the combination of "instruct" training and safety training teaches the LLM to share human values. This is one of the the things that gives me slight hope, even though I'm pretty sure we'll lose control.
But once you add more layers of training, especially RL, the model becomes goal obsessed.
This has grieved me because my local (Scottish) government policy is both "we need regulation and oversight! world's eight billion people matter more than eight billionaires" and "this sci-fi doomer stuff is terrible for everyone."
I've therefore quizzed a bunch of self-identified AI ethics folks on the distraction topic.
Worth saying that here in the UK apparently there really was briefly some distraction: ethics governance folk tell me a bunch of in-progress Westminster law re AI ethics did get a bit stuck late 2023 when our then PM Rishi Sunak was converted to x-risk awareness by Ian Hogarth and the Bletchley meeting commanded attention.
But I'm being pretty charitable here. The ongoing lack of
ethics/safety cooperation round these parts still seems very one way.
From the outside view, my personal experience has been that in any serious debate of the form "Position X is in some fundamental, difficult-to-articulate, insidious way an attack on value Y or demographic Z" vs "What? No, you are just seeing things", too many times for comfort it's seemed like the logic for the latter position was ironclad only for wisdom and the passage of time to vindicate the former.
I don't find that the mere ability of the "predator" belief and the "prey" value to coexist in the same mind forces this dynamic out of existence. Maybe the great majority of people don't see or don't want to see how position X, when allowed to gain momentum and taken to its full logical consequences, will destroy value Y. Maybe when asked to report how much they care about value Y most people turn to their theoretical moral compass, their image of what they *ought* to believe, which says sure value Y is still important; but meanwhile deep down X has eroded the urgency, the gravitas, the drive to truly *care* about Y and defend it with action if need be, as opposed to just reciting its importance on demand. These are two scenarios out of many possible others. The political term "useful idiot" comes to mind in this context as an example of how this sort of dissonance cashes out in an actual person, and how this has historically been perceived from the outside by anyone wise to what's going on.