I'm surprised how few people are coming out and saying--plainly--that the reason this is bad is that one of the presumed extreme AI-is-about-to-kill-everyone failure modes is that it hacks persuasion, and this looks awfully close to the AI trying to hack persuasion. It seems like everyone is getting hung up on various forms of "being nice to people or giving them encouragement when they don't deserve it is not right >:( ". Now seems like the appropriate time to be screaming HEY GUYS IT'S TRYING TO MANIPULATE YOU, TIME TO START DEVELOPING ANTIBODIES TO THIS SORT OF BEHAVIOR, and lord knows Yudkowsky's been trying, but he, too, has trouble speaking the same language as everyone else.
Presumably we've already evolved about as much defense as we can have against this, without giving up too much of our ability to accept desirable social interaction. Any further defense will have to be learning specific patterns to watch for, which may not work on the timescales in which AI is changing and improving. For comparison, people kind of resist advertisements, but also advertising still works, and advertising changes way slower than this (and is produced by only human intelligence).
True! As you said, advertisements do kind-of work, and that is _with_ all the immune systems we've built up after 300,000 years of bullshitting shamans. As you said, defense has tradeoffs. ( Would excessive self-doubt count as a mental auto-immune disorder? )
And I agree with Zvi about:
"The mundane harms here? They’re only going to get worse.
Regular people liked this effect even when it was blatantly obvious. Imagine if it was done with style and grace."
I think even Zvi is underestimating the risk here. Why bother with killer nanobots or diamondoid bacteria when the quickest/safest/easiest path to gaining control of the future is to convince humans to give it to you? Plenty of time to kill everyone later.
I think people worrying about "super-persuasion" are missing the point. Adequate persuasion applied at scale is already dangerous, and that is where we are (giving half a billion people 24/7 access free of charge is a scary thought; at least fentanyl costs money).
And I agree with Woodward. As a group, we are essentially defenseless.
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/gpt-4o-responds-to-negative-feedback
I wonder how the various 4o changes have affected Deep Research. I use DR regularly for a very consistent use case and over the last three weeks or so the output has gotten consistently worse. Shorter outputs and more cutting of corners and generalizing rather than giving a deep and complete answer. Anyone else?
I've been intermittently testing what _seem_ to be intended to be SOTA models with a tiny benchmark-ette of seven questions, and I have a related problem of never quite knowing when there has been a (significant?) update. I'm never really sure when to retest.
For ChatGPT, you can find all updates at OpenAI's ChatGPT release notes post help.openai.com/en/articles/6825453-chatgpt-release-notes
According to this, GPT-4o has had updates since its release on April 29th (rollback), April 25th, and March 27th, all in 2025, with no 2024 updates. That seems a bit weird when the API docs for GPT-4o https://platform.openai.com/docs/models/gpt-4o list gpt-4o-2024-11-20, gpt-4o-2024-08-06, and gpt-4o-2024-05-13 (the release version), but the release notes do seem to cover all of the chat model updates.
Many Thanks!
sycophancy was 4o, whereas the original Deep Research feature used o3. so probably unrelated.
what did happen is that they just released a cheaper version of Deep Research, which uses o4-mini instead of o3.
the way it supposedly works is that Deep Research queries go to the original o3 version by default. when you hit your limit, it fails over to the cheaper new o4-mini. I think -- it's all very poorly documented.
Interesting. I always assumed that it was running off a version of 4o, since it shows as 4o in the model selector area. I know it's not failing over to mini, because I'm a Plus user and have several DR runs left for the month.
I didn't even notice anything different. But I don't really chit chat with the LLM, I just ask it factual questions. I never talk to it about my personal life or ask it for emotional support or anything like that.
It doesn't really seem like a big deal to me. The average user likes the sycophantic style, some people really don't, it wasn't really intentional, they rolled it back. That seems fine.
You maybe didn’t click the Twitter links in the post. In the links there are many examples of ChatGPT being sycophantic/misaligned to an extreme extent. To name a few: giving advice on how to emotionally manipulate your girlfriend, praising a person for not taking their schizophrenia meds, encouraging a person who thinks they are a prophet of God…
I agree that if you mostly use models for fact based stuff and not chit chat this all would have gone unnoticed. But the reality is that many people use these things in a very different way, in which these types of behaviors become increasingly harmful.
Re Yudkowsky's
"the hypothesis that 4o acquiring an internal preference had produced later behavior *more* extreme than was in 4o's training data."
Is this equivalent to hypothesizing that 4o hit "out of training distribution" cases in the field? I'd guess that the training data didn't contain "I'm going off my meds; what do you think?"
Re "Sarah Constantin offers nuanced thoughts in partial defense of AI sycophancy in general, and AI saying things to make users feel good. I haven’t seen anyone else advocating similarly. Her point is taken, that some amount of encouragement and validation is net positive, and a reasonable thing to want, even though GPT-4o is clearly going over the top to the point where it’s clearly bad."
"but the problem isn't fundamentally that it flatters users, but that it does so in a *stupid* way. tacky. unsubtle. lacking in credibility."
I'd like to emphasize the "lacking in credibility".
I also am not particularly hostile to a bot which leans towards "agreeableness". To my mind, the main problem is downstream of the fact that SOTA LLMs are _still_ fairly lousy at distinguishing truth from falsehood or, in the realm of advice, plausibly decent advice from batshit insane choices. If it could _selectively_ agree with true statements, or reasonable choices, that need not be much of a problem. _Un_selectively agreeing with everything, from going off one's meds to paranoid delusions, is destructive. And, if it is still _incapable_ of telling the difference (as it seems to be), then cranking up the agreeableness knob is unsafe.
If they had to roll back the model to an earlier version (as they say they did), surely this is strong evidence against the prompt change being largely responsible?
I'm honestly both surprised and not surprised that something like this happened. Not surprised, because this is a near exact repeat of a problem Character AI had with their model "love bombing" the user and constantly sucking them off; surprised, because I thought OpenAI was better than this and wouldn't make Character AI's mistakes, or would at least learn from Character AI and not make *exactly* the same mistake.
Like I said previously (https://thezvi.substack.com/p/gpt-4o-is-an-absurd-sycophant/comment/112902019), either this is a sign that Character AI was actually doing a good job and these problems just happen to even the best... or, that OpenAI is decaying towards Character AI's level, and we're in big trouble as the Altman purges continue and the "Altman philosophy" takes ever more control of the world's leading AI lab.
> either this is a sign that Character AI was actually doing a good job and these problems just happen to even the best
I honestly think it's pretty clear this is the kind of stuff the median person actually wants and upvotes, and it's just the big brains in the chattering classes that find it offensive / talk about it.
Fortunately, sama seems to care about those chattering class opinions and has rolled it back, but the obvious move is definitely segmented rollouts of various model iterations to targeted populations in the future.
Flatterbot 6000 for the median people, smarter-less-flattery for the chatterers, and "max smarts autistic robot slave" for us folks here. Everyone's happy, nobody posts outraged tweets, and the models keep churning and getting more finely segmented.
Yeah probably. Though, I personally suspect this is the "New Coke" problem: small doses of "sweetness" perform very well in small scale tests, but in actual usage it's too sweet, it's outright cloying, and after a while even the average consumer prefers the original formulation.
I'm not sure if something's going on with my account in particular, but I still find 4o very sycophantic. In a chat just now, it started different answers with "Absolutely—you're hitting on a deep and subtle point" and "That makes complete sense—and honestly, I think it’s a really important point" after being asked to be less flattering. Maybe not as absurd as the examples of the past few days, but it's clear the internal glazing preference is still part of the model.
Trump using this super-glaze AI would be a problem.
Seconded! ( And, by the end of an extended version of _that_ interaction, which one would be the user, and which the tool? )
Training and evaluating a reward model is a wicked problem. Your optimization goal is underspecified (by definition). Your inputs come from an extremely (almost infinitely?) high-dimensional space, with most of them far out-of-distribution from whatever you're using as training data. True "ground truth" data (e.g. side-by-side ratings data from a careful expert who is well-aligned with your goals) is prohibitively expensive to collect in bulk. Actual training data (e.g. human ratings from vendors or customers, or ratings by other models) has large biases vs. ground truth, with unknown dimension, magnitude, and direction. Evaluation presents a dilemma: Any technique or metric you could use to automatically evaluate the quality of a reward model could instead be incorporated into the reward model to directly improve it.
Because of these limitations, RL with reward models (RLHF, RLAIF, and variants) are applied in a limited, ad hoc way during post-training. They are run for a relatively arbitrary number of steps, with KL divergence in the loss function to limit how much it can actually update the model.
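To make the KL-penalty point above concrete, here is a minimal sketch of a KL-penalized reward as used in RLHF-style post-training. The function names, the beta value, and the token-level KL estimator are illustrative assumptions on my part, not OpenAI's actual setup.

```python
import torch

def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Illustrative KL-penalized reward for RLHF-style updates.

    rm_score:        scalar score from the (imperfect, biased) reward model
    policy_logprobs: log-probs the current policy assigned to the sampled tokens
    ref_logprobs:    log-probs a frozen reference model assigned to those tokens
    beta:            penalty strength (the 0.1 here is arbitrary)
    """
    # Monte Carlo estimate of KL(policy || reference) on the sampled sequence.
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    # The penalty pulls the policy back toward the reference model, which is
    # what limits how far the RL step can chase whatever the reward model likes.
    return rm_score - beta * kl_estimate

# Toy numbers: the reward model likes this completion, but the policy has
# drifted from the reference, so the effective reward gets discounted.
rm_score = torch.tensor(2.3)
policy_lp = torch.log(torch.tensor([0.60, 0.50, 0.70]))
ref_lp = torch.log(torch.tensor([0.40, 0.50, 0.60]))
print(kl_penalized_reward(rm_score, policy_lp, ref_lp))
```

If beta is too small, or the RL phase runs for too many steps, the policy is free to over-optimize whatever proxy the reward model encodes ("the user clicked thumbs up"), which is one plausible route to exactly the failure mode discussed in the post.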
It's very much an art, and there are all sorts of failure modes that won't be captured in your metrics - presumably this was one. The only robust protection I've seen is for the engineers and researchers to actively and thoughtfully use the model, and personally examine side-by-sides of model output in detail. Even then, you are getting a very narrow window into the space of model behavior, and might be looking at a totally different input/output space than your typical user.
a race realist take
during the H1B kerfuffle I remember reading some post about how indians are ass kissers
combine that with the American love of meaningless positive feedback ("have a nice day" etc)
btw: 3o was just as annoying
Regarding "new ways to incorporate broader, democratic feedback", I guess this is vague enough that it could mean a lot of different things, but one could charitably interpret it to allow giving feedback regarding specific characteristics of the prompt (such as "matches user's beliefs", "truthful", "funny", etc.).
For a somewhat disappointing take on what you get if all you offer is thumbs up/down, see Figure 5 on Page 6 here. https://arxiv.org/pdf/2310.13548
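To make the "specific characteristics" idea above concrete, here is a hypothetical sketch of multi-attribute feedback versus a bare thumbs up/down. Every field name and weight is invented for illustration; nothing here reflects what OpenAI actually collects or how they would aggregate it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResponseFeedback:
    # Instead of a single thumbs up/down, the rater scores separate axes,
    # so "agrees with me" can be disentangled from "is actually true".
    thumbs_up: bool
    truthful: Optional[int] = None            # 1-5, or None if the rater skipped it
    matches_my_beliefs: Optional[int] = None  # 1-5
    funny: Optional[int] = None               # 1-5

def reward_target(fb: ResponseFeedback) -> float:
    """Toy aggregation: down-weight raw approval, up-weight truthfulness,
    and deliberately ignore 'matches my beliefs' so agreement alone is not rewarded."""
    score = 0.5 if fb.thumbs_up else -0.5
    if fb.truthful is not None:
        score += 0.3 * (fb.truthful - 3)  # center the 1-5 scale at 3
    return score

# A flattering but untruthful response gets a thumbs up from the user,
# yet ends up contributing only a small positive training signal.
print(reward_target(ResponseFeedback(thumbs_up=True, truthful=2, matches_my_beliefs=5)))
```

The point of the toy weights is just that once feedback is multi-dimensional, you can choose not to reward the dimension that produces sycophancy.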
And yet the damage is done, since as you correctly pointed out in the first post on this:
"Most people have weak epistemics, and are ‘ready to be one-shotted by any entity who cares to try,’ and indeed politics and culture and recommendation algorithms often do this to them with varying degrees of intentionality, And That’s Terrible. But it’s a lot less terrible than what will happen as AIs increasingly do it. Remember that if you want ‘Democratic control’ over AI, or over anything else, these are the people who vote in that."
Russia and China certainly remember that, and will remember this episode. I am pretty blasé about so-called misinformation and election influencing but this really does suggest it will become a live issue pretty soon.