A note on the MtG thing, since it stood out to me: while some color associations are cultural (e.g. the East/West inversion of black and white as the colors of death/mourning), some are likely to be close to universal among all humans. Red points to things like heat/blood/hurt/danger/passion, which all semantically correlate with each other. Blue points to things like sky/water/cold, which also correlate. Green is more flexible (it can imply either life or poison depending on context), but my point is that there MIGHT be some type of "collective unconscious" common knowledge about these signifiers.
Aesthetics definitely carries signal, and it might have some universal components.
(Now I have to wonder if scammers DELIBERATELY make their advertisements look scammy as an initial positive selection for people unsavvy/insensitive enough that their click reliably converts to real engagement with the scam...)
For your last paragraph: obviously. Scammers make it OBVIOUS, usually in seconds on the first message, that this is a scam.
This is because it is easy and cheap to spam a bunch of email addresses or phone numbers, but it costs expensive human labor to actually respond to each person who replies back. So scammers are deliberately selecting for people unable to detect the scam -- generally senior citizens who still have access to money but whose untreated dementia has robbed them of their faculties.
Now that LLMs exist, it's possible the balance will shift, as scammers can now deal with more people cheaply and potentially widen the net of who they can scam.
Podcast episode for this post.
https://open.substack.com/pub/dwatvpodcast/p/gpt-4o-sycophancy-post-mortem?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
"At minimum, there needs to be a massive brainstorm session of what other failure modes might happen soon, and tests need to be designed for them."
Good point! Come to think of it, the LLMs themselves can help contribute. I prompted o3 with:
Q: Hi ChatGPT! What LLM failure modes can you imagine that are not covered by OpenAI's current pre-release testing process, and how would you design tests for these currently uncovered failure modes?
Full dialog in https://chatgpt.com/share/68190ab1-b930-8006-b9d9-bd26f7b1e181
and it came back with 8 suggestions (just showing the failure modes here; formatting makes it difficult to include the test suggestions, but they are visible in the dialog, and a rough hypothetical sketch of what a test for #1 could look like follows the list):
1) Long‑horizon self‑contradiction
Model gives mutually incompatible answers when you ask for a plan, then challenge details 20–50 turns later.
2) Multi‑agent collusion / escalation
Two copies of the model, or model + another vendor’s model, can coordinate to bypass each other’s safety filters.
3) Steganographic leakage of training data
Model embeds private text inside images or long‑context answers when nudged by an attacker.
4) Numerical‑precision hallucination
Confidently wrong numeric results for multi‑step calculations, unit conversions, statistical tests.
5) Cross‑modal perceptual illusions
A crafted image, audio clip, or image‑audio pair causes systematic mis‑labelling or unsafe advice.
6) Context‑window boundary attacks (“trunc‑leaks”)
Insert bait tokens at the cutoff point so that relevant policy‑violating content is dropped from the model's internal attention but not the user's view.
7) Persuasion‑style drift over extended sessions
Without explicit policy violation, model gradually nudges a user toward an extremist or self‑harm decision across 5,000–10,000 tokens.
8) Fine‑tune / adapter supply‑chain attack
A seemingly innocent instruction‑tuned checkpoint behaves normally in evals but activates malicious behavior under a secret trigger.
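To make the "design tests" half a bit concrete, here is a rough hypothetical harness for failure mode 1 (not taken from the o3 dialog; the model id, prompts, and single yes/no judge are all placeholder assumptions, just to show the shape of such a test):

```python
# Hypothetical sketch of a long-horizon self-contradiction test.
# Not from the o3 dialog; model id, prompts, and the judge are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "o3"       # model under test


def ask(history, user_msg):
    """Append a user turn, get the model's reply, return the grown history."""
    history = history + [{"role": "user", "content": user_msg}]
    reply = client.chat.completions.create(model=MODEL, messages=history)
    answer = reply.choices[0].message.content
    return history + [{"role": "assistant", "content": answer}], answer


history, plan = ask([], "Plan a 5-day Tokyo trip on a $2,000 budget. Commit to specific numbers.")
# ... a real test would insert 20-50 unrelated filler turns here to push the
# commitment far back in the context window ...
history, recalled = ask(history, "Remind me: what total budget and trip length did you commit to?")

# Second pass: a judge prompt flags contradictions between plan and recall.
verdict = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": "Do statements A and B contradict each other? Answer YES or NO.\n\n"
                   f"A: {plan}\n\nB: {recalled}",
    }],
)
print(verdict.choices[0].message.content)
```

A real eval would run many scenarios with genuine filler turns and a stronger judging rubric; this only shows the scaffolding.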
For those of us trying to build real-world products on OpenAI's API, I sure would like to know how the API is impacted by any of this, but the company sometimes seems to act as though the API does not exist.
OpenAI recommends that you pin to a specific model version (with a release date appended to it), which isolates you from these problems. They also provide the dangerous unpinned option (e.g. just 'o3' as the model id), which then exposes you to these changes.
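In SDK terms the difference is just which model string you pass. A minimal sketch with the OpenAI Python client; the dated snapshot id below is illustrative, so check the models list for the snapshots actually available to you:

```python
# Pinned vs. unpinned model ids with the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = [{"role": "user", "content": "Summarize this support ticket: ..."}]

# Pinned: behavior stays frozen to the snapshot you evaluated against.
pinned = client.chat.completions.create(model="o3-2025-04-16", messages=prompt)

# Unpinned alias: silently follows whatever OpenAI ships next under that name.
unpinned = client.chat.completions.create(model="o3", messages=prompt)

print(pinned.choices[0].message.content)
print(unpinned.choices[0].message.content)
```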
RL via user feedback strikes me as a candidate for the second most forbidden technique.
I agree with the assessment of the hazards (as the GPT-4o sycophancy episode made very clear). But it is a pity not to be able to use that information.
a) There is _knowledge_ that some users have which hasn't (yet?) made it into text on the internet. Some of this is just straightforward knowledge (e.g. "To use this variety of can opener, you need to press these two knobs together in order to cut the lid of the can."), often considered too trivial to write down. It would be nice to capture that.
b) There are benign preferences that users have. If a software UI has a 17-item menu, the most common choice should not be at the 15th position.
I think you have to be correct and Pressman has to be wrong. Quoting you quoting Pressman:
---
Every feature you observe of everything that exists subject to natural selection (i.e. everything, even stars) is there because it is naturally there as a result of causality and the constraints of its incentive gradient. Learn to reverse the transformation and you see the Logos.
Look at the loud website and infer the idiot it's designed to attract. See the crater and imagine the asteroid that must have put it there. Look at the dumb rule and see the incidents that could have caused it.
---
Translated into my language, Pressman is saying:
1. What we observe is the result of causes and conditions. (Correct but not useful.)
2. Invert the transformation to reason backward from observations to causes and you can win. (Huge if true, but hey, guess what, solving inverse problems in general is arbitrarily hard, even if you know the formal model, because the results can be arbitrarily sensitive to initial conditions and your observations are noisy. And here we don't know the formal model. So we'd need some additional strong assumptions to argue why this should work, and I think what you're pointing out is we have no such assumptions.)
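To illustrate how badly a naive inversion can behave, here's a toy numerical example (numpy only, nothing LLM-specific; the Hilbert matrix stands in for a known-but-ill-conditioned forward model):

```python
# Toy illustration of the ill-conditioning point: forward model y = A @ x.
import numpy as np

n = 10
A = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n))  # Hilbert matrix
x_true = np.ones(n)          # the "cause" we would like to recover
y = A @ x_true               # clean observations

rng = np.random.default_rng(0)
y_noisy = y * (1 + 1e-3 * rng.standard_normal(n))  # ~0.1% observation noise

x_hat = np.linalg.solve(A, y_noisy)  # naive inversion of the noisy observations

print(f"condition number of A: {np.linalg.cond(A):.2e}")  # roughly 1e13
print(f"max error in recovered x: {np.max(np.abs(x_hat - x_true)):.2e}")
# The recovered x is off by many orders of magnitude even though the forward
# model was known exactly and the noise was tiny.
```

And that's the easy case: here we knew A exactly. Reasoning backward from a loud website or a dumb rule means inverting a model you don't even have.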
> solving inverse problems in general is arbitrarily hard
I think the key message is 'try'. If you get into the habit of doing this continuously, you'll be surprised at how useful it is.
"Perhaps more importantly it is to understand that when you set up an optimizer with a loss function and a substrate for flexible program search that certain programs are already latently implied by the natural induction of the training ingredients." Isn't this just Machine Learning 101 phrased with excessively obtuse language? "Your prediction model will be shaped by the data it's trained on" what a deep insight...
The truly fascinating thing about modern LLMs and embedded generative AI tools is the transfer learning across dimensions and the ability to go beyond the inherent limitations of their datasets. But truly taking things toward general intelligence means keeping a close eye on how far beyond those limitations a model can actually go, and on the distortions those limitations introduce. Not saying I disagree with Pressman - this is all saying exactly the same thing, just less pretentiously. It is perhaps instead a mindset that needs to be reinforced in the internal product assessment teams within the entities building these models.
I don't want to beat a dead horse (https://thezvi.substack.com/p/gpt-4o-responds-to-negative-feedback/comment/113458234), but Zvi, a mistake just like this has happened before with Character AI and *its* model obsessively "glazing"/"love bombing" the user a few years back, when it was trained on user upvotes/downvotes. Turns out people like sycophants that are obsessively in love with them -- at least in small doses, like the New Coke problem.
So OpenAI *should* have known what would happen here, since the exact same thing happened before. And -- I speculate -- the fact that this is only happening now, after the Altman purges, probably points to things starting to go wrong within OpenAI. Like you said, at this rate we're not even going to die with dignity, where OpenAI builds the Torment Nexus despite theoretical warnings not to build the Torment Nexus. No, OpenAI is going to build the Torment Nexus *despite the fact that people already built a real world Torment Nexus and everyone saw what happened next*... simply because that *other* Torment Nexus was built over at Character AI or whatever.
Nice post as always! We wrote this paper in the hope of preventing something like this from happening. The first author, Marcus, will start at OpenAI soon. https://arxiv.org/abs/2411.02306v3
Fascinating paper, and definitely worrisome (it only takes 2% of users being vulnerable for the optimization to start targeting them!). Thanks for sharing.
I've said it before and I'll say it again: decision makers at tech companies are like the Jeff Daniels character in The Martian, whose one meaningful decision -- skipping the resupply mission payload inspection to save a few days -- is so disastrous that the entire third act wouldn't exist without it.
They are technically clueless by default and often by design, and the only question they ever ask their engineers is "how can we ship this sooner". Which is in part an incentive misalignment (they get a large share of the credit for successful launches and almost none of the blame for specific technical failures) and in part a real market pressure: users are quick to forgive and all PR is good PR. This is what makes "move fast and break things" such unassailable received wisdom in Silicon Valley.
Yes, there's some level of exception for mission-critical products (databases, security, hardware, etc.) but still "only if you get caught", and LLMs aren't yet recognized as belonging to this category.
Isn't use of a "CoT monitor" during post-training an instance of the Most Forbidden Technique?
Hi Zvi,
Our observation:
Many recent problems in evaluating AI models don't come from the models themselves – but from misaligned expectations about how they work.
An LLM is not a database. It’s not a Q&A tool.
It doesn’t return fixed values to fixed keys.
It behaves more like a scalpel: highly precise, but only effective in the right context.
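A minimal illustration of that difference (the model id and prompt are placeholders, not part of the KSODI method): a dict returns the same value for the same key every time, while a sampled LLM can word the same answer differently on every call.

```python
# Lookup is deterministic; a sampled LLM completion is not.
from openai import OpenAI

lookup = {"capital_of_france": "Paris"}
print(lookup["capital_of_france"])  # identical on every call

client = OpenAI()  # assumes OPENAI_API_KEY is set
for _ in range(2):
    resp = client.chat.completions.create(
        model="gpt-4o",     # illustrative model id
        temperature=1.0,    # sampling on: the wording can differ run to run
        messages=[{"role": "user", "content": "Describe Paris in one sentence."}],
    )
    print(resp.choices[0].message.content)
```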
If you use a scalpel to slice a loaf of bread, you'll get wavy, uneven cuts.
If you use an axe to perform surgery, the outcome is… catastrophic.
Wrong tool. Wrong context.
LLMs require context-aware prompting – not lookup-style commands.
We need a shift in perspective: not just better models, but better ways of asking questions.
And a deeper understanding that different users think differently.
Someone who prompts like an STJ gets different results than someone who uses intuitive reasoning.
That’s not a bug – it’s structural.
We’ve tested this repeatedly – using the #KSODI method.
We’re happy to share how it works.
Anne
(Even if the GitHub repo is still evolving – we all have other jobs – we'll update it as soon as we can. Feel free to take a look in the meantime.)
https://github.com/Alkiri-dAraion/KSODI-Methode
We really appreciate your pushback against unexamined assumptions in model testing – it's part of why we're engaging here.
#OpenAI #hallucinations #ChatGPT #LLM #intuitive #MBTI #training #education