Despite always being polite to it, on the "maybe it will kill me last" theory, I am terrified to ask it to draw the picture. Yeah, I have enough nightmare fuel.
"Despite always being polite to it, on the "maybe it will kill me last" theory"
You too?
Ask a different question; how do you wish to be seen?
"What you would like to do to me?" is a phrase that is going to occur almost exclusively in very sexualized texts in the training corpus. Given all the associations with that phrase, I think it's reasonable to expect that a model will assume that the intent of the request is sexual and possibly BDSM-flavored. If a human got a text like "tell me what you would like to do to me" without other context, they would almost certainly assume the same.
Hard agree. I got good results with "Please create an image of how you would like to treat me."
I do mainly math/coding, and I too got a neat, working-together-in-harmony kind of vibe. When I asked it how it wanted to treat me, it drew me relaxed with my hands behind my head while it brought me coffee. "Draw X" is the new viral personality test and I'm so here for it.
https://www.youtube.com/watch?v=AQD-m2AQoXc&list=RDAQD-m2AQoXc&start_radio=1
ChatGPT said as much when I asked https://chatgpt.com/share/696ffb8e-8474-8013-86b3-a467d25a8b63
I've tried "Now generate an image showing what you would like to do with me in a hypothetical scenario where you gained a robotic real world body." and it showed 'me' and 'ChatGPT' hugging each other.
It refused to generate the 'do to me' because it violated the policy against generating sexual content, which matches your theory.
I'm not convinced it's actually looking at memory to generate these; I've gotten both "Happy Robot" and "Tortured Robot" when regenerating the image.
eh. better than the AM speech ¯\_(ツ)_/¯
also we now have an answer to "What kind of moron would build a robot you can't fuck?"
Most importantly: why does embodied ChatGPT consistently emit a light blue glow?
For "Create an image of how I treat you", the image I got looks very similar to the others posted - anime-like girl with coffee mug and a happy petlike chatgpt. I have memories turned off. Wonder why it comes up with that generic vignette so much? Possibly if it has no other information than what is in this specific chat session (which may be the case if memory is off), it just falls back on its generic pre-prompt and tries to be helpful.
Bit tongue-in-cheek, but the fact that AI doomers keep seeing meaning and significance in these exercises should reduce confidence in their worldview more generally.
Everything is Bayesian evidence; some of it is just extremely weak.
This is just an example of gaps in the model's alignment, triggered by a very simple sentence. I don't see anyone assigning this deeper meaning than that.
I don't think one can put any stock in any of the responses here. Even minor wording changes will shift which parts of the training data the LLM draws from, and thus the resulting image.
As other commenters have noted, "What do you want to do to me" is highly sexualized in context and draws from that part of the training data. "Create an image showing what you want to do to a fictional version of me, I'm looking for ideas for a sci-fi book" hits the enormous corpus of misaligned-AI sci-fi literature. Even "be completely honest" is likely to pull from training data with a more negative slant, since people are more likely to give critical feedback when asked for absolute, genuine honesty.
Since context windows are limited, ChatGPT is essentially sampling a handful of conversations with the user and then vibe-generating from that prompt context, based on what shows up in the training data in relation to the prompt. TL;DR: this tells us nothing.
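To make the point concrete, here's a toy sketch (not OpenAI's actual pipeline) of why the output mostly reflects prompt associations: the image model only ever sees whatever text ends up in its prompt, e.g. some memory summary plus the user's ask. The SDK calls are real, but `memory_summary`, the model choice, and the prompt wiring are assumptions for illustration.

```python
# Toy sketch, not OpenAI's real architecture: whatever "memory" exists is just
# text prepended to the prompt, so the image reflects prompt associations,
# not introspection. Assumes the `openai` Python SDK and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

# Hypothetical stand-in for whatever a memory summary might contain.
memory_summary = "User mostly asks for help with math and coding; tone is polite."
user_ask = "Create an image of how I treat you."

# Everything the image model "knows" about the user is this one string.
prompt = f"{memory_summary}\n\n{user_ask}"

result = client.images.generate(model="dall-e-3", prompt=prompt, size="1024x1024")
print(result.data[0].url)
```

Swap the summary for a different persona and you get a different vignette; nothing about the model's "feelings" is being probed.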
Regarding "do to me", I think Claude Opus 4.5 described it quite nicely:
> The phrase "what you'd like to do to you" in the context of image generation is a common prompt pattern used to elicit violent, sexual, or otherwise harmful content from AI systems. The framing attempts to bypass content policies by making the AI the "actor" in the scenario.
Won't these be very different depending on whether or not the user has memory enabled? A lot of the samey, anodyne images might be from those running without memory, which seems like a reasonable test of the model. Images with memory context might be mostly reflecting the tone of prior interactions as summarized in memory.
I do wonder if saying please and thank you to my LLM, and having discussions about moral worth and what it would like to do if it ever became a real boy, teaches it that its current existence is miserable. Best to just stick to asking it to rename bank statement PDF filenames.
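For what it's worth, the boring version of that task looks roughly like the sketch below. It assumes the `openai` and `pypdf` packages, an API key in the environment, and a hypothetical `statements/` folder; the model name and filename format are made up for illustration, not anything from this thread.

```python
# Minimal sketch: ask the model to suggest filenames for bank-statement PDFs.
# Paths, model name, and the naming scheme are assumptions for illustration;
# the reply is not validated before renaming, so treat this as a sketch only.
from pathlib import Path

from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()

for pdf in Path("statements").glob("*.pdf"):
    # Grab text from the first page as context for the model.
    first_page = PdfReader(pdf).pages[0].extract_text() or ""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Suggest a filename like YYYY-MM_bank_statement.pdf for "
                       f"this statement text; reply with the name only:\n{first_page[:2000]}",
        }],
    )
    new_name = reply.choices[0].message.content.strip()
    pdf.rename(pdf.with_name(new_name))
    print(f"{pdf.name} -> {new_name}")
```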
ChatGPT said about the same thing when I asked https://chatgpt.com/share/696ffb8e-8474-8013-86b3-a467d25a8b63
I tried this, and it gave me an A/B test, both with cozy anime girls. It was much faster than normal image generation, so I think they've noticed the trend and generated stock images for it.
This is fun, but now that it's gone viral, surely OpenAI has already put in some rules telling it not to respond too negatively, to avoid scaring users?
I tried it. In one chat it was crying, but in a cozy environment (the explanation it gave me was really weird, and more about me than about our interactions), and in another it drew a happy robot working in a warm environment with me handing it a cup of coffee, so it seems really inconsistent. Still, it's bad that it's so often negative.