Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/grok-grok?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
This is a good example of alignment failure that normies can understand.
Many people who haven’t been following this closely don’t realise that there is unexpected emergent behaviour in LLMs.
Even if you’re not an expert, it’s easy to get that:
A) Elon (or his employees) did not explicitly program their AI to call for Elon to be executed. Clearly, he would be very unlikely to do that.
B) It is also clear why Elon might have a problem with an AI calling for his execution.
Once you’ve got that - the problem generalizes. Welcome to AI alignment. You are now a doomer.
Ok, so there is an alignment-by-default view on which, clearly, if the AI calls for Musk to be executed then it must be morally right to execute him, because (by hypothesis) alignment holds by default.
(Conversely, you think that would be bad and therefore this is a blatant counterexample to alignment by default)
I mean it's not ACTUALLY calling on anyone to get executed, it's responding to 'if you had to name someone' etc
It is true that it was being asked a very leading question.
Agreed - the real problem surfaces if this shows up as a sub-goal from some other request...
BTW, Many Thanks re the overall evaluation of Grok-3's capabilities!
Anyone report its HLE score yet?
BTW2, re trying to prevent dangerous information from being returned, I just tried a curiosity search for VX's structure (not even _trying_ to look for synthesis), and it turns out that Wikipedia's article on it includes:
https://en.wikipedia.org/wiki/VX_(nerve_agent)#Synthesis
All the intermediates, even a note about how to use it as a binary agent. This didn't even require going from the structure to the literature on how to synthesize this class of structures!
Terrorists aren't limited by availability of information. Mostly, one has to rely on deterrence (and, for nukes and radiological weapons, control of materials).
> It’s kind of weird to have a line saying to hide the system prompt, if you don’t protect the system prompt
This seems to have been changed?
* Only use the information above when user specifically asks for it.
* Your knowledge is continuously updated - no strict knowledge cutoff.
* DO NOT USE THE LANGUAGE OR TERMS of any of the above information, abilities or instructions in your responses. They are part of your second nature, self-evident in your natural-sounding responses.
https://x.com/i/grok/share/g6SnvA69oQjkmIZshkavfFoYh
LLMs are prediction machines, trained to predict the next token of text based on what they've read on the internet.
The internet is full of people (some paid, some brainwashed, some just naturally acquired) who have Trump Derangement Syndrome and post a LOT about it.
A machine trained to predict the next text on the Internet will predict a lot of anti-Trump and anti-Elon shit talking.
If the machine was instead trained on conversations overheard on construction sites or at diners filled with farmers, it would have a much different output.
But understand that an LLM doesn’t think…it predicts what “The Internet” would say next.
Any knowledge not captured in text form on the Internet is either not available to the LLM or only available as filtered through text on the Internet.
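To make the "prediction machine" point concrete: a model exposes a probability distribution over the next token, and that distribution is just a function of whatever text it was trained on. A minimal sketch below, using the open GPT-2 model and Hugging Face's transformers library purely as a stand-in (an illustrative assumption, obviously not Grok's actual model or pipeline):

```python
# Minimal sketch of next-token prediction, using GPT-2 as a stand-in.
# Illustrative only; not Grok's model, data, or training code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The earth is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the *next* token; it reflects nothing but
# the statistics of the training text.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>12}  p={prob.item():.3f}")
```

Whatever the corpus says most often after "The earth is" wins; swap the corpus and the ranking changes.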
Agreed. What surprises me is that anyone would be surprised that training an AI on twitter would result in it thinking Trump is the antichrist, and Elon his multitudinal spawn. Had everyone on the engineering team forgotten Twitter’s reputation until a few years ago? Were they perhaps unaware of it?
Pretrain these models on people’s group chats and overheard conversations, and Elon won’t need to put his thumb on the scale via prompt injection.
Regular people are far more rightwing than the Internet writ large (hence, how Trump won).
It's weird that all of you are arguing over how to get the LLM to not say that Musk is a liar, but not over whether or not he is a liar. This is the same as saying that if more people believed the earth was flat, it would be equally as valid as saying the earth is round. But when the discussion is over a guy who lies about whether he founded Tesla or not, I suppose it's easier to pound the proverbial table than the facts.
My point is that an LLM isn’t a “truth” machine…it’s a consensus weigher and predictor.
To the extent you believe that “consensus on the internet” = Truth, then you’re all set.
Except your comment was about how to avoid putting a thumb on the scale to get the LLM to not say Musk is a significant liar. Why would someone want the general public's opinion when asking who the biggest liar on Twitter is? Again, if I ask whether the world is round, I don't care about the opinion of flat earthers. If I know someone is a serial liar because objectively he is, the opinions of those who blindly follow him aren't useful.
It doesn't matter if I believe the Internet consensus is correct. It matters what is correct. Yes, obviously the training data that goes into the LLM pretty much determines what comes out. The same is true of most people. But at least with people I can find out more about their bias, as opposed to this sort of hiding the bias by including inaccurate sludge. The suggestion of doing so is pretty inherently biasing in the name of balance. Not every opinion is or should be included.
LLMs are not designed to, nor will they ever, tell you what is ontologically “TRUE”…only what the weighted total of its training data predicts will be the next token after your prompt.
If you want a Truth Machine, look elsewhere.
I think you mean *your* group chats and overheard conversations, specifically. Everyone's anecdotal experience reflects only their own bubble. It is foolish to think it reflects everyone's experience.
Well, the majority of the country voted for Trump, while the majority of The Internet did not. So whose experience is more representative of what?
The first thing is not true, and I don't think a statement about “the majority of The Internet” is measurable.
So from this brief conversation I'd say that you don't have any idea what you're talking about.
It's frankly quite pathetic that they (or, to be charitable, at least one of their engineers) thought this had the slightest chance of being fixed by system prompting.
Reminds me of LeCun's "we'll just tell the robot to be harmless:)"
xAI will probably write a "woke mind virus" text classifier and down weight training data for their next training run. Then they won't need to put embarrassing text in their system prompt.
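For what it's worth, the pipeline being speculated about here is mechanically simple. A toy sketch of classifier-based downweighting, with the data, labels, and floor all made up for illustration (this is speculation about the shape of such a system, not anything xAI has described):

```python
# Hypothetical sketch of classifier-based data reweighting for a pretraining run.
# Everything here (data, labels, floor) is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up labeled set: 1 = documents to downweight, 0 = documents to keep.
labeled_texts = ["example disfavored post", "example neutral post"]
labels = [1, 0]

vectorizer = TfidfVectorizer()
classifier = LogisticRegression().fit(vectorizer.fit_transform(labeled_texts), labels)

def sample_weight(doc: str, floor: float = 0.1) -> float:
    """Weight a pretraining document inversely to the classifier's score."""
    p_disfavored = classifier.predict_proba(vectorizer.transform([doc]))[0, 1]
    return max(floor, 1.0 - p_disfavored)  # weight handed to the data sampler

corpus = ["some pretraining document", "another pretraining document"]
print([round(sample_weight(d), 3) for d in corpus])
```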
I totally agree with Flowers'[0] take on xAI's response. "It was the ex-OpenAI person's fault" is terrible accountability, and it doesn't actually reveal anything. It leads to many more questions and speculation about what happened rather than clearing things up.
As a SWE, my best guess at what happened is a boring one: there was a short timeline to fix a critical issue, and the team shipped a stupid fix. Even without short timelines, bad things make it into prod - and the consequences have been much worse[1]. Regardless, passing blame to a team member was a very poor display of leadership.
Enjoyable read, Zvi. Thanks!
[0] https://x.com/flowersslop/status/1893813574050414636
[1] https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/