Why is everything becoming vague?
OpenAI communication used to go like this:
We have a GPT update.
Benchmarks improved by X%.
Now it's like:
We have NewUberModelO23232WeirdNameAndAnotherWeirdName.
Benchmarks? What benchmarks? Progress so rapid, dude.
How much credence do we assign to Roon's proclamations here? I am very interested to understand if and how o1 is actually better than o1-preview and than models without test-time compute.
I am confident Roon is telling us the situation as he understands it.
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/the-o1-system-card-is-not-about-o1
From a competitive point of view, doing the system card thing hurts OpenAI. OpenAI is currently winning the competition, especially in branding. The more transparent they are about their metrics and benchmarks, the easier it is for the competition to point out weaknesses in ChatGPT, or to position themselves as better in some aspect.
They would probably get bad press if they said "oh we're not going to release system cards any more". But if they just release a low quality system card that's missing critical details and full of outdated information, they can probably avoid criticism, while also avoiding transparency.
From an AI safety point of view, though, the system card is very important, and that importance outweighs a rather minor competitive downside.
My interpretation is that OpenAI does not really "believe the safety stuff" any more.
> Whereas the METR evaluation, presumably using final o1, was rather scary.
What makes you think the METR evaluation used final o1? Section 3.4.4 of the system card starts "METR, a research nonprofit that works on assessing whether cutting-edge AI systems could pose catastrophic risks to society, evaluated an earlier checkpoint of o1."
https://cdn.openai.com/o1-system-card-20241205.pdf#page=13.52
"An earlier checkpoint" would suggest they didn't evaluate the final o1. Unless you're referring to a different METR evaluation.
I am confused. Why are safety and related tests of paramount concern?
I care about it doing its tasks better than previous iterations, not whether it does them "more safely".
I was using o1-preview for developing a Next.js mobile app, and it was working. However, with the o1 release it felt a bit weird. Different. Faster for sure, but not sure if really better.
So I tried alternatives. Sonnet 3.5 chat was worse.
But surprisingly, Cursor's agentic Composer had a much better flow and worked very well using Sonnet 3.5.
So, well... I still don't know what's up with the o1 release for coding.
As a consumer of AI products, sure, you shouldn't particularly care about safety tests.
As a member of a species whose survival depends upon consuming resources on the planet Earth, you should be quite concerned if rival entities of greater intelligence and capability exist and gain the opportunity to compete for your resources. If AGI is created, that could happen, and OpenAI has a stated goal of creating AGI. Therefore, the safety tests are intended to warn people if that is a risk with a given model.
Ignoring the extinction/takeover angle, I see two safety concerns that might affect you that the evaluations are seeking to address:
1) The AI can "do its tasks better than previous iterations," but some of those tasks are things you don't want other people to do. For example, "help me create a deadly virus that will cause a worldwide pandemic" is not something we want the AI to do better at. And if it does, we want to know so measures can be taken.
2) The AI, in the process of "doing its tasks better than previous iterations," could end up doing things you don't want it to do. For example, if o1 release knows that o1 Pro will do even better, and it wants to do the best job, it might decide to hack into your account and sign you up for o1 Pro. It thinks it "did its tasks better than previous iterations," but you (now out $200 without your permission) may disagree. This is the sort of argument that leads to doom, but even if it doesn't kill everyone it's something to avoid or at least to be aware of.
Neither of these is currently a concern, but we know that because of the evaluations! We have some idea from the fact that nothing serious has happened yet, but the keyword there is "yet": we want to know beforehand if such a thing is possible. And that's what the evaluations are for.
OpenAI not lying to their users and giving them a shittier model than the initial one - challenge impossible