17 Comments

It's definitely a cool model. Gives me hope that RL can help with mode collapse and lost creativity (the single big unsolved issue with AI right now, in my view).

I want to see how it does on "pick a random number"-style tests that expose an LLM's creativity/lack thereof.

>I’m with Eliezer here. That’s still slop.

I agree with both of you, but it still feels like progress in a sense.

I heard someone say "r1 has horrible taste, while other models have no taste at all," which feels directionally pointed at the truth. And that would be better than what we had before. If you have the capacity for bad taste, you might someday have good taste...


I'm going to take the other side and say that while much of it is slop, including 100% of the first paragraph, paragraph 2 and the first three sentences of paragraph 4 are non-slop. "The act of watering a dying plant on a Wednesday" is actually a genuinely striking (if clichéd) turn of phrase.


Correction to this:

> The one catch is that with r1 you do have to pay for the <thinking> tokens. I asked r1 to estimate what percentage of tokens are in the CoT, and it estimated 60%-80%, with more complex tasks using relatively more CoT tokens, in an answer that was roughly 75% within the CoT. If you only care about the final output, then that means this is more like 10 times cheaper than o1 rather than 30 times cheaper. So it depends on whether you’re making use of the CoT tokens.

30x cheaper is correct, because you have to pay for o1's <thinking> tokens too. If you read the fine print on OpenAI's pricing page (https://openai.com/api/pricing/), they say:

> Output tokens include internal reasoning tokens generated by the model that are not visible in API responses.
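To make the arithmetic concrete (both prices below are illustrative assumptions, not quotes from either pricing page): if both APIs bill CoT tokens as output tokens, the cost ratio per visible token doesn't depend on how big the CoT fraction is.

```python
# Illustrative only: both prices are assumptions, not quoted from either provider.
O1_PRICE_PER_M = 60.0   # $/million billed output tokens (hidden reasoning tokens included)
R1_PRICE_PER_M = 2.0    # hypothetical r1 price, set to ~1/30 of the o1 figure

def cost_per_visible_token(price_per_m: float, cot_fraction: float) -> float:
    """Effective $/million *visible* tokens when a given share of billed tokens is hidden CoT."""
    return price_per_m / (1.0 - cot_fraction)

for cot in (0.0, 0.6, 0.75):
    o1 = cost_per_visible_token(O1_PRICE_PER_M, cot)
    r1 = cost_per_visible_token(R1_PRICE_PER_M, cot)
    print(f"CoT share {cot:.0%}: o1 ${o1:.0f}/M visible vs r1 ${r1:.1f}/M visible -> {o1 / r1:.0f}x")
```

Since both providers inflate the bill by the same hidden-CoT factor, the 30x ratio survives whatever the CoT share turns out to be.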


Then how is it that expensive? WTF is happening in there?


First attempt at profit-taking on a moat from OpenAI?


o1 output tokens cost $60 / million, which is the same as the original GPT-4 model, right? Maybe o1 is just not distilled down the way newer models like 4o are? But, hmm, I thought the assumption was that o1 was based on 4o, which undercuts this idea.

o1-mini output tokens are $12 / million, vs. $10 / million for gpt-4o, but only $0.60 for gpt-4o-mini. Does anyone know how general knowledge in o1 and o1-mini compares to 4o and 4o-mini? That might be a hint to the model sizes.

One other data point. OpenAI claims to be losing money on ChatGPT Pro. At $200 / month, and supposing that the o1 API price is currently breakeven, losing money on Pro subscriptions would imply average usage of about 100,000 o1 output tokens per day. That's a lot of usage on a daily average basis, even when including thinking tokens and taking into account these are users who have chosen to opt in to the $200/month option. Of course Pro gets you o1-pro, which is presumably much more expensive to operate, but even so: if they really are losing money on Pro, I'm inclined to guess that these API prices don't include a big profit margin. (Also their past pricing behavior doesn't seem to include big profit margins.)
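Spelling out that back-of-the-envelope calculation (treating the $60/M o1 output price as roughly break-even, which is an assumption):

```python
# Back-of-the-envelope break-even usage for ChatGPT Pro; all inputs are assumptions.
PRO_PRICE_PER_MONTH = 200.0    # $/month
O1_OUTPUT_PRICE = 60.0         # $/million output tokens, assumed roughly break-even
DAYS_PER_MONTH = 30

tokens_per_month = PRO_PRICE_PER_MONTH / O1_OUTPUT_PRICE * 1_000_000
tokens_per_day = tokens_per_month / DAYS_PER_MONTH
print(f"Break-even: ~{tokens_per_day:,.0f} o1 output tokens per day")  # ~111,000/day
```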

There's been speculation that o1 employs some sort of search process but my understanding is that this is now generally believed to not be the case (and we know that r1 gets in-the-ballpark performance without search). Hard to think where $$$ could be going if not to search, larger model size, or profit margin.

...wait, one other thought: o1 outputs will presumably tend to be much longer than 4o outputs, meaning the quadratic attention cost will be higher, maybe that plays a role?
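A toy version of that point, under the naive assumption that attention compute scales with the square of sequence length (the token counts are made up):

```python
# Naive illustration: quadratic attention cost vs. sequence length (token counts are made up).
SHORT_REPLY = 500       # a typical short answer
LONG_WITH_COT = 5_000   # an answer plus a long hidden chain of thought

cost_ratio = (LONG_WITH_COT ** 2) / (SHORT_REPLY ** 2)
print(f"10x more tokens -> roughly {cost_ratio:.0f}x the attention compute")  # ~100x
```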

Definitely confusing!


Random question - has Zvi made predictions about the role of money assuming no doom? I'm not living in the US, kinda broke, and am worried about the implications even if the worst outcome doesn't happen.


I think in most of those worlds it's not too bad to be ~broke in absolute terms, probably better than being ~broke now, but not all of them?


Looking forward to your thoughts on Project Stargate,* Zvi.

*Yes, I too cannot believe they called it that. My main thought was "why the hell did they name it after the CIA's bonkers investigation of remote viewing and psychic phenomena?" [https://en.wikipedia.org/wiki/Stargate_Project]


The fiction is bad in the sense that I would never choose to read it myself, but there are 100% people getting published in the genre- and licensed-fiction space who write at this level or below.


So. Is anyone still interested in writing a $500 billion check to fund OpenAI after learning about DeepSeek? Doesn't this solidly demonstrate OpenAI has no moat? And possibly even that huge compute costs aren't really necessary in the first place?

(I realize we don't see the counterfactual: potentially 999 other labs DeepSeek's size that failed and never published, but still)


You have to assume that since DeepSeek told everyone all the tricks, the companies with the most resources will adopt them - sometimes hiring former DeepSeek employees for the parts not actually published - and in the medium to long run the richest shall prevail, as always.

For example, in the GPU wars, if a company found a way to render frames for 1/10 the compute (and there were many such innovations), it wasn't going to dethrone Nvidia/AMD. Both would buy it out and adopt it.

I think the way you should see it is that a $500B check suddenly got 10x-30x higher ROI, without needing to speculate about algorithmic advances. The same compute budget that would have allowed for 100 million parallel agents now allows for 3 billion.

The Singularity is potentially 7-8 years sooner.


That assumes OAI only cares about training compute. Keep in mind, under their vision, they train AGI (perhaps for relatively cheap) then figure out a way to serve it up for inference to the entire world. DeepSeek might have difficulty scaling up beyond a limited number of customers for inference.


> The cloud-layer censorship makes sense. Remember Kolmogorov Complicity and the Parable of the Lightning. If you force the model to believe a false thing, that is going to cause cascading problems elsewhere.

This doesn't follow though? You need the model to believe "I must not discuss X," not "not-X." And the whole point of Kolmogorov Complicity is that "X" plus "I must not discuss X" is a consistent (but maybe undesirable) world-model for a truth-seeker to hold.

Put it another way: arguendo, generating good LLM content for "not-X" is not an option, so your goal as the censor is to generate refusals to discuss, not spurious content. And you can do that just as easily with RLHF as with post-processing. Western LLMs do it all the time.

The more likely reason this censorship isn't done with RLHF is the opposite: RLHF isn't good enough at changing the model. Some Chinese Pliny will get it to talk about Tiananmen, so you have to implement the layer on top anyway. And as long as you've done it right (which it seems DeepSeek hasn't), you might as well _only_ implement that and not waste energy RLHFing it.
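For what it's worth, the "layer on top" can be as crude as a post-hoc filter over the model's finished output. A hypothetical sketch (the blocklist and refusal text are made up; this is not a claim about DeepSeek's actual implementation):

```python
# Hypothetical cloud-layer filter, sitting entirely outside the model weights.
BLOCKED_TOPICS = ["tiananmen"]            # made-up blocklist for illustration
REFUSAL = "Sorry, I can't discuss that."  # made-up refusal text

def cloud_layer_filter(model_response: str) -> str:
    """Let the model generate freely, then swap in a refusal if a blocked topic appears."""
    if any(topic in model_response.lower() for topic in BLOCKED_TOPICS):
        return REFUSAL
    return model_response
```

The model itself keeps a consistent world-model; the refusal lives entirely in this wrapper.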


'this also tentatively updates me towards this process not scaling that far purely with additional compute? We are seeing the model roughly ‘think like a regular person’ using reasoning techniques within the training distribution in ways you’d expect to commonly find'

Disagree. I don't really even buy that 'next token prediction on human generated data is capped at human level', but even if I did, I wouldn't expect that to apply to the RL approach here. E.g. look at figure 3 in the paper. That's not measuring capability directly, but it's measuring how long a chain of thought the model can productively use, and how that grows with further training. That's not a curve that shows any sign of leveling off anytime soon. Ideally, we'd have a figure directly measuring capabilities over training (and we do have one in figure 2, but that curve levels off only because it's saturating the benchmark, so each further factor-of-10 reduction in error rate makes a smaller and smaller absolute difference as the model approaches 100%). I expect that if they showed the comparable figure for a non-saturated benchmark, we'd see the linear increase continue for the duration of training.
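To spell out the saturation effect with made-up numbers (not figures from the paper): on an accuracy axis, steady multiplicative progress on the error rate looks like a curve flattening toward 100%.

```python
# Made-up illustration: constant 10x error-rate cuts look like diminishing gains near 100% accuracy.
error_rate = 0.40  # arbitrary starting point, not a number from the paper
for _ in range(4):
    next_error = error_rate / 10
    print(f"accuracy {1 - error_rate:.2%} -> {1 - next_error:.2%} "
          f"(absolute gain only {error_rate - next_error:.2%})")
    error_rate = next_error
```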


One of your links (to Angelica) leads to 0.0.0.6. Can you fix?


typo: "If you decentralize of power" twice
