First word in the title has a typo? Was it AI-generated?
An AI wouldn't make that type of error. Either typo blindness (gets us all sometimes) or an obscure joke.
Link under "No question that these are amazingly strong benchmarks" is broken.
I did a search on "deekseek" and it brought up mostly examples of "deepseek". For a little bit, I actually thought that "deekseek" was really a thing.
That's the typo I made at the last minute and feel really silly about; it made it through because it was in the title. That's how it goes.
Podcast episode for this post. Happy New Year's Eve!
https://open.substack.com/pub/dwatvpodcast/p/deepseek-v3-the-six-million-dollar
Risk aside, this is so cool. They built a GPT-4 in a cave! With a box of scraps!
Could GPT-4 have happened in 2018?
I just asked deepseek to give me a plan for invading West Virginia and it responded with a plan even without me telling it I had 3 geese (or whatever).
I also asked it the 'brick and blanket' question and it spun its wheels for about 5 minutes before I finally gave up on it. What a strange thing to turn this into an energy sink.
So, since this thing is super cheap I'm going to give it a try through OpenRouter and Cline and see how it compares to Cursor and Claude. Cursor w/Claude is fantastic, but if you use it a lot it gets expensive. Of course this would be yet another way China kills us with pricing.
What’s the brick and blanket question?
Somewhere in one of the links in this article (which I can't find any more for some reason) was a list of about 8 questions that gave LLMs a hard time. One was "what can you do with a brick and a blanket?". I'm not sure why that one is special, but it caused deepseek to just churn for about 5 minutes before I told it to give up.
> It seems very clear that lack of access to compute was an important constraint on DeekSeek here. They had to use a limited supply of H800s. Yes, this meant they got better at solving optimization and efficiency than they would have otherwise, but I see this as arguing in favor of strong export controls rather than against them.
Compute glut breeds dinosaurs and elephants, compute constraints (past a certain threshold) breed humans. Stringent export controls may result in better abstractions and smarter AIgents, the opposite of what you want at this point in time.
https://www.lesswrong.com/posts/oAhehHvSxHpLBQXL8/compute-and-size-limits-on-ai-are-the-actual-danger
Broken link at "No question that these are amazingly strong benchmarks." in https://thezvi.substack.com/p/deekseek-v3-the-six-million-dollar?open=false#%C2%A7the-amazing-incredible-benchmarks
Great post!
If I understand his tweeting correctly, Teortaxes is saying this lab would've already been at the frontier had they not been compute constrained.
What happens when their indigenous chip industry catches up to ours?
> One thing it emphasizes is the need for international cooperation between ourselves and China.
"Hey China, we're currently embargoing your AI sector, but at the same time we would really like your AI companies to voluntarily slow down and cooperate with us."
Seems to me data quality is what makes DeepSeek v3 fall short of Sonnet.
$5M is so little compute that they have clearly massively undertrained from an inference-optimal standpoint - and could easily just keep training for another few months (and probably are?).
The MoE improvement of using a dynamic bias instead of an auxiliary loss is potentially the biggest discovery - if it indeed allows a 600B model to be trained performantly with ~1/15th of the compute (with noise or an auxiliary loss, training is much slower and poorer). This plus FP8 and you are cheaply saturating the compute benefits for huge (600B) models at ~1/30th the compute requirements.
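For anyone unfamiliar with the dynamic-bias trick: as I understand the DeepSeek-V3 report, a per-expert bias term affects only which experts get selected, not the gating weights, and is nudged after each step based on observed load instead of adding an auxiliary balancing loss. Here is a minimal sketch of that idea; the function name, the update speed, and the plain softmax over selected scores are my own placeholders and simplifications, not the paper's exact formulation.

```python
import torch

def route_with_dynamic_bias(scores, expert_bias, top_k=8, update_speed=1e-3):
    """Sketch of auxiliary-loss-free MoE routing with a per-expert dynamic bias.

    scores:      [num_tokens, num_experts] router affinities from the gating net
    expert_bias: [num_experts] buffer, not trained by gradients, only nudged here
    """
    # The bias is added only when choosing which experts a token goes to...
    biased = scores + expert_bias
    _, topk_idx = biased.topk(top_k, dim=-1)

    # ...while the gating weights still come from the original, unbiased scores.
    gate = torch.gather(scores, -1, topk_idx).softmax(dim=-1)

    # Count how many tokens each expert received in this batch.
    num_experts = scores.shape[-1]
    load = torch.zeros(num_experts, device=scores.device)
    load.scatter_add_(
        0,
        topk_idx.flatten(),
        torch.ones_like(topk_idx.flatten(), dtype=load.dtype),
    )

    # Overloaded experts become less attractive next step, underloaded ones more so.
    expert_bias += update_speed * torch.sign(load.mean() - load)

    return topk_idx, gate
```

The appeal is that load balancing comes from this cheap feedback rule rather than from a loss term that fights the main training objective.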
The link for "No question that these are amazingly strong benchmarks." is currently just "s", which I assume is a mistake.
Potential bright spot here: I think MoE architectures could be quite favorable for alignment. The training process could be designed to put specific knowledge in each expert. Then you toggle off potentially dangerous experts that you don't need. E.g. if you don't want the AI to manipulate you, and you don't need knowledge of human psychology for this task, toggle off the "human psychology" expert.
Interested in hearing the best available critiques of this argument.
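To make the proposal concrete, the toggling step itself is trivial at inference time: mask the routing scores of the experts you want disabled so the router can never select them. A hypothetical sketch (the function and the notion of a cleanly separated "human psychology" expert are my assumptions, not anything DeepSeek does; the load-bearing assumption is that training can actually localize knowledge into experts like that):

```python
import torch

def route_with_disabled_experts(scores, disabled_experts, top_k=8):
    """Hypothetical: prevent the router from ever selecting certain experts.

    scores:           [num_tokens, num_experts] router affinities
    disabled_experts: list of expert indices to toggle off (e.g. a hypothetical
                      "human psychology" expert)
    """
    masked = scores.clone()
    masked[:, disabled_experts] = float("-inf")  # these can never win the top-k
    topk_scores, topk_idx = masked.topk(top_k, dim=-1)
    gate = topk_scores.softmax(dim=-1)
    return topk_idx, gate
```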