First word in title has a typo? Was it AI generated?
An AI wouldn't make that type of error. Either typo blindness (gets us all sometimes) or an obscure joke.
Link under "No question that these are amazingly strong benchmarks" is broken.
(U) Agree, @Zvi, still broken, can you fix?
I did a search on "deekseek" and it brought up mostly examples of "deepseek". For a little bit, I actually thought that "deekseek" was really a thing.
The typo I made at the last minute and feel really silly about; it made it through because it was in the title. That's how it goes.
There is also a typo in the first (second?) subheading: techncially is technically spelt technically
Podcast episode for this post. Happy New Year's Eve!
https://open.substack.com/pub/dwatvpodcast/p/deepseek-v3-the-six-million-dollar
Risk aside, this is so cool. They built a GPT-4 in a cave! With a box of scraps!
Could GPT-4 have happened in 2018?
I just asked DeepSeek to give me a plan for invading West Virginia and it responded with a plan even without me telling it I had 3 geese (or whatever).
I also asked it the 'brick and blanket' question and it spun its wheels for about 5 minutes before I finally gave up on it. What a strange thing to turn this into an energy sink.
So, since this thing is super cheap I'm going to give it a try through OpenRouter and Cline and see how it compares to Cursor and Claude. Cursor with Claude is fantastic, but if you use it a lot it gets expensive. Of course this would be yet another way China kills us with pricing.
What’s the brick and blanket question?
Somewhere in one of the links in this article (which I can't find any more for some reason) was a list of about 8 questions that gave LLMs a hard time. One was "what can you do with a brick and a blanket?". I'm not sure why that one is special, but it caused DeepSeek to just churn for about 5 minutes before I told it to give up.
> It seems very clear that lack of access to compute was an important constraint on DeepSeek here. They had to use a limited supply of H800s. Yes, this meant they got better at solving optimization and efficiency than they would have otherwise, but I see this as arguing in favor of strong export controls rather than against them.
Compute glut breeds dinosaurs and elephants, compute constraints (past a certain threshold) breed humans. Stringent export controls may result in better abstractions and smarter AIgents, the opposite of what you want at this point in time.
https://www.lesswrong.com/posts/oAhehHvSxHpLBQXL8/compute-and-size-limits-on-ai-are-the-actual-danger
Broken link at "No question that these are amazingly strong benchmarks." in https://thezvi.substack.com/p/deekseek-v3-the-six-million-dollar?open=false#%C2%A7the-amazing-incredible-benchmarks
Great post!
If I understand his tweeting correctly, Teortaxes is saying this lab would've already been at the frontier had they not been compute constrained.
What happens when their indigenous chip industry catches up to ours?
> One thing it emphasizes is the need for international cooperation between ourselves and China.
"Hey China, we're currently embargoing your AI sector, but at the same time we would really like your AI companies to voluntarily slow down and cooperate with us."
Seems to me data quality is what makes DeepSeek v3 fall short of Sonnet.
€5M is so little compute that they have clearly massively undertrained from an inference-optimal standpoint, and could easily just keep training for another few months (and probably are?)
The MoE improvement using a dynamic bias instead of an auxiliary loss is potentially the biggest discovery, if it indeed allows a 600B model to be trained to comparable performance with ~1/15th of the compute (with noise or an auxiliary loss, training is much slower and gives worse results). This plus fp8 and you are cheaply saturating compute benefits for huge (600B) models given ~1/30th the compute requirements.
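For anyone who hasn't read the report, here is a minimal numpy sketch of what that bias-based balancing looks like. The key point is that a per-expert bias only steers which experts get selected, while the gating weights still come from the raw router scores, and the bias is nudged after each step based on observed load. The function names, the update rate `GAMMA`, and the batch-level load statistic are illustrative assumptions, not DeepSeek's actual code.

```python
import numpy as np

NUM_EXPERTS, TOP_K, GAMMA = 8, 2, 1e-3
bias = np.zeros(NUM_EXPERTS)  # per-expert routing bias, updated outside backprop

def route(scores):
    """scores: (tokens, experts) router affinities for one batch."""
    # The bias influences only which experts are *selected*;
    # the gating weights are still computed from the raw scores.
    chosen = np.argsort(scores + bias, axis=1)[:, -TOP_K:]
    gates = np.take_along_axis(scores, chosen, axis=1)
    gates = gates / gates.sum(axis=1, keepdims=True)
    return chosen, gates

def update_bias(chosen):
    """After each step, push overloaded experts down and underloaded ones up."""
    global bias
    load = np.bincount(chosen.ravel(), minlength=NUM_EXPERTS)
    bias -= GAMMA * np.sign(load - load.mean())

# toy usage
chosen, gates = route(np.random.rand(64, NUM_EXPERTS))
update_bias(chosen)
```

Because the bias never enters the loss, there is no balancing gradient competing with the language-modeling objective, which (if I'm reading the claim right) is where the training-quality advantage over an auxiliary loss comes from.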
The link for "No question that these are amazingly strong benchmarks." is currently just "s", which I assume is a mistake.
Potential bright spot here: I think MoE architectures could be quite favorable for alignment. The training process could be designed to put specific knowledge in each expert. Then you toggle off potentially dangerous experts that you don't need. E.g. if you don't want the AI to manipulate you, and you don't need knowledge of human psychology for this task, toggle off the "human psychology" expert.
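A minimal sketch of what the "toggle off" part could look like at inference time, assuming a standard top-k router over per-expert logits; the names and the mask-with-minus-infinity trick are my own illustration, not any particular framework's API:

```python
import numpy as np

NUM_EXPERTS, TOP_K = 8, 2
DISABLED = {3}  # e.g. index of the hypothetical "human psychology" expert

def route_with_mask(router_logits):
    """router_logits: (tokens, experts). Disabled experts can never be selected."""
    logits = router_logits.copy()
    logits[:, list(DISABLED)] = -np.inf  # masked experts lose every top-k contest
    chosen = np.argsort(logits, axis=1)[:, -TOP_K:]
    weights = np.take_along_axis(logits, chosen, axis=1)
    weights = np.exp(weights - weights.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)
    return chosen, weights

# toy usage: expert 3 never shows up among the chosen experts
chosen, weights = route_with_mask(np.random.randn(16, NUM_EXPERTS))
assert not np.isin(list(DISABLED), chosen).any()
```

The routing mask itself is trivial; the substantive question (raised in the replies below) is whether the capability is actually isolated in one expert rather than smeared across all of them.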
Interested in hearing the best available critiques of this argument.
I think that the best argument against is that they are all human psychology experts even if they are not specifically playing that role
I appreciate your critique!
Can you elaborate?
As I understand it, MoE training creates a bunch of specialized sub-models which are good at different things. My suggestion is to leverage this property so that human psychology expertise is only located in a restricted few sub-models.
The idea is to essentially prevent the other sub-models from ever having seen any training data related to human psychology. (Or train them in such a way that the result is the same)
It seems much more robust to prevent the capability from ever having existed in a particular sub-model, as opposed to trying to suppress it after the fact, using error-prone methods.
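As a rough illustration of that training-time restriction, assuming each training example carries a topic tag (itself a big assumption), one could hard-constrain the router so that psychology-tagged tokens only ever reach one designated expert, and that expert sees nothing else:

```python
import numpy as np

NUM_EXPERTS = 8
PSYCH_EXPERT = 7  # the only expert permitted to see psychology-related data

def constrain_router(router_logits, is_psych):
    """router_logits: (tokens, experts); is_psych: (tokens,) bool topic tags."""
    logits = router_logits.copy()
    keep = logits[is_psych, PSYCH_EXPERT]
    logits[is_psych, :] = -np.inf              # psych tokens: block every expert...
    logits[is_psych, PSYCH_EXPERT] = keep      # ...except the designated one
    logits[~is_psych, PSYCH_EXPERT] = -np.inf  # other tokens never train it
    return logits

# toy usage with a top-1 router: psych-tagged rows always pick expert 7
logits = constrain_router(np.random.randn(6, NUM_EXPERTS),
                          np.array([True, False, True, False, False, True]))
print(logits.argmax(axis=1))
```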
I'm probably wrong here but my assumption is that one can't really get training data (outside of eg chess engine data) that is not related to human psychology, because all human interactions embed human psychology?
I don't think existing LLMs learn human psychology from the textbooks they have doubtlessly absorbed but from the rest of their data: the tweets, comments, fiction, newspapers etc.
Lots of text is not in the form of human interactions. E.g. random wikipedia pages on physics topics.
But yes, you're correct that text of humans talking about e.g. physics would imply knowledge of both physics and human psychology. So ideally you'd want some method of jointly training a dedicated physics expert and a dedicated psychology expert on such text.
BTW, in some sense I don't think it really matters, because I don't think there is actual demand for alignment ideas. I see concern about alignment as mainly performative -- i.e. if you show an "AI doomer" person a working solution for the alignment problem, they won't get excited or anything, they will just continue to perform the "AI doomer" role. This has been my experience talking with "AI doomers" in the past. It doesn't seem to matter much whether they're able to find a flaw in your alignment idea. They just don't want to solve AI alignment. (Which is a real shame, because it needs solving.)
Re Tyler Cowen's cryptic reference, there was a brief period of people saying that China was not pursuing advanced AI because it might undermine censorship and the CCP. This is one not very good example: https://foreignpolicy.com/2023/03/03/china-censors-chatbots-artificial-intelligence/
> No question that these are amazingly strong benchmarks. That link also explains how to run DeepSeek-v3 locally, and gives you what you need to do that.
That link appears to only link to the letter "S".