It's pretty obvious why Qwen finetunes reign supreme: the models are much smaller and hence easier to fine-tune than DeepSeek.
Even the top-of-the-line Qwen3 is just over a third the size of DeepSeek (235 vs 671 billion params; I'll ignore the 480B coder variant because that's a bit of a weird case). And the 4-30B ones are clearly at the open-weight frontier for their respective sizes.
FWIW, I lean towards GLM 4.6 being the top open model these days. It's just over half the size of DS3, though others beat it in some areas: Kimi K2, for one, is certainly the best writer among the open-weight models.
Many Thanks! Interesting report!
I find it frustrating that benchmarks aren't telling us more. GPQA Diamond is up to 87.7% ( https://artificialanalysis.ai/evaluations/gpqa-diamond ) and is nominally a test of "Ph.D.-level" STEM skill, and yet a) I still see substantial errors on my college-senior-level benchmark-ette and b) AI Scientist projects, while coming along nicely, are not as far along as the benchmark scores would suggest.
The 2025 and 2026 predictions are interesting, though I'm having trouble translating what they mean for RSI-readiness, AI-2027-style supercoder readiness, or what we should expect in the way of an intelligence explosion.
Urgent and substantial fundraising is critical for AI safety research, and optimizing the fundraising message and language is key to maximizing return on investment (ROI). Many people, including me, are terrified of a bad outcome. So, consider when...
When targeting AI-technology and high-level applied-mathematician donors who are concerned about AI safety, frame philanthropic appeals around data-driven outcomes: facilitating the continued existence of humanity on the macro level, and the ability to exchange pleasantries like "G'day, mate," among many other things, on the micro level. Rather than asking for a one-size-fits-all amount, provide specific, savvy donation tiers that tie directly to distinct cost underwriting...
When targeting high-net-worth donors from outside the industry who are concerned about AI safety, frame philanthropic appeals around making strategic investments to protect the value of all their other assets, since unchecked AI poses urgent, existential risks. Emphasize infrastructure funding, not just the research. Position the donation as building critical "AI safety infrastructure": upskilling programs for AI engineers, transparency and audit protocols, and oversight institutions. This moves the "ask" from abstract research, which is less well understood and unfortunately underappreciated, to tangible, institutional outcomes...
When targeting a broader general audience of donors (those able to make smaller financial contributions) who are concerned about AI safety, frame philanthropic appeals around emotional storytelling; clear, direct calls to action; the immediate impact of even small donations, with specific cost-underwriting examples; and making the donation process as easy as possible through one or multiple channels...
(sourced from Google AI, with tweaks)
I think Suck Up (https://www.playsuckup.com/) might have had the potential to take off, but its biggest problem was that it wasn't on Steam. There was a Chinese game called Whispers from the Star (https://store.steampowered.com/app/3730100) that did influencer rounds somewhat recently but was definitely far from great. There have been a couple of others, but yeah, they're definitely struggling to get attention, and I agree that we're unlikely to get there through 2026.
Penultimate and last paragraphs are exactly right on all counts, I think.
Ah well, I have three stents, and should have died ten years ago from emphysema like my grandfather. It was a good run. Après nous, le déluge.
What would you point to as recent "vibe shifts"? It could be interpreted a few different ways, and since that last paragraph rang true for you, I wonder how you understood it.
Hey Zvi, I’m curious what prompt you use for eliciting these probabilistic estimates out of models? Do you just ask for a probability or is there some forecasting-fu?