13 Comments

Full podcast episode for this post. Ironically, ChatGPT was down for a bunch of the time I was making this, which made my normal flow a lot harder. https://open.substack.com/pub/dwatvpodcast/p/ai-96-o3-but-not-yet-for-thee


You say the "smallest integer whose square is between 15 and 30" one replicated for you, but it didn't, did it? The answer you got was "4", which is different from "-5", which is what the linked tweet thread showed.

Seems like there is a minor ambiguity in the question about whether "smallest" means "least" or "smallest in magnitude". Perhaps this is somehow related to why the question is hard for LLMs to answer.


Revision control: doesn't everyone do git diff and/or git status before committing? Oh right, they don't, because nobody checks git commit history to audit who pushed the expensive bugs, and nobody got fired for littering the repo with garbage commits.
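The ritual in question is cheap to run. A minimal sketch (toy repo and file name are hypothetical, just to make it self-contained):

```shell
# Toy repo to demonstrate the pre-commit check (hypothetical file name).
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
echo "fix" > feature.txt
git add feature.txt

# The cheap ritual: review exactly what will enter the commit.
git status --short    # staged vs. unstaged files at a glance
git diff --staged     # the exact diff that will be committed
```

`git diff --staged` (an alias for `--cached`) shows only what is staged, which is what actually lands in the commit, as opposed to plain `git diff`, which shows unstaged changes.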


DeepSeek v2/3 show that the early OpenAI architectural choices (which everyone else has been blindly following) are seriously suboptimal. Uninformed matrix multiplication is not all you need, unless you want to lock a nasty bottleneck into inference.


> In theory they could both be telling the truth - perhaps the Biden administration told Andreessen about this insane plan directly, despite telling him being deeply stupid, and also hid it from Altman despite that also then being deeply stupid - but mostly, yeah, at least one of them is almost certainly lying.

I assume it’s a mix of both of them exaggerating and reading what they expect into ambiguous context. Something like this wouldn’t be put explicitly, and Andreessen knows that. He got some backchannel communication that he interpreted that way, which may or may not have been meant that way, and which may have been an independent thought by an ambitious nobody or a feeler from Biden himself (the latter seems implausible, but who knows). For his part, Altman has been fighting regulation very successfully, and if he ever got a backchannel hinting at something like this, he probably wrote it off as someone stupid dreaming independently.


> voice modes should show transcribed text, and also show a transcript after, and also show what the AI is saying, there is no reason to not do these things.

I think there may be some reason: mixed modality can complicate things. If the input and output tokens in the interaction were voice, not text, then there is translation involved in showing text. It's great translation leaning on wonderful shared embeddings, but I imagine it's possible to say one thing and write another.


Zvi says (https://thezvi.substack.com/p/ai-practical-advice-for-the-worried) that it's morally wrong to work at an AI lab, and thus help push forward the AI frontier, if you are worried about AI doom (which I am).

After watching the Anthropic founders' video, where they seem genuinely trying to be in the AI race in order to *prevent* AI doom, I'm curious: Do you think this advice would apply to Anthropic? Is working for them a force for good?


Personally, I think it's currently a force for good, but one should only work there if they are able to adapt to a changing moral situation, even when said adapting is personally costly.


Long-time reader, first-time commenter. In this post, you discuss the tech job market being bad and the possibility of AI soon taking all of our jobs. Yet you also say that we need more skilled foreign labor via more visas. Is this a contradiction? It seems that way to me.


"I once pushed to production without being careful enough and realized a few hours later I had removed an entire feature"

I recently spent a few days working with Cursor to create a website for inventorying my record collection. I wanted a tailored version of Discogs without the clutter, slowness, and restrictive rules about what can be added to the site. This project also served as an opportunity to explore Cursor's capabilities.

Cursor created an impressive website, far better than what I could have built with my limited website coding skills. However, I encountered similar issues during the process. Sometimes, when adding new features, Cursor would inadvertently remove existing features without any apparent connection. I wouldn't notice these unintended changes until I tried to use a feature that should have been unrelated to my recent modifications.

Cursor also introduced a bug that randomly deleted tracks from the tracks table in the SQLite database powering the site. Fortunately, I had Cursor committing changes regularly to GitLab, although the database itself wasn't included in the repository. To recover the lost data, I had to resort to an old manual backup of the database.
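For anyone in the same spot: a plain file copy of a live SQLite database can catch it mid-write, while the CLI's `.backup` command takes a consistent snapshot. A minimal sketch, where `records.db` and the `tracks` table are hypothetical stand-ins for the site's actual database:

```shell
# Hypothetical database name; create a tiny example table to back up.
tmp=$(mktemp -d) && cd "$tmp"
sqlite3 records.db "CREATE TABLE tracks(id INTEGER PRIMARY KEY, title TEXT);
                    INSERT INTO tracks(title) VALUES ('Blue Train');"

# .backup takes a consistent snapshot even if an app holds the file open,
# unlike a plain cp of a database mid-write.
sqlite3 records.db ".backup 'records-backup.db'"
sqlite3 records-backup.db "SELECT COUNT(*) FROM tracks;"   # -> 1
```

Running that on a schedule (cron, or a pre-commit hook) would have made the manual-backup scramble unnecessary.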

I suspect that these issues might be Cursor's way of indirectly encouraging me to use Cursor to write and run unit tests, to catch such problems early on.

Despite these issues, I was fairly amazed that Cursor managed to create a really good website in just three days – something I could have accomplished on my own, but it would have looked like something on GeoCities in the '90s. And I didn't write any code. I just told it what I wanted. I used it as an English -> node.js translating device.


In MB vs. EY, I think EY is in the wrong here. I would play devil's advocate and argue that in this ever-evolving, AI-fueled landscape, the time investment being described is overstated. Why is finding some prompt or custom instructions deemed so difficult? Maybe it is overlooked that one does not need to change writing styles, only to find a model, system, or person to iterate with and polish the work, i.e. translating the current style into the desired form.

Hmu EY this is really not that difficult. Curious what you would even be optimizing for...


eagerly waiting on that o3 model post.

what is your probability estimate that this eval performance will actually translate to real-world knowledge work task competency?

the current models seem to leave a lot to be desired on this front...

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

https://arxiv.org/abs/2412.14161


I have tried "LLM orchestration spaghetti" and it has created a miserable level of admin work that I would not wish on other people. Is this a normal practice for y'all?
