22 Comments
Jeff Mohl:

I do think there is an important underlying question here, which is something like "can an LLM instantiate a mechanistic world model?" which seems very hard to answer and also pretty critical. I'd differentiate a world model from something that essentially boils down to a massive list of if-then statements, which can get to similar results in some settings but characteristically fails badly when taken out of the training distribution.

Some kind of world model is more or less required for what we'd like to call "thinking", and it's possible but not obvious that you can build one on top of language. In humans/animals this is very clearly not the case; the world model came first and language is a compressed translation of it. Some other approaches in the past (like running reinforcement learning on games or in virtual environments) have had pretty disappointing results getting to these kinds of internal models.

Getting an answer to this question, in my view, will tell us whether there is some fundamental capability limit to LLMs or whether they can just be scaled forever. So papers that try to do that (which this one could have been) seem very important. It's disappointing when it instead boils down to just trying to dunk on the out-group.
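A toy illustration of that distinction (an invented successor-of-x task, not anything from the paper): a memorized if-then table and a general rule agree on the training range, but only the rule survives outside it.

```python
# Toy illustration (invented example): a memorized list of if-then pairs and a
# general rule agree on the training distribution, but only the rule survives
# outside it.

TRAIN_RANGE = range(0, 100)

# "If-then" policy: a memorized (input -> output) table for "successor of x".
lookup = {x: x + 1 for x in TRAIN_RANGE}

def memorized_successor(x: int) -> int:
    # Works only where a memorized if-then entry exists.
    return lookup[x]

def rule_successor(x: int) -> int:
    # A (trivial) internal model of the task: works for any integer.
    return x + 1

print(rule_successor(10_000))            # 10001, even though 10_000 was never seen
try:
    print(memorized_successor(10_000))
except KeyError:
    print("memorized policy fails out of distribution")
```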

John:

The Othello world model paper was what convinced me that LLMs can indeed form world models. This blog post has a really good analogy (section "A thought experiment") that gives intuitions on why that paper's findings go beyond simply memorizing if-then statements. https://thegradient.pub/othello/

What's maybe a better and still unanswered question is how much training, and in what situations, it takes to get a model that actually forms a world model instead of defaulting to a memorized pattern.
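For what "forming a world model" looks like operationally, here is a minimal sketch of the probing idea behind the Othello result. Everything below is a synthetic stand-in (random activations with a planted linear signal, scored with scikit-learn), not the actual Othello-GPT data; it only shows the shape of the experiment.

```python
# Minimal probing sketch: train a simple classifier to read a "world state"
# variable out of a model's hidden activations. The activations and labels
# here are synthetic stand-ins, not data from Othello-GPT.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, hidden_dim = 5000, 256

# Pretend hidden states; in the real setup these come from the transformer.
hidden = rng.normal(size=(n_samples, hidden_dim))

# Planted "is this board square occupied?" signal along one direction of the
# hidden space, plus noise -- the situation a successful probe would detect.
direction = rng.normal(size=hidden_dim)
occupied = (hidden @ direction + rng.normal(scale=0.5, size=n_samples)) > 0

X_train, X_test, y_train, y_test = train_test_split(hidden, occupied, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High accuracy means the state is linearly decodable from the activations;
# near-chance accuracy would suggest no such representation (for this probe).
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```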

Sinity:

> Or more generally, ‘won’t’ or ‘doesn’t’ [X] does not show ‘can’t’ [X].

"sampling can prove the presence of knowledge but not the absence" ~Gwern

M Flood:

Inside Apple:

Cast:

Tim (Cook, CEO)

John (Giannandrea, Senior Vice President of Machine Learning & AI Strategy)

Scene: Tim Cook's office, mid-morning

Tim: John, my man, how's it going?

John: It's a'ight, Timbo, a'ight

Tim: How's the new Siri coming along?

John: ... (total silence)

Tim: Fuuuccckkkkk...

John: Look, dude, it's like ... it's tough out there okay.

Tim: You mean we got nothin'?

John: Not nothin' ... how do you feel about super duper autocorrect from an on-device 3B model?

Tim: ... (total silence)

John: Or like get this ... create your own emoji - Genmoji, get it - with diffusion. Like Midjourney, but not as good.

Tim (forehead on desk): Fuck, fuck, fuck ...

John: My dude, don't give up ... we've got this study

Tim (sitting up again): What study?

John: Well, like, it's not like OpenAI or DeepMind-like research but ...

Tim: But what?

John: So Samy (Bengio) and the guys got to thinking ... like, what if we, like, asked LLMs to do things that even a human couldn't do, like a 50,000 step puzzle? And then the LLMs failed?

Tim: And they failed?

John: Yeah. If you don't let them write programs like a human would to solve an equivalent problem ...

Tim: Skip that part. Here's the headline "LLMs can't reason, they only pretend to reason." Get that out today.

Sinity:

"System 2" would be more descriptive than "reasoning", maybe.

Tim Dingman:

I was shocked by how many people in the industry ran with this one. The fine-grained control over complexity is nice, but the rest of the results were nothing.

Alex:

This is an interesting example of how illegible the academic publishing world can be to outsiders. As someone who reads a lot of papers, there are some signals that jump out to me, but probably don't seem significant to a typical journalist or "AI fan" on X:

- First, "this paper is from Apple" sends a different signal about credibility to an ML researcher than it does to the general public.

- The lead author is a grad student intern. There are some more senior researchers in the author list, but based on the "equal contribution" asterisk, we are probably looking at an intern project by a grad student researcher and their intern host. This is a signal about the amount of investigative effort and resources behind the paper.

- The meta in ML publishing is to make maximalist claims about the impact and significance of your work. Conference reviewers will typically make you tone it down or qualify your claims, but there is still quite a lot of it in the final papers (which are mostly irrelevant because everyone read your preprint anyway). Everyone has adjusted to expect this and just gloss over those parts when reading papers. This is a preprint, so it hasn't even had an external review process.

If you flip through preprints, most of the big companies put out dozens of papers like this every month - speculative results by small teams, which don't necessarily align with the "beliefs" or goals of the larger company. I think that's mostly a good thing - a good research culture requires making a lot of small bets, most of which won't pan out. But it can be a PR headache when "a grad student intern at Apple posted a preprint about how LLMs behave on puzzles that require long context" is perceived as "Apple says reasoning models don't really work".

Jeffrey Soreff:

"- The meta in ML publishing is to make maximalist claims about the impact and significance of your work. Conference reviewers will typically make you tone it down or qualify your claims, but there is still quite a lot of it in the final papers (which are mostly irrelevant because everyone read your preprint anyway). Everyone has adjusted to expect this and just gloss over those parts when reading papers."

Oww, Oww, OWW, *OWW*

I've grown to expect press releases to be over-hyped, with overly strong claims above and beyond what the actual research work showed, and headlines derived from the press releases to be even worse - but to have maximalist claims in the primary literature itself - *OUCH*

Methos5000:

I'd be more impressed with this line of thinking if you applied it with equal skepticism to businesses selling their latest and greatest widget. You talk about the incentive to make maximalist claims to get published, but what about the incentive to make maximalist claims to get more investment in their company in order to try to become stupidly rich? That's a far more lucrative incentive to overhype things. Instead it ends up sounding a lot like you are looking for a reason not to believe this because you don't like it.

Tyler Corderman:

Thank you.

rxc:

My first trial of an AI was with ChatGPT about 6 months ago, when I asked it for advice about sailing my sailboat from Florida to NYC. This is something I have actually done several times, and I thought I would see if an AI could provide useful hints, tips, and suggestions. I was very careful to use the word "sailboat" several times, and to explain how my wife and I like to make the trip: most of the time in the Intracoastal Waterway, but with a few excursions offshore. I did not try to trick it or describe the issue with any tricky words or phrases.

The first response started with a recommendation to head SOUTH from my area (Stuart, FL), to the Keys and Key West, and then to turn around and head north. I asked why it suggested that, and it apologized that maybe I did not want to do the entire ICW from end to end, and it recommended that I get on I-95 (as in Interstate 95) and head north from my town towards NYC. At this point I gave up.

I came back to it a few months later, and asked the exact same questions, but this time I got more of a travelogue, very general, with almost no nautical information or tips other than general safety recommendations. It also suggested that I start by heading south towards Miami.

I haven't run the questions again, because I would not trust anything that any AI would say about this trip. There was no recognition of the word "sailboat", which should immediately conjure thoughts about (1) something floating in the water, (2) with a mast that sticks up in the air, (3) that needs a minimum amount of water depth to float, and (4) questions about the ability of the crew and the boat to go out into the ocean or deal with traffic and issues along the way. None of these factors were considered. When people ask me about making this trip, these are the very first ones that come up in the discussion, sometimes raised by me, but also by the people who ask.

An AI that does not know to ask these questions cannot, under any circumstances, be considered to be providing any "reasoned advice" about the subject. If AI is going to be useful, it MUST be able to understand the subject matter and develop a reasoned understanding of what is being asked, before it makes any recommendations. Otherwise, it is as useful as an encyclopedia - it will contain lots of knowledge, but absolutely no wisdom about any of that knowledge or how to apply it.

Jonathan Woodward:

Just out of curiosity - which ChatGPT were you using? The publicly available free version without an account is substantially weaker than the models you have access to with an account, so that might have been one contributor to the issue.

rxc:

I don't know the version. Although I have been a computer programmer for nearly 60 years, and started at a university where the first elements of AI took hold, I just watched the various comments about it. When the fad seemed to have taken hold I decided to try it out with something simple (to me). I would have to find and look through the screenshots I took, because the model would not let me save a transcript of the session. That inability to save a transcript was also an indication to me that the AI was not ready for any real use, yet.

I am not a technophobe. I spent my career in the nuclear energy area, where computer models are a very important part. I think my early experience with computers provided me with an important basis for the rest of my career. I am an engineer, and like machines. But I cringe when I see all the hype about using AI in nuclear engineering.

I saw one blurb from Argonne National Laboratory a few weeks ago touting some new AI capability that they had created which would be used to "design, construct, and operate future nuclear power plants". My first thought was that the national laboratory where the very first man-made reactor was built had lost its collective mind, if they were pushing stuff like this. AIs may have SOME sort of role in a nuclear revival, but if I was still working at the NRC I would NEVER allow the sort of involvement that I see being touted. I was once asked to approve a scheme (and I use that word deliberately) to replace a test facility with a computer simulation. The company (Very Large) had sold the test rig and needed something to replace it to validate a different computer model. Their solution was to use one type of computer model to provide "data" to a different type of model to create a correlation. I laughed in their faces and told them to leave and never come back with this idea ever again.

The same company later tried to use a different computer model, with a probabilistic component, to repair a part inside a reactor that had broken. They did not want to replace the broken part. I was not responsible for approving this repair, but I told the people who were responsible that the computer model was speculative, and I would be surprised if it worked, because they did not have enough data or understanding of the physical phenomena which caused the failure. Sure enough, the repairs were made, they started up, and 6 months later they took it apart again, because it had broken. This continued for one more round of computer modeling before they gave up and spent the money to re-design a new component, build it, and install it.

I also had to deal with probabilistic risk assessments of power plants. People who worked for me had to calculate the consequences of various failure scenarios using thermal-hydraulic computer models that had been validated against actual physical tests and real events, and the PRA people wanted to know the consequences if a different set of events were to happen. I had an epiphany when I was asked for a calculation of a scenario that no one had ever calculated before, but which we were pretty sure was benign. I asked the analyst to do the calc, and he looked at me and asked "What answer do you want me to get?" I never expected this person to ever ask that sort of question, so I told him to calculate what he really thought would happen. It turned out to be benign, but the question itself made me realize how malleable our technical bases for safety are. PRAs are the current fad for establishing nuclear safety, and they are founded on hard technical calculations and speculation about human errors. I have no idea what an AI will do with the uncertainties associated with the technical calculations AND the uncertainties associated with the PRAs themselves, AND the uncertainties associated with the human performance aspects, which have been the actual initiating cause of ALL of the "interesting accidents" that have occurred in the nuclear industry for the past 80 years.

Human errors are the greatest unknown in doing PRA calculations, and PRA people acknowledge that they CANNOT calculate many human errors. They especially cannot calculate "cognitive errors of commission" that occur between the ears of human beings. We can speculate about what a person might do wrong, but we will drive ourselves crazy trying to calculate ALL of the possible errors that humans MIGHT be able to make. And the current AIs don't seem to understand what the risks associated with sailing a sailboat are, so I have no hope whatsoever that they will ever be able to predict the next nuclear "interesting event".

Sorry for the long rant.
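To make the PRA-uncertainty point above concrete, here is a toy Monte Carlo sketch with entirely invented numbers: when the human-error term carries a much wider uncertainty band than the hardware terms, it dominates the spread of the final risk estimate.

```python
# Toy sketch (all numbers invented) of why the human-error term dominates the
# uncertainty of a PRA-style estimate when it is far more poorly known than
# the hardware terms.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Simple cut set: accident frequency = initiator * hardware failure * human error.
initiator = rng.lognormal(mean=np.log(1e-2), sigma=0.2, size=n)   # per year, well known
hardware  = rng.lognormal(mean=np.log(1e-3), sigma=0.3, size=n)   # well tested
human     = rng.lognormal(mean=np.log(1e-2), sigma=1.5, size=n)   # poorly known

accident = initiator * hardware * human

for name, s in [("initiator", initiator), ("hardware", hardware),
                ("human error", human), ("accident frequency", accident)]:
    lo, hi = np.percentile(s, [5, 95])
    print(f"{name:20s} 90% band spans a factor of {hi / lo:7.1f}")
```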

Benjamin:

It seemed to give me decent answers (I was using Claude Sonnet 4 with extended thinking). For great answers I would use Opus with a modified system prompt/style or the deep research feature. For ChatGPT it would be O3 or Deep research.

For example, it gave me a list of places where I could land, which seemed to be correct, and then, when I asked, it adjusted the GPS coordinates to be in the ocean instead.

rxc (Jun 11, edited):

Your answer is exactly what I would fear. You have no knowledge of what is being asked, or what is needed to make it happen, but you blithely accept it. A "list of places where I could land" depends on so many different factors, which the AI does not understand or appreciate. The most basic are water depth, clearance under bridges for the vessel, and whether there is a clear, navigable channel into shore for the person who is going to do this. Each of those factors can be determined by a moderately experienced person, but I haven't seen any AI explanation that indicates that the AI is even cognizant of them.

Benjamin:

A result with a more detailed query and adjusted system prompt to experienced sailor: https://claude.ai/share/7943ff73-0373-4b2b-90c3-d87052afe926

What are your thoughts? I have no clue about sailing.

Basically the workflow right now is getting the AI into a character that considers all the relevant details (of course, easier if you actually know something about sailing). If I had to actually create a plan for a sailing trip I would probably pull in some writings on sailing and terminology to shift the distribution further. Right now I used very general instructions which didn't require any knowledge. For the boat I just copied something about the first sailboat I found for sale in Miami.

You could probably further improve the results by iterating: asking for the details it missed, or modifying the first query.
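For anyone who wants to try the same workflow, here is a rough sketch of putting the "experienced sailor" framing in the system prompt using the Anthropic Python client. The model identifier and prompt wording are guesses for illustration, not the exact setup from the linked conversation.

```python
# Rough sketch of the workflow described above: steer the model with a system
# prompt instead of relying on the bare question. Model id and prompt text are
# assumptions, not the exact setup used in the linked Claude conversation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

system_prompt = (
    "You are an experienced offshore and ICW sailor advising a cruising couple. "
    "Before recommending a route, account for vessel draft, mast height and "
    "bridge clearances, crew experience, weather windows, and fuel/provision stops. "
    "Ask clarifying questions when boat or crew details are missing."
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model identifier
    max_tokens=1500,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "Plan a sailboat passage from Stuart, FL to New York City, "
                   "mostly via the ICW with a few offshore legs.",
    }],
)
print(message.content[0].text)
```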

rxc:

This is very different from the guidance I received the first time I asked for it, and it looks pretty good. It talks about the important factors, including especially the weather, mast height, and fuel stops. For me, having done this about 7 times, it is a good summary.

I like the references to publications with specific recommendations for different areas and interests. I did not follow the links, but I will to see how many are useful.

Thanks for doing this. And with you not being a sailor, it is impressive. It is the sort of information that someone doing it for the first time needs to read and hear. The devil will be in the details, but there seem to be good references to follow for details.

I wonder whether it might have been possible for me to have teased out this information when I first asked, or whether the model has "learned" something as a result of my inquiry, or you just spent enough time to tease out the info.

Benjamin:

For sure, I was using a somewhat smarter model, which is better at tool use and thinking, but I expect the biggest shift came from the instructions and system prompt I gave it.

With the LLMs we have the problem that we want them to be able to answer any question in the right way for any person in any context (excluding security concerns).

You could imagine ten different people asking about sailing or category theory who expect completely different answers because of their level of expertise or the context in which they ask the question. As humans we are usually embedded in a very strong context. On the internet this is already reduced, but here, for example, I am in a rationalist space; from that and your posts I have some idea of your level of wealth, where you are from, your education, your mental state (that you are willing to change your mind and have some openness to experience), that you are probably male ...

Now, when a normal user asks a question their context of who they are and what kind of answer they want is incredibly under-determined. The LLMs are super-human in determining the context (depending on how they are finetuned, some of the finetuning just seems to be bad) so we don't realize what an impossible task we put in front of them when we ask them a question. You could be any human, anywhere with any experience asking the question.

If you want to use them as a tool I recommend checking here https://thezvi.substack.com/p/ai-119-goodbye-aisi?open=false#%C2%A7language-models-offer-mundane-utility and here https://www.lesswrong.com/posts/HjHqxzn3rnH7T45hp/do-you-even-have-a-system-prompt-psa-repo .

I (and probably everyone else) am still trying to grasp how all of this works myself.

rxc (Jun 29, edited):

Thanks for your response. I wonder if part of my problem getting a useful response was that I did not provide nearly as detailed a description as you did. I thought that I would ask a general question at the beginning, and that it would ask me all sorts of questions to figure out what I wanted to do, what my limitations and background were, etc.

I used to regulate nuclear power plants, and whenever a company wanted permission to do anything, it started a dialogue between them and the government. They would provide a decent explanation of what they wanted, but it would never be complete, so we had to issue "Requests for Additional Information" and have meetings. Lots of meetings. The dialogue might be short, or it could take years, if the issue was complicated or novel.

I will always remember the final meeting for one particularly challenging/expensive/innovative idea, which we had approved, in a large meeting room with lots of executives sitting around. My people had a major role in this change, and at the very end I asked one innocuous question that I expected to be answered with a simple number.

What I got, instead, was a comment from the "gallery" of tech people who were not sitting at the table, but in chairs along the walls of the meeting room - "I guess we have to tell them." It made the hair on the back of my neck stand up. It turned out that they were going to use this change, in conjunction with a different change that we had approved separately, in a way that we had never considered or evaluated, and that they had never mentioned.

The entire approval process fell apart, at this moment. It took 6 months to sort out what, exactly, they were going to do, and a major part of the work needed to be re-done. And we withdrew approval of the new methodology until we could clarify that it did not apply to the situation that had been revealed.

This is why I am concerned about AI doing anything to do with nuclear plant safety. I have seen all sorts of things like this over 25 years, and cannot imagine that an LLM could ask the simple question I did, receive the same answer, and connect it to an entirely different project, to change my mind about approving it. Only when you really understand something can you ask these sorts of questions and consider the consequences. A real dialogue is needed, not just a correlation of words and phrases.

And it is not just in the nuclear field where this sort of dialogue occurs. It happens with aircraft design, operation, and maintenance, in healthcare, and in all sorts of environmental evaluations. Probably in all sorts of financial issues, too.

My question about sailing north is not quite mundane, but I have had too many people ask me about it, and you can see that they want to do it, but have no idea what is involved. And there have been too many examples of unprepared people doing unwise things beyond their abilities that ended up with helicopter airlifts, search teams, and loss of life. AI responses like the one I got are going to encourage more of this behavior, when it should be educating people about what they need to learn before they set off into the wild blue yonder.

comex:

The quoted o3 response is indeed decent, for an LLM. But it includes multiple hallucinations. Item 2 claims that thinking models spend more FLOPs per inference token than non-thinking models spend per token, due to “attention over a longer context, extra policy heads, etc.”, which I’m pretty sure is nonsense. And item 5 suggests that the search depth shown in Figure 9 should have been used on the X axis in Figure 10, but it already is!

Many of the other points feel a bit logically garbled, like o3 has a valid idea but can’t quite express it correctly. To be fair, humans make these kinds of mistakes too. But if a human wrote that text, I’d think they probably dashed something off without fully thinking it through.

So I’d say that o3’s response actually does show the limits of its ability to reason. Obviously it’s a wild overstatement to say that it can’t reason at all. But it does have a tendency to get confused – just like we see in the paper!

David Bachman:

To me the most damning thing about the paper is the fact that the LLMs tanked at the same place on Tower of Hanoi (ToH) with or without the algorithm given. I think most people would say an intelligent "solution" to ToH is the solution algorithm, not a sequence of steps in any particular instantiation. So they're not actually testing intelligence, they're just testing rule-following. Of course that's going to decline when you hit the token limit.
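To make that concrete, here is a small sketch (not from the paper): the ToH solution algorithm is a few lines of recursion, while the move list it emits grows as 2**n - 1 moves, which is what eventually collides with any fixed output budget.

```python
# Sketch of the point above: the "intelligent solution" to Tower of Hanoi is a
# few lines of recursion, but the move sequence it generates grows as 2**n - 1.

def hanoi(n, source, target, spare, moves):
    """Append the moves that transfer n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)
    moves.append((source, target))
    hanoi(n - 1, spare, target, source, moves)

for n in (3, 10, 20):
    moves = []
    hanoi(n, "A", "C", "B", moves)   # recursion depth is only n
    assert len(moves) == 2**n - 1
    print(f"{n} disks -> {len(moves):,} moves")
```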
