29 Comments

> From page 77, something about the step-by-step math thing has me curious.

>

>> However, if GPT-4 “takes its time” to answer the question then the accuracy easily goes up. For example, if we ask the model to write down the intermediate steps using the following prompt: “What is the value of the following expression? 116 * 114 + 178 * 157 = ? - Let’s think step by step to solve the expression, write down all the intermediate steps, and only then produce the final solution.”, then the accuracy goes to 100% when the numbers are in the interval 1–40 and to 90% for the interval 1–200.

>

> The explanation given by the paper is that this is the model being unable to plan ahead. I’ve heard similar claims elsewhere, and that is one logical way to interpret why step-by-step working succeeds where direct answers don’t. I’m not sure I’d quite describe this as ‘it can’t plan’ so much as ‘it can’t do any steps or planning that aren’t part of its context window’, maybe?
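For reference, the intermediate steps that prompt asks for amount to three small calculations. A quick check in plain Python, added purely for illustration (this is ordinary arithmetic, not anything the model runs):

```python
# The "intermediate steps" the prompt wants the model to write out,
# verified with ordinary integer arithmetic.
a = 116 * 114        # 13224
b = 178 * 157        # 27946
total = a + b        # 41170
print(a, b, total)   # 13224 27946 41170
```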

This feature is one of the things that most strongly gives me the feeling, when interacting with it, that GPT-4 is intelligent and reasoning the way people do. I don't truly understand the ML well enough to say this with confidence, but it feels like there's a distinction between the kinds of logical inference that GPT-4 is doing "under the hood" with self-attention and the more complex things it's able to bootstrap itself into by taking intermediate conclusions and effectively adding them to the prompt. This feels a lot like how I would figure out the answer to 99 * 17:

Thought 1: 99 is just 1 off from 100 (this step is atomic to me and I can't further introspect how I noticed this).

Thought 2: OK, so 99 * 17 is the same as 100 * 17 - 17. (I could explain this in more detail if I needed to, but when actually reasoning, this is atomic.)

Thought 3: 100 * 17 is 1700 (basic manipulation of the numbers).

Thought 4: 1700 - 17 is 1683 (mental arithmetic).

It's not obvious that the right way to solve this problem is with mental math shortcuts rather than going for pen and paper, and if the problem were 64 * 131, just working it out on paper would probably be faster than my (poor) mental arithmetic skills.
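That same shortcut, written out as the explicit intermediate steps a step-by-step prompt would force (a minimal sketch in Python, purely illustrative):

```python
# 99 * 17 via the round-number trick, as explicit intermediate steps.
step1 = 100 * 17          # Thought 3: the easy product -> 1700
step2 = step1 - 17        # Thought 4: remove the one extra 17 -> 1683
assert step2 == 99 * 17   # matches the direct computation
print(step2)              # 1683
```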

The observation that GPT-4 is bad at math actually makes me think of it as more humanlike, because humans are also very, very bad at math. Math is hard for language-based reasoners because of the enormous overhead of using general-purpose cognitive machinery to manipulate numbers rather than shifting bits around or using evolved neural structures that are specialized for a particular math-like calculation.

Rather than saying that GPT can't plan, it's more that it's not good at noticing when a good strategy is to bootstrap its basic level of inferential power into higher-order reasoning by building towards the desired outcome with smaller steps. And then it's seduced by its love of making up bullshit to sound like it knows what it's talking about. But the fact that it can do the bootstrapped higher-order reasoning at all is miraculous! And the process of having to remind a reasoner that she can approach problems that are too complex to solve in one intuitive leap by breaking them down is very familiar from teaching math and reading to my kids. "What's the next word?" "I don't KNOW it's too HARD" "OK, start by sounding it out, I know you know that first sound..."

author

I feel that last part so much. I have spent *so many* hours trying to get around exactly that "It's too HARD" complaint, often in places where I damn well know that it is super doable, or at least doable with practice and with me there to help, and yet the hands get thrown up. My oldest finally got actively good at one of my games, and I'm trying to use that as an example of the value of trying and of Git Gud; maybe that will help?


I've run into this a bunch too and it _seems_, at least for my own 'pupil', like more of a motivation (or even mood) issue than a generalizable argument about the possibility of 'gittin' gud'.

My most effective tactic is precisely the 'What's the next step?' move, e.g. "I know you know that first sound ...", tho sometimes it helps to recursively break down the 'next step' into a smaller chunk at least once too.

The reason why I suspect it's (mostly) an issue of motivation is that I regularly observe WILDLY different results/behavior depending on whether the goal/project is, e.g. 'finish one's homework' versus 'figure out how to do something in a video game'.


The correct way to do this, if you're software, is to access the ALU of the processor you're running on and get the right answer, at least to machine precision, every time and billions of times faster than running it through your model weights.

It seems like the new ChatGPT with alpha features enabled might be able to do this, since they gave it the ability to use tools. But ChatGPT without those features enabled isn't really better than humans at arithmetic even when prompted with algorithmic hints, because most humans know how to use a calculator.
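A minimal sketch of what that delegation could look like, with made-up function names (this is not OpenAI's actual tool interface, just an illustration of routing arithmetic to ordinary code instead of to model weights):

```python
import ast
import operator

# Map a few AST operator nodes to exact integer arithmetic.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def evaluate(node):
    """Safely evaluate a small arithmetic AST (integers and + - * only)."""
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    raise ValueError("unsupported expression")

def answer_arithmetic(expression: str) -> int:
    # The "ALU path": exact and fast, no model weights involved.
    return evaluate(ast.parse(expression, mode="eval").body)

print(answer_arithmetic("116 * 114 + 178 * 157"))  # 41170
```

The hard part, of course, isn't the arithmetic itself; it's getting the model to recognize when to hand the expression off.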

The other difference, of course, is that once you teach a child, the child might actually acquire the skill eventually. You'd have to tell ChatGPT every single time you use it that it needs to do this, since it has no memory and can't update itself. I don't think this is a limitation of the research program so much as the business model. If OpenAI wanted to make just one ChatGPT that would store data and update its internal model as it learned new stuff online, it could do that, but if it wants to be able to serve and charge billions of human customers, it has to deploy huge numbers of scaled-down, inference-only model instances that are immutable and stateless, in order to take advantage of the way HTTP works and the nature of modern load balancing and autoscaling application infrastructure.
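A rough sketch of that stateless pattern, with hypothetical names (not OpenAI's actual serving code): the weights are read-only and the client resends the whole conversation on every request, so any replica behind the load balancer can answer it and nothing persists between calls.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChatRequest:
    # The entire conversation rides along with every request;
    # the server keeps no per-user state between calls.
    conversation: tuple

def generate(frozen_weights, request: ChatRequest) -> str:
    # Inference only: the weights are never updated, so anything you
    # "teach" the model lasts only as long as it stays in the prompt.
    return f"(reply conditioned on {len(request.conversation)} prior turns)"

weights = object()  # stand-in for immutable model parameters
print(generate(weights, ChatRequest(("What is 99 * 17?", "Think step by step."))))
```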

Not trying to really say anything here on whether this counts as "AGI" or not. I would guess it's a trivial task to add some limited tools like web search, but in the long run, OpenAI engineers probably don't want to hard-code a giant lookup table of tool URIs and formalize an API to use them all. They want the model to figure out on its own what tool to use, where to find it, and how to use it. I'm done trying to predict when this happens, though. I'd say they're about 7-8 years ahead of what I would have expected 4 years ago. They're also quite a bit behind what I would have guessed in 1993, but I was 13 and only informed by the skeptic Agent Scully saying a sentient AI that could independently take over a building and successfully plan and execute a bunch of murders was a "few decades" away. I have no idea what the actual expert consensus was back then, but just looking at fiction like 2001, The X-Files, and Westworld, it seems like "30 years from now" is a popular estimate, regardless of when "now" is.


Is there a simple explanation out there for its “love of making up bullshit”? I find it fascinating and suggestive that this is a finding that comes up again and again in split-hemisphere studies in neuroscience: the left hemisphere (specialising in computational reasoning and language) also tends to make up bullshit and stick to it even in the face of overwhelming evidence to the contrary. I’ve had interactions with ChatGPT that feel eerily like I’m talking to a right-hemisphere stroke victim, and I sense some deep connection here. Something to do with inhabiting the map and lacking access to the territory, perhaps...


No, it makes stuff up because of what LLMs are at their core. They are prediction models, not "know stuff" models. They don't know that 2+2=4, or that George Washington was the first president of the US. They take their training data and guess at what would make the best sense in response to the query. If the training data was full of really bad math, they would regularly get that wrong. If the training data said Thomas Jefferson or Walter Mondale were the first president, that's what it would say. Even if the rest of the training data very clearly made it impossible for Walter Mondale to be the first president, it would try to say it anyway.

When trying to complete a particular sentence, an LLM may find that made-up information with made-up references fits the narrative it's forming better, because as far as the model is concerned, those are just as good as real references and information.
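A toy illustration of that point, with invented numbers and nothing like a real LLM's vocabulary or probabilities: the sampler only ever sees relative plausibility, never truth.

```python
import random

# Hypothetical scores a predictor might assign to continuations of
# "The first US president was..." -- all numbers invented for illustration.
continuations = {
    "George Washington.": 0.55,
    "Thomas Jefferson, as documented in Smith (1997).": 0.30,  # fluent, wrong, fake citation
    "Walter Mondale, per the archives.": 0.15,                 # wrong, but still gets probability mass
}

def sample_continuation(probs):
    """Pick a continuation in proportion to its score; truth never enters into it."""
    options, weights = zip(*probs.items())
    return random.choices(options, weights=weights, k=1)[0]

print(sample_continuation(continuations))
```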


Right — and that last bit does sound a lot like “inhabiting the map [narrative] and lacking access to the territory.”


For a lot of people, it will only be "true" AGI when it does something and they can't imagine how it did it. This threshold will keep increasing as the technology gets better, and eventually people will be saying "It's just an elaborate parlor trick" as their leg is disintegrating from the nanobots taking it apart for the AGI's space probes.


But the "something" probably has to be something 'human scale' or 'human interesting', otherwise AlphaFold would have already met that criterion. Tho maybe 'protein folding' is already something almost perfectly inscrutable to these kinds of skeptics, so the idea that a 'program' could 'solve' it isn't evidence of 'intelligence' anyways.


And, for a given definition of “parlour trick” and a given value of “elaborate,” perhaps they’ll be right!


It looks like they Goodharted the definition from 1994, more than anything else. There has been a similar pattern with other metrics, where a specific criticism of something GPT does poorly gets fixed in the next version.

That's not to undersell the capabilities of the program, but it's still a program. It runs a process over a training corpus and would be worthless without that corpus. You can raise a human child in a variety of strange environments, including some with little or even no education, and it will still exhibit intelligence. If you cut GPT-4 off from its training data, it will exhibit zero intelligence.

It's a bit like looking at a textbook for some subject and recognizing that the information contained within that textbook would pass some specified test. Let's say a law textbook and the LSAT. The reason we don't consider the law textbook intelligent is that we recognize that humans wrote the information in the book, and there's no mechanism for the book to independently use or exhibit the information contained within it. That's really the most remarkable thing about GPT-3 and its successors: the ability to understand and respond to text. So you program it with a law textbook, and use the fact that it can understand and respond to questions like those found on the LSAT, and it appears intelligent. But it doesn't know law any better than the textbook does. It just has a way to talk about the information in the textbook.

The true advancement still appears to be the natural language comprehension. That you can plug in a data set and have the model spit out relevant examples from the data set is extremely *useful* but doesn't indicate intelligence.


I think we are discovering that next word/character prediction is probably how most people get through life most of the time, and we call this intelligence.

Sometimes people perform abstract reasoning in their heads and then translate this into language. We also call this intelligence, but it may be an entirely different process.

It’s interesting that Girard is becoming more popular at the same time that LLMs are making so much progress. An LLM’s form of prediction is basically mimesis.


I agree with this in large part. Realizing that most (relatively) 'intelligent' people couldn't, or didn't, perform "abstract reasoning" using models was shocking and disturbing to much-younger me.

There's, of course, much more to our (and other creatures') behavior and 'intelligence', but it IS fascinating how much LLMs can seemingly replicate.


David Deutsch makes the point that the better AI becomes, the further away from AGI it gets. Can it decide not to answer? Can it imagine or guess? No? It’s not AGI, then. (Hear him explain that much better, and so much more that is infinitely interesting besides, on Tim Ferriss’s latest podcast episode.)

author

I mean, I thought it could guess pretty well, no?


I am reviewing this paper as a post for people who are not aware of where AI is today.

So far I agree with you, this is not close to AGI. However, are we discounting the things an LLM can do? What I am saying is: just because a model predicts the next word or number or letter, are we biased into seeing it as "dumb"?

Perhaps we can update to: if a model can predict the next thing very accurately, if it's able to stop at the right moment, and if it has the ability to correct itself, then perhaps it is quite smart, maybe smarter than most humans?


Some specific things that a "thinking", "reasoning", or "intelligent" system would do differently from an LLM:

1) Recognize that what it's saying is logically impossible (some of the bad math it does)

2) Not make up references to non-existent websites or studies (or at least it would know it was doing so, rather than doing it accidentally, as is currently common)

3) Extrapolate a thought beyond a combination of things within the training set (what we might call invention)

So I would say that current LLMs (and likely anything that falls under the term LLM) are by their very nature "dumb." No comment on whether it's "smarter" than most humans, as that has lots of deep implications about how smart many humans are, and how we would really assess that.


From personal testing of GPT-4 with truly absurd theory of mind scenarios that can be nothing like what appears in the training set, and reflecting on a few papers I've read recently (notably the papers about 'grokking' (Alethea Power et al 2022) and apparent (small) world models when trained purely on inputs that describe actions in the world, not the world itself (Li et al's 2022 Othello paper)), it seems likely to me that LLMs like GPT-4 aren't using memorisation or surface statistics to work out how to answer maths or theory of mind questions. I believe they have fragments of world models and specialised algorithms that got developed during training and get called upon to help answer those questions. I'd love to be able to go into research to help find those inner models/algorithms.

Once they can do the same (or better) with as little training data as a human gets in their childhood, and especially once they can intermingle training and inference instead of stopping training once you start using them... It'll become a lot harder for people to believe that these LLMs are just learning surface statistics.


> Once they can do the same (or better) with as little training data as a human gets in their childhood

One large issue is the type of training that is occurring. GPT is effectively being trained the way a child would be if they learned 100% of their knowledge from watching TV, rarely if ever interacting with another human, or even just being physically present and able to observe how another person reacts to anything the child says or does.

Impressive that it can do anything, but qualitatively different in a way that may prove fundamentally faulty. It's also difficult to know just what "volume" of learning actually happens through real-time interaction in the world, versus large-volume, essentially "turn-based" interactions.

> and especially once they can intermingle training and inference instead of stopping training once you start using them

Architecturally, this would be a fundamentally different type of AI from existing LLMs. It would be interesting to see how such an algorithm might function, both in terms of its capabilities and its failure modes, which would likely differ from those of existing LLMs. Previous experiments of that sort have demonstrated large vulnerabilities to targeted attacks (see: Tay), though that's neither a guaranteed issue nor the only possible one.


A conversation I've been having a lot lately concerns the definitional indiscipline in and around AI, for which I think this paper is an interesting case study.

Before there was any notion of AI-as-going-concern, there was an absence of exact consensus as to what an appropriate definition of AI should be, but a lot of conversations would refer to some combo of "full realisation of consciousness; emulation of human brain latency/FLOP output with commensurate energy efficiency; can originate motivation etc." Particularly re: the last point, the entire philosophy of alignment was prompted by the idea that intelligence would naturally lead to agency, which is often a dangerous unknown in a powerful actor.

Then AGI was coined, ostensibly to refer to the above characteristics, as AI itself was 'downgraded', having been mercilessly exploited as a buzzword for a vast range of optimisation solutions during the 2010s. Now AGI's definition is beginning to change again, to denote (as it does in this paper) a system with powerful self-optimising features that can generate outputs of value of several kinds from relatively simple inputs.

I worry that this non-descriptive approach to definitions is not only likely to confuse general analytic faculties, but is also virtually guaranteed to result in misapprehension of both the real risks and the real opportunities in the space. I think this can be seen in, for instance, the continued adherence to the belief that GPT is a vindication of the scaling hypothesis (scaling definitely 'works' to an applied degree, but the system has already far outstripped human performance in key areas without any flickers of consciousness as traditionally conceived having emerged), and in the absence of anyone venturing in detail to imagine how, for instance, GPT as a productivity relativiser could be used to reduce work-hours-per-unit-of-value-created.

There's a mix of epistemological indiscipline, extremely aggressive commercialisation, and complex system engineering at play here, interacting in a way that has probably never been seen before so early in a field's lifespan.


I wonder what your take is on these sentences from the paper:

>Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work.

>With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.


> Section 9 talks about societal impacts and misinformation and bias and jobs and other things we have heard before.

You know, it somehow hadn't quite occurred to me until today that a considerable number of the complaints about LLMs boil down to, "They talk like people, and people are terrible, and so this is Bad. Something must be done."


> "They talk like people, and people are terrible, and so this is Bad. Something must be done."

There are two reasons for this:

1. If a person does something that people don't want to be done, there are mechanisms for changing that behavior, either in the form of reduced access (you did a racism on Twitter, so now you don't get to use Twitter; or you did a murder in the real world, so now you don't get to use the real world and have to live in jail), or attempted behavioral modification (you did a racism on Twitter, so now a bunch of people are going to yell at you in the hopes that it makes you not want to do a racism on Twitter anymore). Whether these are actually effective or not, they are generally considered socially "good enough" for letting people do things in public. We know that neither of these can possibly have any effect on existing LLMs (though maybe they can on their creators or users), since the models don't change after their training period ends, so there's broader concern than there would be for a human.

2. Scale. A single human who does bad things is limited in volume (or at least an attempt is generally made to limit them). Thus CAPTCHAs, bans on owning nuclear weapons, and, in most parts of the world, bans on guns. Regardless of actual effectiveness, the scale of potential harm of a given tool is generally considered alongside its benefits. A massive-scale centralized system that has a large number of the same flaws that are common in ordinary people is a higher risk than, say, a massive-scale centralized system that is merely capable of (but not built on) spreading those same flaws (i.e., an LLM vs. Twitter), because it's easier to find, or be impacted by, those flaws when they're inherent rather than incidental. In terms of scale, it would probably be worth comparing to the backlash Elon got for buying Twitter (because of the presumed or actual risk of giving a "bad person" the much wider reach of, theoretically, every single user on Twitter).


Both reasonable points! It does seem like, to the extent we want to rule certain speech out of bounds, it's more important that it be done for LLMs than for any individual person. At the same time, my comment (mostly snark) was prompted by my sense that there are very many people in the US who want to rule out of bounds a lot of opinions that many of their fellow citizens hold, and who want LLMs to only be able to express *their* opinions. I worry a lot about American culture gradually abandoning pluralism as a terminal goal.

Nevertheless, well-said and I appreciate the thoughtful response to a tossed-off snarky comment that probably didn't deserve it :)


Anyone know why hallucination took off as the de facto term for 'confidently stating a falsehood'? It really isn't the right description at all. It feels closer to 'delusion' or 'lying', but I suppose those terms are equally loaded!


If a human made up citations from whole cloth we would call it a lie, and consider it an attempt to save face or otherwise inflate the speaker's worth in the eyes of others by not having to say "I don't know." It seems the LLMs are being trained not to say "I don't know" even when the alternative is to create completely fake information. Perhaps the reinforcement process is to blame: maybe the human raters are "praising" good-looking answers and not checking whether the cited sources even exist.


Hallucination will always be ‘percept without object’ for me. And since for LLMs there is no percept (is there even an object?), we need a better word!


Bit of a belated response, but as someone who used to work in a psychiatric hospital and encountered this a lot, I've been pushing for 'confabulation'.

https://www.verywellmind.com/confabulation-definition-examples-and-treatments-4177450


That works!
