Microsoft Research Paper Claims Sparks of Artificial Intelligence in GPT-4

Mar 24, 2023

Microsoft Research (conflict of interest? what’s that?) has issued a 154-page report entitled Sparks of Artificial General Intelligence: Early Experiments With GPT-4, essentially saying that GPT-4 could reasonably be viewed as a kind of early-stage proto-AGI.

This post will go over the paper, and the arguments they offer.

Here is their abstract:

Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4 [Ope23], was trained using an unprecedented scale of compute and data.
In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT4 is part of a new cohort of LLMs (along with ChatGPT and Google’s PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models.
We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT.
Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

The paper is about an early and non-multimodal version of GPT-4. I do not think this much impacted the conclusions.

Their method seems to largely be ‘look at all these tasks GPT-4 did well on.’

I am not sure why they are so impressed by the particular tasks they start with. The first was ‘prove there are an infinite number of primes in the form of a rhyming poem.’ That seems like a clear case where the proof is very much in the training data many times, so you’re asking it to translate text into a rhyming poem, which is easy for it. For a challenge, try to get it to write a poem that doesn’t rhyme. Variations seem similar; these tasks almost seem chosen to be where GPT-3.5 was most impressive.

Introductions don’t actually matter, though. What’s the actual test?

We execute the approach outlined above on a few selected topics that roughly cover the different aptitudes given in the 1994 definition of intelligence, a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience.
[Note: List here is edited to remove detail.]
GPT-4’s primary strength is its unparalleled mastery of natural language. It can not only generate fluent and coherent text, but also understand and manipulate it in various ways, such as summarizing, translating, or answering an extremely broad set of questions.
Coding and mathematics are emblematic of the ability to reason and think abstractly. We explore GPT4’s abilities in these domains respectively in Section 3 and Section 4.
In Section 5, we test the model’s ability to plan and solve problems as well as to some extent to learn quickly and learn from experience by having it play various games (or, flipping the table, simulate a game environment), as well as interact with tools. In particular, the fact that GPT-4 can use tools (including itself) will certainly be of immense importance to build real-world applications with GPT-4.
An important part of our argumentation is that GPT-4 attains human-level performance on many tasks. As such, it is natural to ask how well GPT-4 understands humans themselves. We show several experiments on this question in Section 6.
Throughout the paper we emphasize limitations whenever we found one, but we also dedicate Section 8 to an in-depth analysis of the lack of planning, likely a direct consequence of the autoregressive nature of GPT-4’s architecture.
Finally in Section 9, we discuss the expected societal impact of this early form of AGI, and in Section 10, we share key challenges, directions, and next steps for the field.

This sounds a lot like they set out to make the case for GPT-4 as a proto-AGI, and are making an argument, rather than researchers exploring. That’s fine, it is what it is, so long as they also note limitations and we take it in the spirit it is offered.

I am definitely not on Team Stochastic Parrot. I do still notice the parrot-like shapes.

For example, from the sample Socratic dialogue, it’s clear what GPT-4 is doing:

Aristotle: But isn’t that the same argument you made against rhetoric and oratory?
Socrates: Yes, and I stand by that argument as well.

Don’t get me wrong. It’s a clever trick. Good idea. Points.

The rest of the paper mostly seemed like a standard walkthrough of the things a GPT model might plausibly do. GPT-4 does well, but none of it feels like new information.

The next new information is its ability to use tools if the opportunity is offered. If you say it can use functions like CHARACTER(string, index) or SEARCH(“query”) or CALC(expression), it will do so. That makes sense: if you are giving it the option, the chance that the next token uses that option seems high. A better test would be to set things up so it wasn’t such a leading situation. Still interesting. I’d want to explore more exactly what the boundaries of this might be:

In Figure 5.7, we ask GPT-4 to solve an easy task using a very unusual API. Instead of adapting to the unusual function, GPT-4 calls them as if they were the usual version, i.e., it calls reverse_get_character as if it were get_character, and reverse_concat as if it were a simple concat.

As I understand it, failure to properly deal with negations is a common issue, so reversals being a problem also makes sense. I love the example on page 50, where GPT-4 actively calls out as an error that a reverse function is reversed.
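To make the tool-use setup concrete, here is a minimal sketch of the kind of harness that would sit around the model: it scans the model's output for a call such as CALC(...) or CHARACTER(..., ...) and substitutes the result. The tool names CALC and CHARACTER come from the paper's prompt. Everything else is my assumption and not the paper's implementation: the regex, the `run_tools` and `calc` helpers, zero-based indexing, and the restriction to calls without nested parentheses.

```python
import ast
import operator
import re

# Arithmetic operators the toy CALC tool accepts (an assumption for this sketch).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calc(expression):
    """Evaluate +, -, *, / on plain numbers without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression.strip(), mode="eval").body)


def character(string, index):
    """Return the character at a zero-based index (indexing base is assumed)."""
    return string[index]


# Matches CALC(...) or CHARACTER(...) as long as the arguments contain no parentheses.
TOOL_CALL = re.compile(r"(CALC|CHARACTER)\(([^()]*)\)")


def run_tools(model_output):
    """Replace each recognized tool call in the text with the tool's result."""
    def dispatch(match):
        name, args = match.group(1), match.group(2)
        if name == "CALC":
            return str(calc(args))
        # Parse the arguments of CHARACTER("text", 3) as Python literals.
        string, index = ast.literal_eval(f"({args},)")
        return character(string, index)

    return TOOL_CALL.sub(dispatch, model_output)


print(run_tools("The answer is CALC(116 * 114 + 178 * 157)."))
# The answer is 41170.
```

The point of the sketch is how little the harness does: the model only has to emit the right string at the right moment, which is why a prompt that advertises the tools is such a leading setup.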

The ability I’ve seen most commonly on Twitter is that GPT-4 can track navigation between rooms in a text adventure, and draw a map afterwards. Cool.

In Section 6.1, GPT-4 is then shown to have theory of mind, to be able to process non-trivial human interactions, and to strategize about how to convince people to get the Covid-19 vaccine far better than our government and public health authorities handled things. The rank order is clear: GPT-4’s answer is very good, ChatGPT’s answer is not bad, and the actual answers we used were terrible.

From page 77, something about the math step by step thing has me curious.

However, if GPT-4 “takes its time” to answer the question then the accuracy easily goes up. For example, if we ask the model to write down the intermediate steps using the following prompt: What is the value of the following expression? 116 * 114 + 178 * 157 = ? - Let’s think step by step to solve the expression, write down all the intermediate the steps, and only then produce the final solution. then the accuracy goes to 100% when the numbers are in the interval 1 − 40 and to 90% for the interval 1 − 200.

The explanation given by the paper is that this is the model being unable to plan ahead. I’ve heard similar claims elsewhere, and that is one logical way to interpret step-by-step working where direct answers don’t. I’m not sure I’d quite describe this as ‘it can’t plan’ so much as ‘it can’t do any steps or planning that isn’t part of its context window’ maybe?
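For reference, here is what ‘writing down the intermediate steps’ looks like for the paper’s example expression. This is only a sketch: the function name `solve_step_by_step` is hypothetical, and it says nothing about how GPT-4 computes internally. It just spells out the intermediate results that end up in the context window when the model works step by step.

```python
def solve_step_by_step(a: int, b: int, c: int, d: int) -> tuple[list[str], int]:
    """Evaluate a * b + c * d, recording each intermediate result as text."""
    first = a * b
    second = c * d
    total = first + second
    steps = [
        f"{a} * {b} = {first}",
        f"{c} * {d} = {second}",
        f"{first} + {second} = {total}",
    ]
    return steps, total


steps, answer = solve_step_by_step(116, 114, 178, 157)
for line in steps:
    print(line)
# 116 * 114 = 13224
# 178 * 157 = 27946
# 13224 + 27946 = 41170
```

Each line depends only on numbers already written down, which is the sense in which the steps live in the context window rather than in a plan held in advance.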

Another example of ‘failure to plan’ given on page 79 is a poem whose last sentence has to use the same words as its first sentence, in reverse order.

I suspect this is actually a negation issue. GPT-4 doesn’t know what ‘reverse’ means and it gets confused by the ‘not’ in ‘not grammatically correct’, and I notice that it is pattern matching ‘I saw her smile in the morning light’ to ‘I heard his voice across the crowd’ instead.
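The reversal half of that constraint is simple to check mechanically. Here is a minimal sketch, assuming the task asks for the last sentence to be the first sentence’s words in reverse order. The helper names `words` and `is_word_reversal` are hypothetical, ignoring case and punctuation is my choice, and the sketch does not attempt the harder half, which is whether the result is grammatical.

```python
import string


def words(sentence: str) -> list[str]:
    """Split into words, dropping surrounding punctuation and ignoring case."""
    cleaned = (w.strip(string.punctuation).lower() for w in sentence.split())
    return [w for w in cleaned if w]


def is_word_reversal(first: str, last: str) -> bool:
    """True if `last` uses exactly the words of `first` in reverse order."""
    return words(first) == words(last)[::-1]


print(is_word_reversal(
    "I saw her smile in the morning light",
    "light morning the in smile her saw I",
))
# True
```

Satisfying this check and producing a sentence that reads naturally requires choosing the first sentence with the last one already in mind, which is where the planning comes in.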

I tried to get GPT-4 to give me an assignment for itself that would ask it to write a poem that would test its planning capabilities. My first attempt failed because it instead gave me the request in the form of a poem. On the second attempt it clearly knew what I wanted, yet failed, with its rule being ‘each line is longer than the previous one.’

You can walk GPT-4 carefully through planning things; however, that is the opposite of the ability to plan.

Section 9 talks about societal impacts and misinformation and bias and jobs and other things we have heard before. Mostly it went over all the old material. I did find this chart clearer than past examples of the same issue. I am also sad that my ‘word distribution’ misreading of the second column was wrong; I want that data:

Still not a new concept, and strange (given what we care about these days) that this has not yet been fixed, even partially.

I also notice this continued concern for what this paper calls an “AI divide” where AI is unavailable to many. It seems to me like this is already essentially falsified. You might complain the AI you can access is too restricted or ‘too woke’ or something, but it does not look like AIs are going to be expensive unless you are doing something very intense. Worst case, you’re a little bit behind the curve.

(Yes, at some point that means ‘you are dead’ but that is not an economics problem.)

Does this all add up to a proto-AGI? Is it actually intelligent? Does it show ‘sparks’ of general intelligence, as the paper words it?

I mean, yes, there is that.

Ultimately it depends what you think it means to be an AGI, and how much deeper this particular rabbit hole can go in terms of capabilities developments. All the standard arguments, for and against, apply.

Their discussion about how to make it more intelligent involves incremental improvements, and abilities like confidence calibration, long-term memory and continual learning. The rest of the list: Personalization, planning and conceptual leaps, transparency, interpretability and consistency, improvement on cognitive fallacies and irrationality, challenges with sensitivity to inputs. Continual learning does seem like a potential big step in this. Many others seem to involve a confusion between capabilities that cause intelligence, and capabilities that result from intelligence.

From my perspective, the paper did not answer that question or update me much in either direction. It had some good and useful detail and examples, not that much beyond that. For those less in the loop, it would be more useful, but I have no idea why they would be reading this kind of paper.

I continue to not view GPT-4, while in many ways highly impressive, as being that close to AGI.

Steven Chicoine

For a lot of people it will only be "true" AGI when it does something that they can't imagine how it did it. This threshold will keep increasing as the technology gets better and eventually people will be saying "It's just an elaborate parlor trick" as their leg is disintegrating from the nanobots taking it apart for the AGI's space probes.

Robert Beard

Mar 24, 2023 (edited)

>From page 77, something about the math step by step thing has me curious.

>>However, if GPT-4 “takes its time” to answer the question then the accuracy easily goes up. For example, if we ask the model to write down the intermediate steps using the following prompt: What is the value of the following expression? 116 * 114 + 178 * 157 = ? - Let’s think step by step to solve the expression, write down all the intermediate the steps, and only then produce the final solution. then the accuracy goes to 100% when the numbers are in the interval 1 − 40 and to 90% for the interval 1 − 200.

>The explanation given by the paper is that this is the model being unable to plan ahead. I’ve heard similar claims elsewhere, and that is one logical way to interpret step-by-step working where direct answers don’t. I’m not sure I’d quite describe this as ‘it can’t plan’ so much as ‘it can’t do any steps or planning that isn’t part of its context window’ maybe?

This feature is one of the things that most strongly gives me the feeling that GPT4 is intelligent and reasoning like people do when interacting with it. I don't truly understand the ML well enough to say this with confidence, but it feels like there's a distinction between the kinds of logical inference that GPT-4 is doing "under the hood" with self-attention and the more complex things it's able to bootstrap itself into by taking intermediate conclusions and effectively adding them to the prompt. This feels a lot like how I would figure out the answer to 99 * 17:

Thought 1: 99 is just 1 off from 100 (this step is atomic to me and I can't further introspect how I noticed this).

Thought 2: OK, so 99 * 17 is the same as 100 * 17 - 17. (I can explain this in more detail if I need to, but when actually reasoning, this is atomic.)

Thought 3: 100 * 17 is 1700 (basic manipulation of the numbers).

Thought 4: 1700 - 17 is 1683 (mental arithmetic).
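Those four thoughts amount to one use of the distributive law, a * b = 100 * b - (100 - a) * b. A small sketch of the same chain, with each intermediate conclusion written out before the next one uses it (the helper name `near_hundred_product` is made up for illustration):

```python
def near_hundred_product(a: int, b: int) -> tuple[list[str], int]:
    """Compute a * b via 100 * b - (100 - a) * b, recording each thought."""
    gap = 100 - a                # Thought 1: how far a is from 100
    round_part = 100 * b         # Thought 3: the easy multiplication
    correction = gap * b         # the part to subtract back out
    result = round_part - correction
    thoughts = [
        f"{a} is {gap} off from 100",
        f"{a} * {b} = 100 * {b} - {gap} * {b}",
        f"100 * {b} = {round_part}",
        f"{round_part} - {correction} = {result}",
    ]
    return thoughts, result


thoughts, result = near_hundred_product(99, 17)
print(result)
# 1683
```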

It's not obvious that the right way to solve this problem is by using mental math shortcuts rather than going for pen and paper, and if the problem was 64 * 131, just working it out is probably faster than my (poor) mental arithmetic skills.

The observation that GPT-4 is bad at math actually makes me think of it as more humanlike, because humans are also very, very bad at math. Math is hard for language-based reasoners because of the enormous overhead of using general-purpose cognitive machinery to manipulate numbers rather than shifting bits around or using evolved neural structures that are specialized for a particular math-like calculation.

Rather than saying that GPT can't plan, it's more like it's not good at noticing when a good strategy is to bootstrap its basic level of inferential power into higher-order reasoning by building towards the desired outcome with smaller steps. And then it's seduced by its love of making up bullshit to sound like it knows what it's talking about. But the fact of being able to do the bootstrapped higher-order reasoning is miraculous! And the process of having to remind a reasoner that she can approach problems that are too complex to solve in one intuitive leap by breaking them down is very familiar from teaching math and reading to my kids. "What's the next word?" "I don't KNOW it's too HARD" "OK, start by sounding it out, I know you know that first sound..."

Don't Worry About the Vase
