Discussion about this post

hwold:

> He comes back to this idea that knowing things is a disadvantage. I don’t get it.

I think I get it.

He doesn’t mean "for an optimal reasoner, knowing things is a disadvantage", or "for most real-world agents, knowing things is a disadvantage".

He’s saying that in the current paradigm of pretraining => RL, knowing more things from pretraining is a disadvantage, because it gets in the way of RL. RL is supposed to be the thing that trains the cognitive core, but it will always take the shortest path, and often the shortest path is "regurgitate the teacher password from the training data", which is not very helpful for the goal of training the cognitive core. If the model didn't have the teacher password in its training data, it would have to learn to reason its way to the answer.

It’s very similar to your observation "if you don't want to learn, LLMs are a formidable tool for that", but applied to LLMs themselves. Humans can avoid learning by just using LLMs. LLMs can avoid learning to reason by just using memories from training data.

> My prediction is that the cognitive core hypothesis is wrong

It seems clearly right to me? It’s the whole point of learning "mental movements" from LessWrong, like "imagine worlds where the hypothesis is true, imagine worlds where the hypothesis is false, compare". Reasoning is made of steps, the dance is not at all trivial, you have to learn it. This looks like something very generic and compact to me.

Look at all the mistakes LLMs make, like "unable to get out of a prior". That is not a lack of knowledge; it is a clear misstep in the dance.

Seta Sojiro:

"I’m not convinced. As Dario points out, in theory you can put everything in the context window. You can do a lot better on memory and imitating continual learning than that with effort, and we’ve done remarkably little on such fronts."

This is where I agree with Andrei and not with Dario. You can't just put everything in context for two reasons.

The first is quite practical - proto-AGI needs to be able to digest video, or at least sequential images (perhaps a screenshot every second, though preferably more often than that). Even with a context length of 100 million tokens, one-frame-per-second screenshots exhaust the window within hours, and high-quality video at full frame rate can exhaust it within minutes.
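A rough back-of-envelope check of that budget. The per-frame token costs below are assumptions (real vision encoders vary widely), not measured values:

```python
# Sketch: how long until video input fills a 100M-token context window?
# Per-frame token costs are assumptions for illustration only.
CONTEXT_TOKENS = 100_000_000

screenshot_tokens = 1_500   # assumed cost of one low-res screenshot
hq_frame_tokens = 15_000    # assumed cost of one dense high-quality frame

def seconds_until_full(tokens_per_frame, fps):
    """Seconds of footage that fit before the window is exhausted."""
    frames = CONTEXT_TOKENS // tokens_per_frame
    return frames / fps

print(seconds_until_full(screenshot_tokens, 1) / 3600)   # ~18.5 hours at 1 fps
print(seconds_until_full(hq_frame_tokens, 30) / 60)      # ~3.7 minutes at 30 fps
```

Under these assumptions, even a huge window buys hours of screenshots at best, and only minutes of real video - and today's windows are orders of magnitude smaller than 100M tokens.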

The second is more subtle. In-context learning compresses knowledge, but so far it has not been shown to produce deeper understanding. It's not enough to compress knowledge; models need to be able to absorb deeper abstractions and apply them to new tasks. Right now, in-context learning is the equivalent of an amnesiac who has to consult a book before making each decision. That's fine for some tasks (those that are primarily data retrieval), but it's inadequate for many, if not most, tasks.

To give a concrete example - I think it's fair to say a proto-AGI should be able to play a video game. Imagine a very simple sequence: avoiding a goomba in the original Mario. The first time, you run into it and die. The second time, you remember you died, so you think about how to avoid it - you use the jump button. By the third time, you don't even have to think about it - you see the goomba, you automatically jump over it. In the real game, there are dozens of little things you learn like this, and you no longer have to consult your memory of what happened before to solve new problems - you've abstracted the core properties that let you solve new challenges. Simply adding context length will never achieve this sort of learning.
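The distinction can be sketched in a toy way (all names here are hypothetical, just to make the contrast concrete): a pure in-context agent re-reads its whole episode log before every decision, while an agent that abstracts compresses the log once into a rule and stops consulting it:

```python
# Toy contrast: consulting raw memories vs. an abstracted reflex.
episode_log = [
    {"saw": "goomba", "action": "walk", "outcome": "died"},
    {"saw": "goomba", "action": "jump", "outcome": "survived"},
]

def retrieval_agent(observation):
    # Scans the entire log on every decision -- cost grows with history.
    for ep in reversed(episode_log):
        if ep["saw"] == observation and ep["outcome"] == "survived":
            return ep["action"]
    return "walk"

def abstract(log):
    # One-time compression: keep only (observation -> action that survived).
    return {ep["saw"]: ep["action"] for ep in log if ep["outcome"] == "survived"}

reflexes = abstract(episode_log)

def reflex_agent(observation):
    # Constant-time lookup; the raw log is no longer needed at decision time.
    return reflexes.get(observation, "walk")

print(retrieval_agent("goomba"))  # jump
print(reflex_agent("goomba"))     # jump
```

Both agents jump, but only the second has stopped paying for its history - the analogue of "you no longer have to consult your memory of what happened before."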

