Discussion about this post

Seta Sojiro:

There seems to be an unbridgeable gulf between people who think that continual learning is important, and people who think it doesn't matter.

I doubt I'll move the needle on this, but I'll try anyway. Expanding context just doesn't work, for several reasons.

First - expanding context length requires a quadratic increase in compute: attention cost grows with the square of the sequence length, so a model with a context 10 times longer needs roughly 100 times more attention compute. Maybe you could get to a few million tokens, but a hundred million? Not possible with the current design of frontier models.
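To make that scaling concrete, here's a toy sketch. The model dimension and the constant factors are illustrative assumptions, not figures from any real frontier model; only the quadratic term matters for the point:

```python
# Toy illustration: self-attention compute scales quadratically with
# context length, so a 10x longer context costs ~100x more attention compute.
def attention_flops(seq_len: int, d_model: int = 4096) -> float:
    # Dominant O(n^2) term only: the QK^T score matrix and the
    # attention-weighted sum over values each touch seq_len^2 * d_model
    # entries. Constant factors are assumed/illustrative.
    return 4 * seq_len**2 * d_model

base = attention_flops(1_000_000)     # ~1M-token context
longer = attention_flops(10_000_000)  # 10x longer context
print(longer / base)  # → 100.0
```

Sparse or linear-attention variants change this exponent, but the comment is about the dense attention used in today's frontier models.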

Second - text is compact, but sensory data runs you out of context very quickly. Just a few seconds of video would exhaust the context length of current models if it were converted into tokens losslessly. In reality, current approaches heavily compress multimodal data - i.e. throw away most of the information - which is a fine trick, but it heavily limits what you can do with it.
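A back-of-the-envelope version of the video claim. Every number here is an assumption chosen for illustration (tokens per frame for a dense encoding, frame rate, context size), not a measured figure from any particular model:

```python
# Assumed figures, all illustrative:
tokens_per_frame = 10_000   # a dense, low-compression encoding of one frame
fps = 30                    # standard video frame rate
context_limit = 1_000_000   # a ~1M-token context window

# How many seconds of video fit before the context is full?
seconds_until_full = context_limit / (tokens_per_frame * fps)
print(seconds_until_full)  # → roughly 3.3 seconds
```

Real multimodal models fit minutes of video only because they compress each frame down to a handful of tokens, which is exactly the information loss the comment is pointing at.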

Third - in-context learning is only really useful for information storage and retrieval. It doesn't allow models to generate new abstractions - new ways of making sense of information. Maybe this isn't a fundamental limitation - maybe it's just an artifact of how current models are trained. Could you create a training environment that forces models to use in-context information to generalize new patterns of thought? Not really, because of the next point.

Fourth - learning is too inefficient. Current pre-training and RL techniques do allow models to learn new abstractions and skills, but they require tens of trillions of tokens - and a large chunk of those are specifically to improve coding and mathematics skills. It's hard to fathom that number: it's the equivalent of tens of millions of textbooks. A human with comparable computer science and mathematical skill has read maybe a few dozen textbooks. So if it takes on the order of trillions of tokens to learn new skills, how much can a model learn within a context length of 1 million? Not much.
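The "tens of millions of textbooks" figure works out like this. Both inputs are rough assumptions (a ~15T-token pre-training corpus, ~500k tokens per textbook), picked only to show the orders of magnitude:

```python
# Assumed, illustrative figures:
pretraining_tokens = 15e12       # ~15 trillion tokens of pre-training
tokens_per_textbook = 500_000    # a long textbook, roughly 350k words

textbook_equivalents = pretraining_tokens / tokens_per_textbook
print(f"{textbook_equivalents:,.0f}")  # → 30,000,000

# And the fraction of that learning signal a 1M-token context represents:
context_window = 1_000_000
print(context_window / pretraining_tokens)  # ~7e-8, i.e. vanishingly small
```

The exact numbers don't matter; the gap between "trillions of tokens to learn a skill" and "a million tokens of context" spans six orders of magnitude either way.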

Amicus:

> Dario worried Silicon Valley and those connected to it could grow at 50% while everyone else grows at not much above the normal 2%. He says that would be ‘a pretty messed up world.’ I think that turns out fine. You tax the part growing at 50%, everyone wins.

Failing to adequately compensate the losers of a Kaldor-Hicks improvement is a completely normal outcome, not an exceptional case that requires the winners to completely "take over".

43 more comments...
