Discussion about this post

Davis Yoshida:

Read the following comment as the ML equivalent of an "in mice" reply to bio/medicine hype posting:

#15. Some context I think the explanation misses: There have been _tons_ of papers (starting right after BERT blew up) proposing ways of extending transformer context length. Just to name a few, you've got:

1. Transformer-XL

2. The Compressive Transformer

3. The Reformer

4. State space models

5. Approximately a million different efficient-attention variants that reduce the cost Pete's explanation discusses

The thing is, none of these has replaced good old-fashioned (basically) dense attention for large-model pretraining yet, and I don't think the experiments in this paper establish that this one will be the first to do so. It might be, but the question is always just "should you spend your FLOPs on a bigger model/more data, or on longer context?"
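As a rough back-of-envelope illustration of that tradeoff (plain Python; the layer shape, 4x-wide MLP, and constants below are approximate assumptions for illustration, not numbers from the paper or from Pete's post), here is roughly when dense attention starts to dominate per-layer FLOPs as context grows:

```python
# Rough per-layer FLOP estimates for a standard decoder-only transformer layer.
# All constants are approximate; this is a sketch of scaling, not a measurement.

def attention_flops(n_ctx: int, d_model: int) -> float:
    """Dense self-attention: the QK^T and AV matmuls each cost ~2 * n^2 * d FLOPs."""
    return 4 * n_ctx**2 * d_model

def mlp_and_proj_flops(n_ctx: int, d_model: int) -> float:
    """Q/K/V/output projections (~8 * n * d^2) plus a 4x-wide MLP (~16 * n * d^2)."""
    return 24 * n_ctx * d_model**2

if __name__ == "__main__":
    d_model = 4096
    for n_ctx in (2_048, 8_192, 32_768, 131_072):
        attn = attention_flops(n_ctx, d_model)
        rest = mlp_and_proj_flops(n_ctx, d_model)
        share = attn / (attn + rest)
        print(f"n_ctx={n_ctx:>7}: attention is {share:.0%} of per-layer FLOPs")
```

Under these assumed shapes, attention is under 10% of per-layer FLOPs at a 2k context but more than half by 32k, which is why the "bigger model vs. longer context" question gets sharper as context lengths grow.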

Victualis:

I probably missed earlier versions of this point, but I found Eliezer's point in the LeCun debate very important: humans can't participate in consensus protocols with agents when those protocols rely on reasoning about the agents' source code. I'm familiar with that line of work, but it had always seemed like a positive before: look, agents can achieve deterministic consensus with guarantees. This is the first time I understood Eliezer's point that this is bad for humans.

