Discussion about this post

Amaryllis:

Some additional questions around interpretability:

- How hard is it to understand current transformer-based LLMs?

- How hard is it to understand whatever architecture replaces transformer-based LLMs?

- Are RNNs easier to understand than transformers?

- Are fully binary-quantized combinational-logic models easier to understand than non-quantized models?

- Can existing transformer-based LLMs be fully quantized to combinational logic?

- How easy is positive interpretability (understanding the main thing that a given component does)?

- How easy is negative interpretability (proving that a given component does not implement any other functionality)?

- How much of a GPT-style LLM is a compressed lookup table, versus a weird machine built on top of it, versus a simulation of a human mind built on top of that?

- How does the distribution of simulation layers change with compute/parameters/data/architecture?

- How much do different flavors of reinforcement learning change interpretability?

- How well does interpretability scale with model size?

- Does interpretability improve capabilities?

- Are architectures/training methods that are amenable to interpretability better or worse for capabilities?

- Is good interpretability sufficient for alignment?

Nathan Young:

I think we could run a pretty interesting survey on these if you want. I've been building some tech that makes it easy to get agreement/disagreement on this kind of stuff from many people.
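
A minimal sketch of what aggregating agreement/disagreement on questions like these could look like; the five-point scale, respondent names, question wordings, and aggregation choices here are illustrative assumptions, not the actual tool:

```python
# Sketch: tally agreement/disagreement on interpretability questions
# from many respondents (all names, wordings, and the scale are made up).
from collections import defaultdict
from statistics import mean, pstdev

# Each response is (respondent, question, score) on a -2..+2 scale,
# where -2 = strongly disagree and +2 = strongly agree.
responses = [
    ("alice", "RNNs are easier to understand than transformers", 1),
    ("bob",   "RNNs are easier to understand than transformers", -1),
    ("carol", "RNNs are easier to understand than transformers", 2),
    ("alice", "Good interpretability is sufficient for alignment", -2),
    ("bob",   "Good interpretability is sufficient for alignment", -1),
    ("carol", "Good interpretability is sufficient for alignment", 0),
]

by_question = defaultdict(list)
for _respondent, question, score in responses:
    by_question[question].append(score)

for question, scores in by_question.items():
    # Mean gives the direction of agreement; population standard deviation
    # is a crude proxy for how contested the question is.
    print(f"{question}: mean={mean(scores):+.2f}, "
          f"spread={pstdev(scores):.2f}, n={len(scores)}")
```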
