Introduction
This post is a highly incomplete list of questions where I either have large uncertainty, have observed strong disagreement with my perspective, or both, and where changing someone’s mind could plausibly change their assessment of how likely a catastrophe from loss of control of AGI is, or how likely such a catastrophe is conditional on AGI being developed.
A well-thought-out, lengthy list of depressing questions/scenarios. Regardless of our analysis, conclusions, and potential AI regulation, does anyone think that countries outside the US Empire (China) will do anything other than race ahead at maximum speed toward AGI? I suspect outcomes will be better if the West “wins”.
Some additional questions around interpretability:
- How hard is it to understand current transformer-based LLMs?
- How hard is it to understand whatever architecture replaces transformer-based LLMs?
- Are RNNs easier to understand than transformers?
- Are fully binary quantized combinational logic models easier to understand than non-quantized models?
- Can existing transformer-based LLMs be fully quantized to combinational logic?
- How easy is positive interpretability (understanding the main thing that a given component does)?
- How easy is negative interpretability (proving that a given component does not implement any other functionality)?
- How much of a GPT-style LLM is a compressed lookup table vs. a weird machine built on top of that vs. a simulation of a human mind built on top of that?
- How does the distribution of simulation layers change with compute/parameters/data/architecture?
- How much do different flavors of reinforcement learning change interpretability?
- How well does interpretability scale with model size?
- Does interpretability improve capabilities?
- Are architectures/training methods amenable to interpretability better or worse for capabilities?
- Is good interpretability sufficient for alignment?
I think we could run a pretty interesting survey on these if you want. I've been building some tech that makes it easy to get agreement/disagreement on this kind of stuff from many people.