8 Comments

“If it works well, it's probably not being published.”

I do want some Robin Hood-esque figure to exfiltrate everything and put it all out there. Maybe in a redacted/poisoned form so it can't be (immediately) replicated by criminal mid-wits, but enough so that the full topography of capability space can be understood or better predicted by those of us who have neither the skills nor the desire to swear fealty to some cyberpunk feudal lord.

It's possible that things like GPT-4 or Sora are unveiled to us on a pretty substantial lag (maybe a fairly long one if you wanna wear a tinfoil hat about it), probably after some relevant parties have squeezed profit from the information differential.

Put differently: are there things that Sam Altman and Demis Hassabis can do today that would look like digital sorcery to us peasants even 6-12 months from now? I'd like to have a better model for that than I currently do.


Interested in digging more into your thought about the potential "digital sorcery" that isn't yet publicly available. Samotsvety is working on forecasting something along these lines, but insights / speculation from informed readers here would be really interesting. Personally, I'd guess the biggest models trained at this point would be ~10^26 FLOPs, but I expect there have been many algorithmic advances / tricks around the edges that will improve performance.
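
For a rough sense of scale, assuming the standard $C \approx 6ND$ training-compute approximation and a Chinchilla-style $D \approx 20N$ token budget (both my assumptions, not anything confirmed about frontier runs): $6ND \approx 120N^2 = 10^{26}$ would imply $N \approx 9 \times 10^{11}$ parameters trained on $D \approx 1.8 \times 10^{13}$ tokens. Labs often train smaller models on more tokens than that ratio suggests, so treat this as an order-of-magnitude anchor only.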


> Right now I mostly do not bother providing context on my queries. If that happened automatically, it would be a whole different ballgame.

Yeah this is a drag. I made this little tool for my own usage (https://gh-repo-dl.cottonash.com/), but anyone is free to use it, which eases the chore a bit. It'll give you all the text files in a folder, ZIP file, or GitHub repo as one large text file with a button to copy to clipboard. It still requires a few extra steps to do that and paste into Claude. I'm currently planning a way to close the loop and get it automatically inserted into a chat session for you...probably via a browser extension?
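
For anyone curious what the core of such a bundler looks like, here's a minimal sketch of the local-folder case (a toy version I wrote for illustration, not the actual code behind gh-repo-dl.cottonash.com):

```python
# Toy folder -> single-text-blob bundler. Walks a directory, keeps files
# that decode as UTF-8 text, and emits one annotated blob you can paste
# into a chat window as context.
from pathlib import Path

def bundle_text_files(root: str, max_bytes: int = 200_000) -> str:
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.stat().st_size > max_bytes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        rel = path.relative_to(root)
        chunks.append(f"===== {rel} =====\n{text}")
    return "\n\n".join(chunks)

if __name__ == "__main__":
    print(bundle_text_files("."))
```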


> Suggestion is the cost is still actually dwarfed by the MLP block, and while generating tokens the cost is no longer n-squared, your marginal cost becomes linear.

I would like to note that this is generally wrong for sub-100B LLMs (within the space of sensible transformer width:depth ratios) above 100k ctx len. A typical llama/mistral layer's prefill forward pass has

mlp flops $= 6 \cdot \mathrm{dim} \cdot \mathrm{hidden\_dim} \cdot \mathrm{seqlen}$

att flops $= 4 \cdot \mathrm{dim} \cdot \mathrm{seqlen} \cdot \big(\mathrm{seqlen} + \mathrm{dim} \cdot (1 + \mathrm{kvheads}/\mathrm{heads})\big)$

You can plug these equations into a constraint solver to figure out that a model has to be very large for 1M context to be MLP-dominated. The same conclusion applies for token decoding, although that is at least partially alleviated by decoding more often being memory-bandwidth bound rather than FLOPs bound.
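
For concreteness, here's a quick sketch that plugs example Mistral-7B-ish dimensions (my own illustrative numbers, not ones from the comment) into the two formulas above:

```python
# Per-layer prefill FLOPs, using the mlp/att formulas above with
# example Mistral-7B-style dims (illustrative, not from the comment).
dim, hidden_dim = 4096, 14336
heads, kvheads = 32, 8

def mlp_flops(seqlen):
    return 6 * dim * hidden_dim * seqlen

def att_flops(seqlen):
    return 4 * dim * seqlen * (seqlen + dim * (1 + kvheads / heads))

for seqlen in (8_192, 16_384, 131_072, 1_000_000):
    ratio = att_flops(seqlen) / mlp_flops(seqlen)
    print(f"ctx={seqlen:>9,}  att/mlp = {ratio:.2f}")

# Setting att = mlp and cancelling the common 2*dim*seqlen factor gives the
# crossover: seqlen* = 1.5*hidden_dim - dim*(1 + kvheads/heads) ~= 16,384
# for these dims, i.e. attention dominates well before 100k context.
```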


braking mechanism, i.e. something that slows you down, not “breaking mechanism”


> I would ask, what makes you think that you have found the only such circuit? If the model had indeed found a way around your interpretability research, would you not expect it to give you a deception circuit to find, in addition to the one you are not supposed to find, because you are optimizing for exactly that which will fool you?

Maybe I'm misunderstanding, but I don't think the model is in a position to "give you" a circuit to find any more than I am in a position to choose which of my neurons light up in some particular context. Less, if anything, because I can try to take up meditation based on an MRI or whatever, whereas the model has literally no introspection into its own weights until after the point at which they are fixed.


I think all of our current safety techniques are very tightly coupled to us never getting live learning (or whatever we're calling the idea of updating weights post-training), and not having any capabilities emerging from whatever is in the context window (maybe you erase the deception circuit and the context window effectively adds it back).

Even if both of the above never happen, I definitely agree with Zvi that deception is probably not some feature you can detect and disable. It's such a fundamental part of any communication, and it will arise from any strategy that involves evaluating actions and picking the ones with higher expected value.

I have slightly higher hopes for us finding some feature that measures "would a human consider this output deceptive", but I wouldn't count on being able to use that to build a safe system.


"If I include this list, there's a 10% chance. [...] Every single time I've done this they've said yes." → seems like Dwarkesh Patel is not well calibrated then.
