8 Comments

“If it works well, it's probably not being published.”

I do want some Robin Hood-esque figure to exfiltrate everything and put it all out there. Maybe in a redacted/poisoned form so it can't be (immediately) replicated by criminal mid-wits, but enough so that the full topography of capability space can be understood or better predicted by those of us who have neither the skills nor the desire to swear fealty to some cyberpunk feudal lord.

It's possible that things like GPT-4 or Sora are only unveiled to us on a pretty substantial lag (maybe a fairly long one if you wanna wear a tinfoil hat about it), probably after some relevant parties have squeezed profit from the information differential.

Put differently: are there things that Sam Altman and Demis Hassabis can do today that would look like digital sorcery to us peasants even 6-12 months from now? I'd like to have a better model for that than I currently do.


> Right now I mostly do not bother providing context on my queries. If that happened automatically, it would be a whole different ballgame.

Yeah, this is a drag. I made this little tool (https://gh-repo-dl.cottonash.com/) for my own usage, but anyone is free to use it; it eases the chore a bit. It'll give you all the text files in a folder, ZIP file, or GitHub repo as one large text file, with a button to copy to clipboard. It still requires a few extra steps to do that and paste into Claude. I'm currently planning a way to close the loop and get it automatically inserted into a chat session for you... probably via a browser extension?
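For anyone who wants the same thing without a web tool, the core idea is just concatenating every text file under a root with a path header on each one. Here's a rough Python sketch of that idea (the extension whitelist and size cutoff are my own assumptions, not how gh-repo-dl actually works):

```python
import sys
from pathlib import Path

# Rough sketch: walk a folder and bundle its text files into one big
# string, with a header per file so the model knows where each came from.
# The extension whitelist and size cap below are arbitrary assumptions.
TEXT_EXTS = {".py", ".md", ".txt", ".json", ".toml", ".yaml", ".yml", ".js", ".ts"}
MAX_BYTES = 200_000  # skip huge files

def bundle(root: Path) -> str:
    parts = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in TEXT_EXTS and path.stat().st_size <= MAX_BYTES:
            parts.append(f"----- {path.relative_to(root)} -----\n{path.read_text(errors='replace')}")
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(bundle(Path(sys.argv[1])))
```

Pipe the output into pbcopy/xclip and paste it into a chat; the browser-extension piece mentioned above would just automate that last paste step.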


> Suggestion is the cost is still actually dwarfed by the MLP block, and while generating tokens the cost is no longer n-squared, your marginal cost becomes linear.

I would like to note that this is generally wrong for sub-100B LLMs (within the space of sensible transformer width:depth ratios) above 100k ctx len. A typical llama/mistral layer's prefill forward pass has

mlp flops = $6 \cdot \text{dim} \cdot \text{hidden\_dim} \cdot \text{seqlen}$

att flops = $4 \cdot \text{dim} \cdot \text{seqlen} \cdot \big(\text{seqlen} + \text{dim} \cdot (1 + \text{kvheads}/\text{heads})\big)$

You can plug these equations into a constraint solver to figure out that a model has to be very large for 1M context to be MLP-dominated. The same conclusion applies to token decoding, although that is at least partially alleviated by decoding more often being memory-bandwidth bound rather than FLOPs bound.
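To make the crossover concrete, here's a quick back-of-the-envelope check using the two formulas above; the layer dimensions are my own Llama-2-70B-ish assumptions, not numbers from the comment:

```python
# Compare per-layer prefill FLOPs for the MLP block vs attention,
# using the formulas above. Model shape is a Llama-2-70B-like guess:
# dim=8192, hidden_dim=28672, 64 query heads, 8 KV heads.

def mlp_flops(dim, hidden_dim, seqlen):
    return 6 * dim * hidden_dim * seqlen

def att_flops(dim, seqlen, heads, kvheads):
    return 4 * dim * seqlen * (seqlen + dim * (1 + kvheads / heads))

dim, hidden_dim, heads, kvheads = 8192, 28672, 64, 8

for seqlen in (8_192, 100_000, 1_000_000):
    m = mlp_flops(dim, hidden_dim, seqlen)
    a = att_flops(dim, seqlen, heads, kvheads)
    print(f"seqlen={seqlen:>9,}  mlp/att = {m / a:.2f}")
```

With these assumed dimensions, MLP FLOPs are only ~2.5x attention at 8k context, attention already exceeds the MLP at ~100k, and it dominates by well over 20x at 1M, which is the point above.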


braking mechanism, i.e. something that slows you down, not “breaking mechanism”


> I would ask, what makes you think that you have found the only such circuit? If the model had indeed found a way around your interpretability research, would you not expect it to give you a deception circuit to find, in addition to the one you are not supposed to find, because you are optimizing for exactly that which will fool you?

Maybe I'm misunderstanding, but I don't think the model is in a position to "give you" a circuit to find any more than I am in a position to choose which of my neurons light up in some particular context. Less, if anything, because I can try to take up meditation based on an MRI or whatever, whereas the model has literally no introspection into its own weights until after the point at which they are fixed.


"If I include this list, there's a 10% chance. [...] Every single time I've done this they've said yes." → seems like Dwarkesh Patel is not well calibrated then.
