This week's biggest AI development was the rise of AutoGPT and the concept of turning GPTs into agents via code wrappers that give them memory, plans, and the ability to spin up new instances.
Wolfram isn't concerned about existential risk from AI either.
Or John Carmack.
You have a stray "If one has to do this with" paragraph.
Thanks, fixed.
I'd like to offer my thoughts on a point Eliezer raised regarding knowledge practitioners possess that he may not have access to. I think I've found an example: Reinforcement Learning sucks, like a lot. Major accomplishments still depend on Proximal Policy Optimization (PPO) from 2017. The algos are hilariously sample-inefficient, require near-perfect simulations of the test environment (video games, Go), and are very brittle.
The field continues to optimize for Atari, which essentially involves training and testing on the same data. Improvements on these benchmarks do not appear to transfer to downstream tasks, as they primarily reflect better memorization of the environments. Projects like AutoGPT represent naïve implementations of agents, and the next logical step is to apply RL to them and be disappointed. A truly effective RL algorithm would be alarming and lend support to concerns about rapid advancements, but for now, supervised learning remains the dominant approach. Reinforcement Learning from Human Feedback (RLHF) is mainly a gimmick, and similar results have been achieved with simple supervised fine-tuning.
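To make the "train and test on the same data" point concrete, here is a minimal sketch of the standard Atari recipe, assuming gymnasium, ale-py, and stable-baselines3 are installed (depending on versions you may also need to import ale_py to register the Atari environments); the environment choice and hyperparameters are illustrative only:

```python
# Minimal sketch of the standard Atari RL recipe (illustrative, not a benchmark).
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecFrameStack

# Standard Atari preprocessing: 8 parallel copies of Breakout, 4 stacked frames.
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=8, seed=0)
env = VecFrameStack(env, n_stack=4)

# PPO (Schulman et al., 2017) is still the default workhorse algorithm.
model = PPO("CnnPolicy", env, verbose=0)

# Decent scores typically take on the order of 10M+ frames of experience,
# which is the sample-inefficiency complaint above.
model.learn(total_timesteps=10_000_000)

# "Evaluation" then runs on the very same game the agent just memorized,
# which is the train-on-test concern above.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Breakout mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```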
Whenever I come across catastrophic scenarios that involve any form of model feedback incorporating RL, I can't help but think that's what an outsider might be missing—they are unaware that RL actually sucks.
I want to distinguish (A) "imminent AI catastrophe requiring no new AI ideas" from (B) "AI catastrophe more generally". Your comment seems to be saying that (A)-scenarios involving RL are implausible, but that (B)-scenarios involving RL are totally on the table. ("A truly effective RL algorithm would be alarming…"). If so, I agree with both. And I think Eliezer agrees with both too, from what I gather. So I'm not exactly sure who you're responding to.
For example, the human brain centrally involves RL—much more centrally than does the LLM+RLHF paradigm. And it seems to be not only highly effective for us, but IMO a very important ingredient to human capabilities. So for my part, I put a lot of stock in (B)-scenarios involving RL. (And it's hard to rule out that (B)-scenarios involving RL could happen soon; people are trying every day to make RL work better, and we should be open-minded to the possibility that they might succeed.)
Rereading it, there definitely was some level of screaming into the void in my original comment. The general open-forum discussions are pretty frustrating, and every AI space is now flooded with them.
As for point (B), would you say your estimate of it happening in the near future has changed in 2023?
From my perspective, the field has been somewhat stagnant for the past five years, gradually increasing capabilities through scaling laws. The recent introduction of what is essentially a GUI for half a decade of scaling-law work has created an illusion of rapid change over just a few months, but the jump from GPT-3 to GPT-4 is pretty minor compared to GPT-1 to GPT-2 or GPT-2 to GPT-3. Scaling laws remain a power law, and there's only so much good data and compute.
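For reference, the power law in question is usually written in roughly the Chinchilla form (Hoffmann et al., 2022); the exponents below are the rough published fits, quoted here only as an illustrative sketch of why each further gain costs multiplicatively more data and compute:

```latex
% Rough Chinchilla-style scaling law: loss L as a function of parameter count N
% and training tokens D decays only as a power law.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34, \quad \beta \approx 0.28
```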
There's a scenario "In the near future (say 5 yrs), someone makes an AI system that centrally involves LLMs (or their direct successors) but also maybe has new AI ideas, and it gets sufficiently powerful to pose an x-risk". This is not the scenario that I spend most of my time thinking about, but it's hard to rule out, because "new AI ideas" is a very big space of possibilities, including possibilities that haven't occurred to me. (And people haven't been messing around with LLMs long enough for all the low-hanging fruit to be already picked, I think.) Yeah, I do think I take that scenario more seriously than I did a year ago, even if it's still not my main expectation. (My main expectation is still that LLMs (and LLM-centric systems) will plateau before they pass the x-risk threshold.)
(Maybe it's long been obvious to you that LLMs can get to where GPT-4 is, but that wasn't obvious to me. Granted, I wasn't thinking about it much. But anyway, that's why I take the scenario more seriously than a year ago. That plus peer pressure from all the other alignment researchers. Oops, did I say peer pressure? I meant: "some of my reasonable-seeming colleagues, who have invested much more time than me into thinking about LLMs, seem to take that scenario very seriously, so I should at least stay open-minded to it". Shrug.)
Another week, another roundup, another surge of existential dread.
Man, I really, really hope those scaling curves start bending, because I frankly can't see any way out.
I want to push back a bit against the section about being able to predict the end (because all roads lead to Rome) but not being able to predict details about the journey, because I default to a pretty different lens on this than you do.
In my view, basically what we are trying to do is develop models of the world or parts of the world and then develop hypotheses that draw on those models and test them against each other. My model has all sorts of assumptions -- intelligence is messy, there's no such thing as g, species' competitive advantage might come from many places, AI might become smart in some ways while remaining very dumb for a long time in others, etc. Your model shares some assumptions with mine (like AI will become superhuman in some ways soon, AI is very impressive already, etc.) but has many differences.
Most of what I try to do when I read these posts is try to update my model based on contact with your model and thinking through your arguments. But it is very hard to make any strong changes to my model without falsifiable hypotheses to test against each other. Very much to your credit, you do tend to make a lot more falsifiable hypotheses than Yudkowsky. But I think that the argument that Yudkowsky presents where he knows the end point but not the steps is weak to me. I'm not asking him to spell out the exact steps to AGI -- what would be persuasive to me are predictions based on his model that would be true if what he believes about intelligence etc. were true. It doesn't really matter too much if they are the steps to AGI or something entirely unrelated, since I agree that if all his (and your) assumptions about key points hold true then doom does likely follow. But I think we disagree on many key points/cruxes/model elements/whatever.
This to me is very much in line with the strength of economics -- it makes tons of falsifiable predictions, and part of what makes the Hayekian stuff impressive is not just that it makes sense in your head but that it happens in the real world. For Yudkowsky, I feel like I am getting the part where it makes sense in my head but missing the part where it makes predictions about the real world -- and not just "here are the exact steps where we get to AGI" but predictions of the form "if X about intelligence/competitive advantage/creativity/scaling is true then Y will likely follow, and if Y doesn't happen I will lower my confidence in X". Because I certainly have those, like: if AI starts transforming the economy exponentially within a few years, I will change my model a lot. But I get the sense that very little could happen that would change Yudkowsky's model substantially in his mind, and for me that is a huge weakness, and it reminds me of a lot of failed pre-Enlightenment attempts at good reasoning.
That said, I do appreciate your concrete predictions in the Auto-GPT post; that is absolutely what's missing from the Yudkowsky stuff for me. But I disagree with you/Yudkowsky in the interview section of this one.
I'm not sure there's anything that would change your mind before it's too late.
I don't think we'll live to see "AI starts transforming the economy exponentially within a few years" before we've made a fatal mistake.
I understand that's the Yudkowsky model's take on AGI specifically, but it seems to me that it should be able to make some other predictions that, if the model's assumptions about things like intelligence and competition and so on were correct, would be falsifiable regardless of whether they are specifically about AI progress (which supposedly has no fire alarms, sharp left turns, etc., so it's unfalsifiable). If the model is completely unable to make any falsifiable prediction about the world based on its assumptions, then I think that is a reasonable critique of the model itself.
I agree that being unable to make falsifiable predictions is a reasonable critique, but I don't think it's a 'fatal penalty'.
I think it's more akin to a 'model' claiming that you personally can survive skydiving without using a parachute (or anything similar). That's falsifiable – and in the same way that Yudkowsky's model is – but it's not falsifiable to you, i.e. because you reasonably don't expect to _survive_ long enough to discover that the prediction is false.
I actually disagree with Kenny here, in the sense that given your hypotheses I do expect to be able to find questions that differentiate between our two models, because you've staked out a bunch of other concrete positions where I can look for divergence that I expect to show up earlier. I'd have to think about it for a while to figure out what they'd be in detail. Thing is, I doubt those differences would (deserve to) convince most others with different issues.
What I was trying to establish with the doom/econ section was an existence proof by example, to be clear - not that it's virtuous or good to be unable to make earlier predictions; it's definitely not great on many levels, including convincing other people. More that it makes perfect sense to know something that allows you to predict the end without it being helpful in predicting the middle.
The problem is, I can make tons of predictions and I think my prediction record is pretty good, but those predictions don't causally come from the places I need to convince people on these other points - what I'd be doing is more like 'I am better at predicting than you, which is evidence my other prediction is right' rather than 'X -> A+B+C, I asserted X and predicted A+B+C, and look, A+B, so C.' Which is much better.
So it's a pickle. My guess is that if EY took a bunch of effort to make predictions, even if the overall record looked good, (1) people would just cherry-pick or distort a lot and also more importantly (2) they would call out the predictions as not actually being that connected to the conclusions, so they'd say 'sure you can predict short term changes by being smart, that doesn't mean we have to buy your crazy doom argument.' And they'd be pretty right, there.
Also, this isn't 'fair' or anything, but given how much EY has described his model, shouldn't it be possible for others to find things it predicts?
Makes sense. And yes, I was thinking about the last bit -- although I'm not sure what would be a good one. IQ already does seem to predict wealth, so in a sense that is points for the EY model of intelligence = good resource accumulator (although I have some skepticism about what IQ is measuring).
Purely spitballing here, so these are dumb, but maybe something like: in the next 3 years, nations with higher-IQ leaders will have a higher GDP delta than nations with lower-IQ leaders? No idea how you'd find out people's IQs, though. Or maybe try to find some other similar question that helps show predictive evidence for intelligence = massive competitive edge. Or maybe some sort of prediction that shows that GPT-4 is already superhuman at collaboration and so has an edge that way regardless of intelligence? No idea, it's hard to generate them without holding the model in my head; just trying to steelman it is a bit challenging for me.
And yeah, I do hear you that everyone has different issues, and so maybe convincing individual people is a fool's errand. But it's also hard for me to be convinced any other way! So not sure what to make of all that.
> I have some skepticism about what IQ is measuring
Yudkowsky is definitely NOT conflating whatever 'g' is (i.e. what IQ very much seems to be measuring) with 'intelligence'. He tweeted semi-recently (?) something along the lines of 'Of course charisma is (part of) intelligence, i.e. a brain function – where else is that function performed if not the brain?'.
I think the history of AI is somewhat clearer about that than the recent notable progress, e.g. visual perception IS a kind/form of intelligence, even if the standard/colloquial sense of 'intelligence' doesn't include that kind of 'mundane' functioning. But there IS an important sense in which the most 'dangerous' form of intelligence is something like 'reasoning' and less like, e.g., learning via hand-eye coordination.
Thanks for keeping these up, they're now my main source of AI-related content.
I was wondering semi-jokingly (but also semi-seriously) whether there would be value in alignment-concerned individuals who may not be able to contribute directly to alignment research infiltrating entities working on capability advancement, such as OpenAI, Anthropic, etc., and attempting to exfiltrate information about progress and capabilities, as a sort of early warning system for behind-the-scenes shenanigans, companies not sticking to their alignment commitments, and so on. Sort of a network of alignment whistleblowers, if you will.
Probably a stupid thought but the desperation bag of ideas is getting pretty empty at this point - not that it was ever very full to start with.
Given one is going to work at such places anyway, it is good to do such things.
My prior is that if you join such things to do capabilities work in order to raise such alarms in the future, you are fooling yourself and will end up making things net worse.
I think my thinking on this (which perhaps wasn't clear enough) was restricted specifically to people who do not have the skills to contribute directly to either alignment or capability work but might be able to find roles in such organisations that are privy to that sort of information. An example would be technical operations roles, though I imagine even product managers and salespeople would potentially become aware of bits of information worth exfiltrating. I guess indirectly they would still be contributing to capabilities work, since that's what the companies are geared towards at present, but I'm weighing whether having the ability to exfiltrate useful safety information might counterbalance the capabilities contribution one makes in a non-direct role.
Anyway, it's a pretty wild line of thinking, but as you say, IF one is going to work at such places anyway, exfiltrating information could at least serve to mitigate some of the harm done.
There is independent support for Eliezer's assessment that LLM research currently doesn't involve anything especially deep, but does require overcoming a shedload of technical difficulties with short half-lives. For instance, appendix A of the GLM-130B paper from 6 months ago (https://arxiv.org/abs/2210.02414) contains an awesome list of unsuccessful things they tried and describes how those difficulties were overcome.
Couple comments.
1) The 'gesture at' bar for implicit knowledge makes demonstrating it in a discussion really hard, as identifying 'X is a kind of implicit knowledge' is essentially gesturing at X. Thinking for just a second, there seem to be two ways we could identify the existence of implicit knowledge. First, by looking at the past we could try to identify factors that accounted for divergent performance that were not known or identified at the time but have since been recognized. This is still gesturizable though, just in hindsight (which may actually be what a valid gatekeeping attempt is looking for?). Second, we could try a sort of 'Total Factor Productivity' approach, where we attempt to account for differences in performance between individuals after removing the effect of all gesturizable knowledge. This seems quite hard to do in practice (the growth-accounting identity the analogy borrows from is sketched below, after point 2).
2) One feature of the AI debate is that there is this gap between those like EY (and I think yourself?) who regard fatal AI as overdetermined versus others who don't have a belief about the nature of future AI (or some who think benevolent AI is the attractor state). It's interesting to me how much of this seems to be determined by beliefs about what humans are. The overdetermined view seems strongly implied by the beliefs that humans are only accidentally special, our intellect is some dumb kludge, its physical substrate is kinda meh, and our values and preferences are the accidents of a random process (I mean, look how long it took people to develop our present civilization and how dumb that civilization still is! Computers are an obviously superior substrate, as referenced above, much better algorithms than us likely exist, and the only values that aren't accidental are the dangerous instrumental ones). Whereas others either haven't integrated their (formal) views on humans with their thoughts about AIs, don't think we currently understand human specialness, or think humans are special in some way that is in tension with a straightforward extrapolation from evolutionary biology.
For example, I think Sarah Constantin's 'Why I'm not an AI Doomer' article can be in part understood as a claim that humans really are special reasoners in a way that is not understood, and that implies that fatal AI is not imminent.
This is definitely not an original thought, but one limitation of the AINotKillEveryoneism prologues I've read is that they generally seem to cover the issue with a quick gloss of the relevant big-picture science claims (e.g. everything is physics, including you; biology optimizes in a crude way; etc.) and don't really engage with the doubt that comes from the weakness of present explanations of human specialness.
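Going back to point 1 for a moment: for concreteness, this is the standard Solow growth-accounting identity the 'Total Factor Productivity' analogy borrows from (the mapping of implicit knowledge to the unexplained residual is just an illustrative gloss, not anything established here):

```latex
% Solow growth accounting: output Y is attributed to measured inputs
% (capital K, labor L); the unexplained residual A is 'total factor productivity'.
Y = A \, K^{\alpha} L^{1-\alpha}
\quad\Longrightarrow\quad
\ln A = \ln Y - \alpha \ln K - (1 - \alpha) \ln L
% The analogue: account for all gesturizable knowledge, and treat whatever
% performance difference remains as the implicit-knowledge residual.
```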
**Appendix**
Just for clarity, here is how the other views on specialness seem to cash out in thinking about fatal AI being overdetermined:
*Ignorance about specialness*
The point being that each of these ends up undermining overdetermined death, not that people believe all of them:
Dumbness) Human intelligence seems really strong (we make the internet and nukes!), with few clear limits and unexplained features. Maybe we are already really smart?
Base Learning Capacity) We don't understand the base algorithm (if that is even the right term?) for human intelligence; maybe it is optimal and AIs won't be superior?
Brains) Maybe there is something special about brains that makes their limitations, compared to computers, the correct ones?
Values) Maybe our values reflect limitations in value space and stable values have to lie in human value land?
*Real Specialness*
Dumbness) Humans have some innate, special capacity for understanding that can't be physically built.
Base Capacity) Human intelligence and understanding are connected to some special quality, so AIs can only approximate but can't fully equal human capacity.
Brains/Substrate) The brain/basis for human intelligence is doing something special that can't be replicated.
Values) Moral values are real and any non-perverse being will follow those real moral values.
In future, please split such comments up - in general anything this long that's addressing multiple things is better off as multiple comments.
As for the content, (1) the gesturing thing only came up exactly because people were claiming things and refusing to gesture at them in exactly such ways. So dunno what to do with that. (2) This is a clear case of 'different people need different explanations and you lose them if you spend too long on the wrong things', and it's also hard to know what would ever constitute a real crux that would change minds. I'm actually considering a poll series to explore these questions.
And yes I think the default should definitely be 'humans are the minimum viable kludge to intelligence of this type, using what was lying around to be used' rather than us being super special, and that whatever we have that AIs currently don't in terms of intelligence can be copied if needed, but this is the wrong place to make an extended case.