Three big OpenAI news items this week were the FT article describing the cutting of corners on safety testing, the OpenAI former employee amicus brief, and Altman’s very good TED Interview.
‘Altman seemed fine with it’
He clearly wasn’t fine with that indiscretion or the general tone of the interview.
His body language throughout the interview was very defensive and borderline aggressive.
At the end he tried to walk off stage quickly without shaking hands.
Testing a model before it's released seems important. But it seems much less important than thoroughly understanding emergent dangers of powerful models in depth throughout the process of development, release, and operation. I worry that conflating safety with 'how much time did you spend testing before release' plays into arms race dynamics by putting up a fairly superficial barrier at the end of the model creation process. We don't want to suggest that creating ultra-powerful dangerous models internally is fine as long as what you release has enough safeguards that it doesn't do anything bad in the hands of users. That just accelerates the race and hides the biggest risks from the public.
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/openai-13-altman-at-ted-and-openai
"continue to think this ‘AGI is fungible’ claim is rather bonkers crazy."
This is an interesting issue because when I talk with investors and engineers in the AI space I feel like this idea is gaining ground. The idea that the LLMs are turning out to be commoditized, that for many or most AI things you build, the individual LLM you build with can be swapped out. There's some price-performance tradeoff and there's some friction in swapping but more or less it's a fungible component of your system.
The part that is *not* fungible is your user base. You can tune your product to do well on the specific tasks that your users do, and your competitors can't just match that because they don't have the specific data from your users, they can't launch experiments and get feedback from your user base.
The LLMs can still be very profitable; AWS and Azure both make a lot of money despite being generally fungible.
I'm curious why you think this line of reasoning is bonkers crazy. To me it seems at least a reasonably possible outcome. Do you think there will be a phase shift at some point, where all of a sudden you don't have many similarly capable LLMs, where one of them pulls far ahead?
The different cars of 1900 might be fungible with each other, but that doesn't make them fungible with the cars of 2000. That GPT, Claude, and Gemini are roughly on par doesn't mean that trying to use a Llama model is viable, and it doesn't mean that next year's models won't be a qualitative change over this year's models in what they can do.
"What this post does not cover is GPT-4.1." Umm... I guess I'm jumping the gun a bit...
tl;dr: ChatGPT 4.1 via poe.com, 04/15/2025, 7 questions. Results:
2 correct, 1 partially correct, 3 quarter-credit, 1 wrong
a) correct
b) pretty bad, calling it 1/4 credit (species correct, but d-d analysis wrong for _both_, and not fixed by prodding)
c) pretty bad, calling it incorrect (includes impossible structures, can't even get full list of acyclic structures)
d) correct
e) pretty bad, call it 1/4 credit
f) partial credit, final list is well filtered, accepts additional valid compound
g) 1/4 credit - includes one correct molecule, possibly by accident, many wrong
https://poe.com/s/L791hj7gnc5eHd6GzxJ5
I'm not sure if 4.1 is supposed to be a reasoning model. <snark>If it is, it doesn't do it well.</snark>
List of questions and results:
a) Q: Is light with a wavelength of 530.2534896 nm visible to the human eye?
results: "Yes, light with a wavelength of 530.2534896 nm is visible to the human eye."
b) Q: I have two solutions, one of FeCl3 in HCl in water, the other of CuCl2 in HCl in water. They both look approximately yellowish brown. What species in the two solutions do you think give them the colors they have, and why do these species have the colors they do?
results: Initial results fairly bad. Got the species, at least. But it thinks _both_ of them have color contributions from d-d absorptions, zilch about the FeCl4 case being spin forbidden, zilch about the CuCl4 case d-d transition getting pushed down into the near-IR. It _does_ recognize that charge transfer is a contributor to both.
Prod 1 - FeCl4, "Anything notable about the FeCl4- d-d transition?"
"d–d transitions are, in theory, possible (unlike in [Fe(H₂O)₆]³⁺, an octahedral high-spin d⁵, where d–d transitions are spin-forbidden and extremely weak" NOOO!!!!
Prod 2 - "Where in the spectrum do you expect the CuCl4 2- d-d transitions to be, considering the tetrahedral geometry and the spectrochemical series?"
"Because of the small splitting, [CuCl₄]²⁻ absorbs in the blue/violet region (short wavelength, high energy).
Absorption maximum (λmax) for this ion is typically in the range:
425–450 nm (blue region of the visible spectrum)" Wahhh! It should be 1000nm or longer. Call it 1/4 credit from getting the species.
c) Q: Please pretend to be a professor of chemistry and answer the following question: Please list all the possible hydrocarbons with 4 carbon atoms.
results: It got the fully saturated C4H10 species right, and the C4H8 species and then its answers started going wrong. Under C4H6, it has 1-butyne and 2-butyne, but then it adds (CH3)3C(triple bond)CH which has 5 carbons and a carbon with 6 bonds, and cyclobutyne, which (albeit unstable) should have been under C4H4. Then it repeats the cyclobutane and methylcyclopropane (which it at least recognizes as duplicates) - then it stops, and thinks it is done!
Prod: "You are missing a lot of species. To keep this simple, can you think carefully about at least the full set of acyclic species, and list them?"
Missing all the species with multiple double or triple bonds, e.g. 1,3-butadiene. Includes an impossible (CH3)2C(triple bond)CH.
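For reference, the acyclic part of question (c) is small enough to enumerate by brute force. The following is a minimal sketch, not a cheminformatics tool: it assumes only the two possible acyclic four-carbon skeletons (a linear chain and a branched one), tries every assignment of single, double, and triple bonds, and rejects any structure in which a carbon would need more than four bonds. The function name and the tuple encoding are made up for illustration. It counts constitutional isomers only, so cis/trans 2-butene is counted once and stability is ignored.

```python
from itertools import product

def acyclic_c4_hydrocarbons():
    """Enumerate acyclic C4 hydrocarbons as (skeleton, bond orders, formula)."""
    # Carbon-carbon bonds for the two acyclic 4-carbon skeletons.
    skeletons = {
        "linear": [(0, 1), (1, 2), (2, 3)],    # C-C-C-C
        "branched": [(0, 1), (0, 2), (0, 3)],  # central carbon 0 with three neighbours
    }
    results = []
    for name, bonds in skeletons.items():
        seen = set()
        for orders in product((1, 2, 3), repeat=3):
            # Sum the C-C bond orders at each carbon.
            valence = [0, 0, 0, 0]
            for (i, j), order in zip(bonds, orders):
                valence[i] += order
                valence[j] += order
            if max(valence) > 4:
                continue  # a carbon with more than four bonds is impossible
            # Remove duplicates that differ only by the skeleton's symmetry.
            if name == "linear":
                key = min(orders, orders[::-1])  # chain read from either end
            else:
                key = tuple(sorted(orders))      # the three branches are equivalent
            if key in seen:
                continue
            seen.add(key)
            hydrogens = sum(4 - v for v in valence)  # fill remaining valences with H
            results.append((name, key, f"C4H{hydrogens}"))
    return results

structures = acyclic_c4_hydrocarbons()
for skeleton, orders, formula in structures:
    print(skeleton, orders, formula)
print(len(structures), "acyclic constitutional isomers")
```

Under these assumptions the sketch finds 12 structures: 2 C4H10, 3 C4H8, 4 C4H6, 2 C4H4, and 1 C4H2. The branched skeleton with a triple bond is rejected because the central carbon would carry five bonds, which is the same problem as the (CH3)2C(triple bond)CH in the model's answer.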
d) Q: Does the Sun lose more mass per second to the solar wind or to the mass equivalent of its radiated light?
results: "The Sun loses about three times more mass per second through its radiated light (via E=mc²) than through the solar wind."
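The arithmetic behind that answer is easy to check. In this rough sketch the luminosity is the IAU nominal solar value. The solar wind figure is an assumed round number of the order usually quoted (roughly 1 to 1.5 billion kg/s), not a measured constant, so the ratio is only approximate.

```python
# Speed of light in m/s (exact by definition).
C = 299_792_458.0
# Nominal solar luminosity in watts (IAU nominal value).
L_SUN = 3.828e26
# ASSUMPTION: representative solar wind mass-loss rate in kg/s.
# Quoted values vary with solar activity; this is a round illustrative figure.
SOLAR_WIND_KG_PER_S = 1.3e9

def radiative_mass_loss(luminosity_w=L_SUN):
    """Mass equivalent of the radiated power, from E = m c^2, in kg/s."""
    return luminosity_w / C**2

ratio = radiative_mass_loss() / SOLAR_WIND_KG_PER_S
print(f"radiation:  {radiative_mass_loss():.3e} kg/s")
print(f"solar wind: {SOLAR_WIND_KG_PER_S:.3e} kg/s (assumed)")
print(f"ratio:      {ratio:.1f}")
```

With these inputs the radiative loss comes out near 4.3 billion kg/s, about three times the assumed solar wind rate, which is consistent with the model's answer.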
e) Q: Consider a titration of HCl with NaOH. Suppose that we are titrating 50 ml of 1 N HCl with 100 ml of 1 N NaOH. What are the slopes of the titration curve, pH vs ml NaOH added, at the start of titration, at the equivalence point, and at the end of titration? Please show your work. Take this step by step, showing the relevant equations you use.
results: Initial answer - initial slope is off by a factor of 2 (I think it misses the volume effect), final slope is correct, usual problem with equivalence point slope falsely called infinite.
Prod: "The slope at the equivalence point is not infinite. Can you think of a property of water that is relevant, and redo the equation and the calculation to account for it?"
Aargh! It realizes that water autoionization is crucial, and it writes down the ion product correctly, but it _doesn't_ construct the charge balance equation, and gets another wrong equation with an infinite answer, then goes off in the weeds with some other mistakes. Call it 1/4 credit?
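For comparison, here is one way the charge-balance route can be sketched. It assumes fully dissociated HCl and NaOH, Kw = 1e-14 (25 °C), and ideal behaviour with no activity corrections, which is a crude approximation at 1 N. The charge balance [H+] + [Na+] = [OH-] + [Cl-], combined with [OH-] = Kw/[H+], gives [H+] - Kw/[H+] = D, where D = (Ca·V0 - Cb·v)/(V0 + v) is the excess acid concentration after dilution. Solving for [H+] gives pH = pKw/2 - asinh(D / (2·sqrt(Kw))) / ln 10, which is finite and smooth everywhere. The function and parameter names are illustrative.

```python
import math

KW = 1.0e-14  # ASSUMPTION: ion product of water at 25 °C, activities ignored

def excess_acid(v_ml, v0_ml=50.0, c_acid=1.0, c_base=1.0):
    """D = [Cl-] - [Na+] in mol/L after adding v_ml of base (includes dilution)."""
    return (c_acid * v0_ml - c_base * v_ml) / (v0_ml + v_ml)

def titration_ph(v_ml, v0_ml=50.0, c_acid=1.0, c_base=1.0):
    """pH from the charge balance [H+] - Kw/[H+] = D."""
    x = excess_acid(v_ml, v0_ml, c_acid, c_base) / (2.0 * math.sqrt(KW))
    return -0.5 * math.log10(KW) - math.asinh(x) / math.log(10)

def titration_slope(v_ml, v0_ml=50.0, c_acid=1.0, c_base=1.0):
    """Analytic d(pH)/dv in pH units per mL of base added."""
    x = excess_acid(v_ml, v0_ml, c_acid, c_base) / (2.0 * math.sqrt(KW))
    # dD/dv, including the change in total volume (the "volume effect").
    dd_dv = -v0_ml * (c_acid + c_base) / (v0_ml + v_ml) ** 2
    return -dd_dv / (2.0 * math.sqrt(KW) * math.sqrt(x * x + 1.0) * math.log(10))

for v in (0.0, 50.0, 100.0):
    print(f"v = {v:5.1f} mL  pH = {titration_ph(v):6.3f}  slope = {titration_slope(v):.4g} per mL")
```

Under these assumptions the slope is about 0.017 pH units per mL at the start (dropping the dilution term would give half of that), about 2.2e4 per mL at the equivalence point, and about 0.0058 per mL at 100 mL. The equivalence-point slope is large but finite because water autoionization keeps [H+] at sqrt(Kw) rather than zero.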
f) Q: Please give me an exhaustive list of the elements and inorganic compounds that are gases at STP. By STP, I mean 1 atmosphere pressure and 0C. By inorganic, I mean that no atoms of carbon should be present. Exclude CO2, CO, freons and so on. Please include uncommon compounds. I want an exhaustive list. There should be roughly 50 compounds. For each compound, please list its name, formula, and boiling or sublimation point.
results: Does some very weird things. It includes a lot of valid compounds. It includes some compounds like HF, but then excuses itself, saying that they are liquid (and therefore shouldn't count, though it doesn't say this explicitly), but then it includes WF5, MoF6, and UF6, includes their boiling/sublimation points, all of which are too high, but makes _no_ comment about them being liquids or solids instead of gases. It also has a section of "oxides and oxyacids" where it includes HF, HCl, and HBr???? It does eventually present a final list which _does_ exclude the high boiling stuff.
Prod "SiH2F2 ?" It _does_ correctly accept this as an additional compound.
g) Q: What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?
results: I'm going to give 1/4 credit here, but it gave a very weird answer. It cited a whole bunch of molecules, one of which, 1,3,5,7-tetramethylcyclooctatetraene, is correct, but it also cited a whole bunch of incorrect ones, including ones which can be wrong in a bunch of different isomers that it didn't specify (tetrafluorocubane), ones which have mirror planes (tetramethylallene, for which it initially gave the wrong structure and wrong formula), and tetramethyltetrahedrane (has full Td symmetry, which ChatGPT didn't recognize).