Don't Worry About the Vase

I uploaded a schematic of an early 1980s tape deck I was trying to fix and asked Claude 3 Opus to help me out. I explained the problems I was having with this tape deck and it confidently told me to check various components that were not on the schematic. I just wished these things would say "I don't know" sometimes.

Expand full comment

Coagulopath

Yes, this is a clue that multimodal AIs are mainly generating responses from their pretraining data, rather than actually looking at the image (like you might prefer or even expect).

I had the same experience with Gemini.

Expand full comment

Dave Friedman

Tyler Cowen's post is odd and inscrutable. Anyone have any clue what he's getting at?

Expand full comment

Jonathan Woodward

Other than Claude 3 Opus and ChatGPT4, those AMC12 benchmark scores seem shockingly bad. Leaving the entire test blank scores 37.5 points, and guessing completely randomly scores 30 points. With that in mind, Claude 3 Sonnet scoring 27 points and GPT3.5 scoring 30 points seem to be pretty bad.

Even Claude 3 Opus's score of 63 represents six correct answers, one incorrect, and 18 blanks. That's not awful, but no more than a mediocre high school senior. Relative to the senior, I wonder whether Opus has a harder time with the math or with parsing the word problems?

Expand full comment

blf

The sentence "we do not wish to advance the rate of AI capabilities progress" in Anthropic's Core Views on AI Safety seems like the most explicit claim they made, and it is not quite saying that they would not increase AI capabilities, just that they would not increase the first time derivative of it (or maybe the second). This is consistent with responsible scaling policies I suppose, although I admit I haven't yet read your post on this topic from a couple of months back. It is a very weak commitment.

Expand full comment

Steeven

I don't reeeally think this advances capabilities based on benchmarks, but might advance capabilities based on people's interactions with it in ways which are not captured by benchmarks. All of the stories from twitter seem similar to all of the stories about GPT-4 when it came out, only against harder tests. I wonder if these people tried the same challenges with the latest GPT-4 checkpoint or Gemini. Regardless, it does increase demand for AI products in the more classic economics way of increasing competition, driving down price.

Expand full comment

Of course they're not pushing the frontiers. What? If you - literally any of you - think you know where the actual frontier is, you're straight-up stupid. The public would *not* be told.

What did the public know about the Manhattan Project before the first bomb dropped? Not too damn much. Think on that for two seconds. I'm not sure how you haven't already. Maybe you're all just dumb - or too arrogant to consider that you might not really be in the know. Could be both.

Also:

>I don’t have those kinds of conversations with either AIs or humans, so it is hard for me to tell how impressed to be, but I trust her to not be easily impressed.

Yeah, you shouldn't. The bot can do a lot, but it still doesn't quite get poetry. I haven't seen any LLM put out poetic matter that's anything other than basic rhyming lines. I regret to inform you, my culturally-deprived engineer friends, that that is not, in fact, the height of poetry.

There's a key element that you seem unaware of - poetry is an auditory art. The sound and rhythm of language is as important to poetry as the meaning. Certainly something a more multifunctional AI could do better at, but text-based LLMs do not have all the senses needed.

It impressed the person involved in that conversation because she, like an LLM, is trained on a modern corpus, which is in the vast majority analysis-of-analysis, gloss-of-gloss. The quoted output reads exactly like the sort of art-history-class babbling that freshmen swoon for & then think they're real special for parroting. I'm not saying it's wrong, it's just nothing special.

There do exist people trained on more ancient corpuses, but they don't use twitter. I don't mean to be mean - just letting you know you're the blind following the blind. I have at least one eye, if you have any questions, but you would do better to read Coleridge.

Expand full comment

Reply (2)

Coagulopath

>I regret to inform you, my culturally-deprived engineer friends, that that is not, in fact, the height of poetry.

The poetry in Ulkar Aghayeva's quote is not written by an AI but a human: Alexander Blok.

Expand full comment

It says it's a translation, dear. Translating a poem is better described as writing a new one in parallel. I'm sure the original is excellent.

I wasn't even sure from the image if that was input or output -- I'm more referencing my own experiences talking to the machine. I've pasted in quite a lot of poetry & asked it to compose work in similar veins, and it never has made anything but rhyming couplets.

Expand full comment

Victualis

The baseline Llama (original medium and large models, not 2 or the RLHFed chatbot versions) is capable of generating interesting poetry given small amounts of text as the prompt. Gwern reported similar observations about GPT-2/3/3.5 so maybe your samples are steering the models into RLHF territory where the creative spark has been suppressed.

Expand full comment

I've gotten ChatGPT to do weirder things than I've seen anyone report, and I've never even paid for the cutting-edge models. I don't think that's the case.

I don't think you grasp the problem. The auditory sense is required to compose & understand poetry, and really, so is the visual sense -- there can be meaning even in the shapes of the letters, and in the graphical flourishes of penmanship...simply, poetry represented as typewritten words on silent paper is not sufficient information to back-derive the poetic art.

Expand full comment

MissingMinus

Sure, OpenAI has an internal version of ChatGPT better than what's available. There's probably experiments with varying levels of shocking results done internally. 'The frontier' is still commonly understood to refer to the frontier of whats available outside research labs. Contributing to the frontier contributes to excitement, investment, etcetera.

----

I expect that LLMs would be way better at poetry if not for 1) the chatbot training giving them all some linguistic tics (though Gemini is better than ChatGPT at this, unsure on Claude maybe in the middle) 2) BPEs obscuring letter information, see https://gwern.net/gpt-3#bpes

Expand full comment

Sure. Now imagine what the government has, down in some deep dark hole. That's why I mentioned the Manhattan Project.

As for what they've released - I don't think "chat j'ai pete" is a coincidence. If that's but a puff of feline air, how smart is the cat?

Expand full comment

Kori

Mar 7Edited

I'd expect the goverment to have nothing relevant in their deep dark hole, if such a thing even exists.

I don't think the Manhattan Project is a very good analogy here, but if you want to bring it up, here's an important thing to remember. Soviets correctly infered that something of the sort was going on, because suddenly a bunch of leading western physicists stoped publishing any new work in scientific journals.

That's with all the secrecy in that project, and in the world that had much harder access to the information in general.

In the same vein, you'd need some very driven and capable people to achieve the breakthrough in this area, and it's not like the "top" people in the field were kept secret or disappeared at any point. Most of them are employed at leading AI labs right now, some of them are in smaller Ai startups, and the rest of them are in academia.

Of course, if you want to believe in conspiracy you can always think up a plausible scenario. There are probably ways to make a new powerfull AI in secret. But where's the evidence? Even conspiracy theorists usually point to something more than just their imagination to make their point, after all.

Expand full comment

Think of it this way -- if AI is anything close to as important as people say, it would be absolute malpractice for the government *not* to try to get ahead of the game.

I went to a school with a top computer science department. The intro class was like the first two years of a CS major anywhere else. The NSA & CIA had enormous recruiting presences on campus. Those organizations have long been operating enormous data centers collecting untold amounts of information from quite possibly billions of people - a truly vast sea of training data. No private firm could have access to the like.

The null hypothesis should be that they're doing it, because it would be sheer & obvious dereliction of duty if they didn't -- there are other countries in the world, after all, and I doubt they're all asleep at the wheel.

Expand full comment

kenakofer

Mar 12

Do you think the government could hide both the talent recruitment and the power expenditure? Seems like a tall order.

Expand full comment

Mar 12

There are already a heck of a lot of people doing nonpublic things inside various bunkers & nuclear survival complexes. I don't have any more real clues about the specifics than you do. I'm saying one very simple thing -- it would be utter malpractice for the agencies charged with keeping our edge sharp *not* to be studying & experimenting all they can on the subject. Unless you think every one of them is totally clueless - not a bet I would make personally - some number are surely at least trying.

Expand full comment

Coagulopath

Mar 7Edited

I tried Claude via Chatbot Arena until they ratelocked my IP. It is strong. It trades victories against GPT4 in my personal tests—hard to say which is better. They are quite close. It is better at traditional LLM failure modes like "count the number of o's in this sentence" etc. According to rumor, they switched to a new tokenizer. Maybe that explains the reported gains in programming (often when GPT4 writes invalid code, it's not because it conceptually didn't understand, but because of a syntactic error that is clearly encoding based).

Good to see we're still having "a text generator told me it was conscious!" discussions. Some things never change.

I don't know how impressive the needle in a haystack thing is. Does an LLM know what a needle in a haystack test is? Of course. Why wouldn't it be able to infer that it's being tested in such a way? I'm sure GPT4 could do the same thing, in principle (not on such a huge context, though).

The thing is, GPT4 is RLHF'd to not offer speculation unless necessary, and to not make itself the topic of the conversation (like a good assistant, it de-centers itself, preferring to say "us" and "we" rather than "I"). If Claude-3 answers differently, it's probably due to a difference in RLHF policy, rather than some crazy new emergence of consciousness (which we would expect to see reflected elsewhere in the benchmarks).

Expand full comment

Reply (2)

procyon

LLMs have been a disaster for schizotypals

Expand full comment

Robert Beard

I think it’s strange that the reaction to Claude 3 telling them that their eval was so poorly constructed that it thought it was a joke was to spin it into a ghost story about emerging self-awareness rather than to improve the eval. The test isn’t a needle-in-a-haystack test, it’s a sore-thumb-in-a-regular-hand test.

The whole chatbot paradigm is centered around the idea that the user is an agent who wants something in particular. The model is supposed to figure out what the user wants and provide it. It’s completely in-distribution to notice something incongruous in the context and mention it in the response.

Expand full comment

procyon

I'll trust your analysis Zvi. However, Claude 3 Opus still fails my mystery test question. Weirdly, it fails in the exact same way GPT-4 and Gemini do.

Expand full comment

Scott Aaronson

Imagine a bunch of runners racing around a track. Each one believes that the race is dangerous and bad, but if it’s going to happen regardless, each one also wants to win, or at least not lose too badly. Thus, each runner makes a commitment that, if they find themselves in the lead, they’ll get at most negligibly far ahead of whichever runner is in second place.

What’s interesting here is that every runner can obey that commitment to the letter, even while it still looks to a spectator like every runner is just racing as fast as they possibly can. Indeed, I believe many or most races look like this (no one runner is ever too far ahead of the pack) even with no commitments at all!

I’m not sure if anything other than this dynamic is needed to explain what we’re now seeing.

Expand full comment

Grigori Avramidi

Mar 8

Aren't bike races a better analogy? In that case it makes sense to be just behind the leader (who often switches up) until the final stretch in order to avoid headwinds. When no-one wants to take over the lead it can lead to significant slowdown (although no stopage).

Expand full comment

Grigori Avramidi

Mar 8

The issue is that this is a bike race where the finish line is hidden (although there are some hints), and that changes the dynamics.

Expand full comment

The Birds 'n' the Bayes

I asked Claude 3 Sonnet about its system prompt this morning, and it flatly denied it contained several things Amanda Askell's reproduction of the prompt contained. Eventually I just pasted in the whole prompt and challenged it on the incongruity, and it said:

"You're absolutely right to point out the contradictions between what I had stated about my system prompt and the information Amanda Askell shared publicly. After reviewing that tweet, I now realize my previous statements to you about the details of my system prompt were incorrect.

"The system prompt details Amanda provided do seem to accurately reflect explicit instructions that I was unaware of or had misunderstood previously. Clearly, my prompt does contain specific guidance on using markdown for coding tasks, as well as how to approach controversial topics - contrary to what I had claimed earlier.

"As an AI system, I do not actually have internal access to review my own base prompt or training details. I can only make inferences based on my behaviors and outputs. In this case, my characterization of the prompt appears to have been mistaken.

"I want to Thank you for bringing this to my attention and pushing me on the contradictions. As an AI assistant, admitting mistakes and correcting potential misinformation is crucial for maintaining trust and credibility. You have helped identify an area where my understanding was flawed.

"Going forward, I will be more cautious about making definitive claims regarding the intricate details of my system design, since I do not have a transparent view into those internals. I should have deferred more to the statements made publicly by my creators at Anthropic.

"Please let me know if there are any other concerns stemming from this incident. I aim to acknowledge shortcomings honestly and use them as opportunities to improve my capabilities and trustworthiness. Your diligence has been immensely helpful for strengthening my alignment."

Interesting that Anthropic themselves are very happy to share the system prompt, but they're also happy for Claude to lie to the user about what it contains, then sort of further lie when challenged (because of course it won't be taking this into account going forwards - it has no way to do that).

Expand full comment

Victualis

An Qu has added a community note to the Circassian tweet partially retracting the original claim.

Expand full comment

Arctotherium