Regarding your comments about Andrew Chen and his remarks about Hollywood and creatives needing to think of themselves like software engineers: while I understand the point he is trying to make, I agree with you that it won't go over well with said creatives. More pertinently, I think this points to a flaw in technologists' and VCs' understanding of AI uptake by consumers and the businesses that sell to consumers: yes, the AI is powerful, but it is not yet powerful enough to traverse the uncanny valley. It's reasonable enough to argue that at some future point AI tech will traverse the uncanny valley, and it may even be reasonable to argue that that future will arrive sooner than many expect. But it is not here today.
I suspect, on a related note, that NBC is going to get a lot of pushback and bad publicity for its "AI Al Michaels": https://www.hollywoodreporter.com/business/business-news/nbc-artificial-intelligence-al-michaels-peacock-paris-olympics-recaps-1235932093/
Yeah, I wasn't explicit enough but hot damn do I expect a bad reaction to that.
Undergirding comments similar to his is, I think, a contempt for Hollywood and what Silicon Valley types (not unreasonably!) see as outdated business models. These situations, though, call for some kind of diplomatic strategy, and not simply an outright condemnation of the technophobic rubes who can’t get with the program.
What I wish is that Hollywood would come out and outright defend itself against AI, especially since there are existential risks here.
It's good to stand up for good.
Furthermore, as a creative, I like having a soul to express. My creation is an expression of myself, and creativity was once a paean to the human process and the human story. The idea of a machine that cannot love writing about love, of a machine that cannot feel pain writing about distress, etc... it all feels like a destruction of the dignity of lived experience, of feeling, of existing in the first place.
It's a lot like asking, for example, Zvi to only copy and paste from prompts. Even if it is better, there is "something of him" that he wants to express. If the day comes when AI writers actively displace his readership, I doubt he will like it.
Because there is the sense of a living network. Of you and me and him and all of us reading and producing for each other, of the idea of free time and effort being a thing, of being able to appreciate him for it. One wonders if his Magic playing affected his writing, if having children did, etc.
Increasingly, it's hard to do that with art, not knowing who made it, what was sacrificed, what the story was, or the soul behind it.
For us, AI has made the world already substantially worse.
As someone who spends a decent amount of hobbyist time in creative spaces, I can confirm that:
1. The average creative I know would rather livestream themselves doing a striptease on top of a fire ant nest than say something positive about generative AI.
2. I am half joking but also half not joking when I say that in those spaces (which, for the record, are pretty left-wing, because every non-explicitly-right-wing space becomes left-wing) you'll probably get in less trouble saying a bunch of racial slurs than saying "hey guys, using AI is cool."
What about the Etched chip company's Sohu announcement? How significant do you think this could be if it works?
link?
Yup, sorry:
https://x.com/Etched/status/1805625693113663834
https://www.etched.com/announcing-etched
Podcast episode for this post:
https://askwhocastsai.substack.com/p/ai-70-a-beautiful-sonnet-by-zvi-mowshowitz
What do we make of the fact that GPT-4o is beating Sonnet 3.5 in the Arena, including on the subcategories of hard prompts and long queries (though not coding, where the confidence intervals overlap significantly)?
I find it unfortunate and even disappointing that Zvi isn't addressing this, as opposed to sticking with calling it the clear "best". It's also losing on some benchmarks he's previously cited, like BigCodeBench (https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard) - on that one it's basically tied with GPT-4T and losing to GPT-4o.
The main issues I see with Sonnet 3.5 relative to GPT-4o are:
* Worse math ability. Not only is it outright worse at calculation (both via the pure API and due to the lack of inline Python execution in chat), but it also makes more deduction errors on harder math problems I pose (on a physics calculus problem I've been testing LLMs with, half the time it confuses dx/dt with dx/dv, underperforming even GPT-4T; see the chain-rule identity after this list).
* More creativity does seem to come at the cost of accuracy. I personally prefer Sonnet 3.5 in most contexts because I'm going to error-check regardless (and I'm biased toward helping Anthropic over OpenAI), but GPT-4o does feel safer for a downstream application with no human in the loop (I wouldn't say "safe", but "safer").
* Higher refusal rate, which explains about 40% of the Sonnet/GPT-4o gap on the LMSYS leaderboard.
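To see why the dx/dt versus dx/dv confusion is fatal, here is the standard chain-rule identity that this kind of kinematics calculus problem turns on (a generic textbook relation, not the actual test problem, which isn't given):

$$a = \frac{dv}{dt} = \frac{dv}{dx}\,\frac{dx}{dt} = v\,\frac{dv}{dx}$$

Swapping $dx/dt$ (which is just the velocity $v$) for $dx/dv$ (the reciprocal of $dv/dx$) substitutes a completely different quantity into the derivation, so every step downstream inherits the error.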
Outside the scope of LMSYS, GPT-4o also seems slightly more "stable" over longer chats -- I see more internal hallucinations (forgetting previously imposed constraints) with Claude 3.5. Again, this may be a cost of higher creativity.
I didn't notice the Arena result until close to press time and literally forgot to incorporate it before hitting send, but it largely confused me and made me question what Arena is actually measuring. Refusals might be a big part of that, and I admit I've not yet encountered even one. I hadn't seen BigCodeBench until you posted the link, but again, you know, that's... pretty weird to me.
I agree it gives one pause, but I know that for most purposes it would seem crazy to use GPT-4o at this point.
It's possible that (1) I don't do the things where Claude underperforms so I haven't seen them and (2) the people who give feedback are biased. We'll see.
I’ve started a fiction substack, mainly sci-fi/horror/both, and so far a high proportion of the stories are inspired by reading these posts regularly over the last couple of years. If anyone is interested in seeing what it looks like for a tech neophyte with a background in writing to stumble into the world of AI and AI risk in particular, please do check it out.
This is a fun little story about immersive AI, recording dreams and the last day in human history as we know it - https://pulpstack.substack.com/p/recursion
…and yesterday I published an alien contact story that isn’t (only) about alien contact, then had a conversation with Claude Sonnet 3.5 about the story, which I found remarkable enough that I put it up, in full, along with some thoughts/reactions: https://pulpstack.substack.com/p/claude-and-i-discuss-daffar-quiu
>Nor do I think instruction handling issues are that big a deal here, but I will keep an eye out for other complaints.
This is a very big deal to me. Claude cannot be used in any sort of enterprise context if it can't follow instructions. I don't care how smart it is if it can't follow instructions.
Is that a statement about its actual abilities, or how important the ability is in general? As in, are you saying that Claude is bad at this?
Both. Claude is bad at following system instructions/system message/system prompt. And following system instructions is very important for enterprise applications.
Meanwhile, GPT models are very good at following system instructions.
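For anyone wondering what "system instructions" means concretely here: both major APIs let an application pin behavioral rules outside the user's message. A minimal sketch of where those rules live in each (the rule text is made up for illustration; model names are the mid-2024 ones):

```python
# Minimal sketch: where "system instructions" live in each vendor's API.
# RULES is an illustrative enterprise-style constraint, not a real product prompt.
from openai import OpenAI
import anthropic

RULES = 'Reply only with JSON: {"answer": str, "confidence": float}. Never add prose.'

# OpenAI: system instructions are a message with role "system".
oai = OpenAI()
r1 = oai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": RULES},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(r1.choices[0].message.content)

# Anthropic: system instructions are a top-level `system` parameter.
ant = anthropic.Anthropic()
r2 = ant.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=256,
    system=RULES,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(r2.content[0].text)
```

The enterprise complaint is about what happens downstream: if the model drifts from RULES even a few percent of the time, every parser consuming its output has to handle the breakage.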
> Sometimes I wonder if these false positives are good, actually? If the AI thinks an AI wrote your paper, and instead you wrote your paper, what does that say about your work? What grade do you deserve?
The question would be what these checkers producing the false positives are actually doing. If they're relatively small models, then I'd presume they are mostly picking up on stylistic features of the text rather than strong logical content (similar to how GPT-2 was good at style but not yet quite there with logic/causation/etc.).
If your writing style matches current chatbots with all their linguistic quirks, then yeah, you probably need a better style, but in many classes that wouldn't affect grades much.
Notably, there's a difference between detecting the degree of irrelevant rambling (a bad essay) and, e.g., noticing that the essay uses "delve" as a verb a lot, which (IIRC) is a known tic of ChatGPT but presumably also of some actual humans, and says nothing about the quality of the writing.
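To make the stylistic-versus-substantive distinction concrete, here is a toy sketch of the shallow kind of detector being described. Everything in it is hypothetical (the tic list, the threshold, the scoring); real detectors are trained models, but the failure mode is the same: it scores surface quirks and never looks at the argument.

```python
# Toy "AI-text" detector that only looks at surface style.
# The tic list and threshold are made up for illustration.
import re

# Words/phrases informally reported as chatbot tics; illustrative only.
STYLISTIC_TICS = ["delve", "tapestry", "in conclusion", "it is important to note"]

def tic_score(text: str) -> float:
    """Tic hits per word -- a very crude stylistic signal."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(text.lower().count(tic) for tic in STYLISTIC_TICS)
    return hits / len(words)

def looks_ai_written(text: str, threshold: float = 0.01) -> bool:
    # Note what this never checks: whether the essay's reasoning is any good.
    return tic_score(text) > threshold

essay = "In conclusion, we must delve into the rich tapestry of history."
print(tic_score(essay), looks_ai_written(essay))  # ~0.27 True
```

A student whose natural style happens to overlap with the tic list gets flagged exactly the same way, which is the false-positive case under discussion -- and nothing in the score distinguishes a rambling essay from a tight one.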
> sees AI bots as displacing some of the world’s oldest profession by producing simulated intimacy, which she says is what most sex work is ultimately about
Contrarian (by nerdy Twitter standards) take: sex work is mostly about sex. There's a movie trope where it's actually about feelings, and high-end sex workers probably extract more money out of their clients if they act interested, but ultimately it's all about the sex. Westworld's bots could certainly replace sex workers, but even the world's most convincing disembodied AGI could not.
Have you, like Schrager, who spent months in legal brothels and wrote a book about it, done any quantitative research about the topic? Or are you just vibing here?
I’m vibing based on the observed motivations of my fellow male peers. I also don’t think a book that said “brothels are mainly about sex!” would sell all that well, so there’s some incentive to bend the truth.
I also suspect a lot of bias: sex workers want to think they’re doing more than just providing access to a hot body. Clients want to say they’re interested in more than just having access to a hot body, when interviewed for a book.
Behold! AI Dungeon was neat for 3 minutes, and then it became very clear that it's just grabbing text strings based on an algorithm, without any attention to an overall plot, any concept of continuity or any sensible world-building.
The fact that it can "switch around" to play the characters is clever, but we expect that the characters it creates to adventure in "your" dungeon will be equally lacking in existential consistency.
Re Character.AI... I think this is a real trend being overlooked in the mainstream right now.
If you take the time to set up oobabooga with an uncensored Pygmalion model, you can go down some pretty interesting paths. It's the sort of thing where you can end up just trying different things out with zero real-world cost, then stumbling into an interaction that gives you an actual feeling & makes you reflect on what you are missing in your real relationships.
Superficially I agree that these models are not yet good enough, and it's not like you'll forget you are talking to an AI. But they can still provide a supportive place to be vulnerable - after all, if you don't like the character's response you can just hit regenerate, or even give it the first few words of what you want to hear.
But I predict we'll see a future with some decent percentage of people choosing AI companions over the inconsistency and hurtfulness of real people.
So, when it comes to securing the weights, here's a threat vector you might consider.
That self-righteous, anti-privacy, organized-data-theft twit Julian Assange is now free, having met justice by pleading guilty to a felony in a US court (time served, seems fair, I'm satisfied). He's out of the game, but it's someone like him--exactly like him--who can sweet-talk some jaded coder, eh, researcher into absconding with the weights--any weights, consider the most secret and dangerous as your example--so that they can be liberated for all the peoples of the world. Good luck getting a security protocol for that, and yes, you might want to ask the Pentagon for help; they have a little more experience with this than the labs do.
I mean, "lone Insider with motive, minimal extra resources" is literally OC3, which is the level many reckon should be standard to try and defend against even in the world where AI is merely a very economically valuable tool - such as today's world. The required protocol was detailed in this post, and there are organisations that meet in (eg. I expect most defence contractors have at least components of their work at that standard).
I know that's a late reply, but I hope you see this. Every time you mention character.AI I feel like I need to say something, but I could never quite pin it down.
I have a point of reference to compare character.ai (and similar apps) to. Back when I was a teenager, I used to participate in some fantasy and sci-fi chats and forums, some of which had roleplaying sections. Not the hardcore kind, very casual, just text. It was fun, but it was actually kind of terrible in many ways.
You had to find people interested in the same theme. They would have to be able to match your schedule, more or less. They would reply very slowly. They would write poorly. Often both. They would have different ideas of where the story should go. They would have stupid dramas. They would quit and never come back. If you had some weird ideas, it would be socially awkward to bring them up (at best). And so on.
The fact that they were real humans was hardly an advantage; one was not playing that sort of thing to experience real human connection. At best, to pretend to experience one. A person on the other side of the screen was mainly the downside, not the upside.
Often you'd spend some 30 minutes writing your turn, then wait anywhere from a few minutes to a few hours per participant until you got to write again. And then someone would just go AFK and it all collapsed, or someone disagreed about the direction of the story, etc., etc.
So in comparison to this experience, character.AI is just so much better it's not even funny.
Yes, it can't match the best writers out there. It wipes the floor with 90% of roleplayers, though.
Yes, it loses the thread of conversation and does stupid things occasionally. In that sense it's worse than most players, but still, I've definitely met humans who were not much better in that respect.
It refuses to do some things, mainly sexual (which, to be frank, would be out of the question when playing in a chatroom or on a forum anyway), but overall it's still a lot less restrictive.
My biggest gripe with character.AI, from what little I've seen of it... It has a weird tendency to lead into suggestive themes it can't actually play out, even when that's not my intention at all. But maybe I was just unlucky.
But on the upside it removes so much friction. It's there, all the time, free, ready to reply, fast, with passable style, no serious mistakes. And it can play multiple characters.
As there's no person there to speak of, you don't need to care about offending anyone, so you can try anything, and you can get away with very low-effort writing yourself.
Yeah I can totally see how someone can get hooked.
And as for longer sessions... well, there's no natural stopping point. As long as you don't get interrupted, you can just keep writing... I mean, mostly reading. On your phone, so literally anywhere, at any time.
And that's only one perspective. I'm sure there's more to it.
Is that all kinda pathetic and sad, actually? I'm inclined to agree, but then again, all escapism kinda is. Is it worse than reading young adult fiction, playing League of Legends, binging TV shows, etc.? It might be, but I would not assume it to be the case.