Re: Her, how about "ambitopia" as a word for ambivalent -topias?

And aren't they actually pretty common? I think at the very least The Expanse, For All Mankind, Babylon 5, 2001: A Space Odyssey, HBO's Watchmen miniseries, and most of Star Trek would qualify, to name some famous examples.

Multi-voiced AI reading of this post:

https://askwhocastsai.substack.com/p/gpt-4o-my-and-google-io-day-by-zvi

I wanted this specifically for Zvi's blog!!! Do you have it entirely automated?

It's about 80% automated. There is some formatting and voice selection.

I was actually thinking about doing this too, but I'm glad you got there first. Any progress on voice cloning for the primary speaker? Or chapters, or sentence-skipping navigation? Anyway, it's already much better than Substack's built-in feature, which makes posts with lots of quotes too unpleasant for me to listen to. Hit me up if you want help.

I actually find the generated voices more stable than cloned voices. Not sure I'm sold that chapters are worth the hassle. Any tips for easy implementation?

author

I don't know why you'd want to do it, but if you wanted to clone my voice (plenty of podcasts to use as the source) you are free to do so.

It was half-baked.

Well, there's one thing we should update on: GPT-4o is natively multimodal but doesn't seem all that superior to GPT-4 Turbo in textual abilities. For the past year a lot of people told me that data constraints wouldn't be a problem thanks to multimodal data. It depends on how much image, audio, etc. were used, though, I suppose.

Now everyone is talking about synthetic data, even though no one has shown me how exactly that would work out.

I don't get this "doesn't seem better on text than GPT-4 Turbo" claim.

I used it on Arena multiple times for questions about topics like EU regulatory projects, and it was generally better than everything else. And yes, I can judge the answers in this area; that's why I was asking. I just wanted a shortcut to writing them out.

In my brief testing, the Gemini 1.5 Flash API is even faster than GPT-4o. I asked, "As a test, can you send me 1, 2, 3... 500 on a single line?"

And this is what I got:

Gemini 1.5 Pro (latest): 38s

Gemini 1.5 Flash: 12.8s

GPT-4o: 22.4s

This is super exciting!
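A minimal sketch of how a latency comparison like this can be timed in Python. The `time_call` helper and the `fake_model` stub are hypothetical names for illustration; the stub stands in for the real Gemini/OpenAI API clients, which are not reproduced here.

```python
import time

def time_call(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical stub standing in for a real model API call; the actual
# test hit the Gemini and OpenAI APIs with the same prompt.
def fake_model(prompt: str) -> str:
    return " ".join(str(i) for i in range(1, 501))

result, elapsed = time_call(
    fake_model,
    "As a test, can you send me 1, 2, 3... 500 on a single line?",
)
print(f"elapsed: {elapsed:.3f}s, numbers returned: {len(result.split())}")
```

Wall-clock timing like this measures the whole round trip (network plus generation), so for real APIs you'd want several runs per model to smooth out variance.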

This is a little strange, but I've noticed these AI imaging tools can't put anything on the shoulder of a road, regardless of the prompt I use. I've tried GPT-4o in ChatGPT and now this ImageFX. I was trying to create an image of two men playing cards on the shoulder of a highway, and no matter how I ask, both of these AIs put the men dead center of the road and of the picture. I also can't get an image of a road from the perspective of the shoulder or a sidewalk. I just mention this because AI is so strange in the amazing things it can do combined with the weird ordinary things it can't do. This update hasn't changed that.

I haven't tried this on Midjourney because of their stupid Discord interface (plus I have never gotten it to give me even one free image).

> First it was hands. Then it was text, and multi-element composition. What can we still not do with image generation?

It still isn't perfect at rendering text. It _sometimes_ works, but it's still far from perfect. Here's my Manifold market for it, including samples from 4o: https://manifold.markets/nsokolsky/will-dalle-4-be-able-to-generate-an

I could be wrong, but I thought I read somewhere that OpenAI had filtered the training data, using one of the GPT-4 models to remove the racist/violent/sexist/other bad stuff. How much of the speed improvement do you think came from needing fewer output filter checks? There's no need to check whether GPT-4o is going to use naughty language if it has never seen that stuff. If the closest it's seen to pornography is classical statues and paintings, you shouldn't need to worry as much about it producing pornographic images. This always seemed like an easy win once the models were good enough to recognize safe and unsafe text/audio/images.

author

Based on my experiments with Stable Diffusion, it is easy to end up with a checkpoint that defaults many innocent requests to nudes if you're not filtering. But also, if the model doesn't know HOW to draw a nude, then in some sense it can't draw people. There's a reason artists train on them.

Reading tone of voice and facial expression is going to be a huge accelerator for self-supervised learning. It makes quality user feedback truly effortless; they don't have to settle for proxies like engagement. With hundreds of millions of users, that data will quickly dwarf every other source OpenAI has. Offering it for free, they must know they have a pretty clear path to superhumanly likable AI assistants now. Exciting times ahead! 😬
