Discussion about this post

Mo Diddly

“Things take longer to happen than you think they will, and then they happen faster than you thought they could.”

Is this the same as the observation that, in general, we overestimate how much progress will be made in the short term and underestimate how much will be made in the long term?

Sinity

> It makes sense, given the pace of progress, for most people and companies not to put that kind of investment into AI ‘employees’ or other AI tasks. But if things do start to stall out, or they don’t, either way the value proposition on that will quickly improve. It will start to be worth doing. And we will rapidly learn new ways of doing it better, and have the results available to be copied.

> And the core building block issues of computer use seem mostly like very short time horizon tasks with very easy verification methods. If you can get lots of 9s on the button clicking and menu navigation and so on, I think you’re a lot of the way there.

Even if model progress completely stalled, there could still be massive improvement in computer use just by not trying to solve the problem in the most brute-force, human-centric way possible. https://x.com/_sinity/status/1930645004449443910

The following is excerpted from a summary of a chatlog. I don't know whether these specific ideas would work, but _something_ in this general direction has to be viable.

> Sinity, sensing the untapped potential of the accessibility stack, presses for detail. Could AT-SPI2 give us not just pixels or keypresses, but a live widget tree—roles, relationships, text, and structure, all as digital data? Opus’s answer is categorical: yes. AT-SPI2 exposes the full widget hierarchy of GTK, Qt, Electron, and more. Window trees, menu bars, toolbars, document objects, text fields, headings, labels—everything accessible is structured and streamable.

> [...] With streams of this richness, Sinity notes, “this is EXACTLY what computer use agents need!”—contextual state, visible text, available actions, and focus history, all in real-time, all structured. Agents, whether AI or user-augmented, could reason not just about “what window is open,” but what that window is, what is visible, and what can be done.

> Yet the meta-question lingers: Why isn’t this already standard? Sinity expresses incredulity—“EMH-intuitions scream at me that I must be deluded about this for some reason.” Opus explains: the technical path is clear, but social and market forces (privacy paranoia, lack of obvious products, engineering overhead, and cloud-first inertia) have prevented a unified, user-centric data layer from taking hold.

> [...] Finally, Sinity raises the output side: “there’s read, and then there’s write.” Could these same hooks (compositor, input devices, Neovim/terminal, accessibility, browser) serve as control points—letting agents automate actions, populate text fields, simulate clicks, or invoke commands? Opus affirms: Hyprland can synthesize input; AT-SPI2 can manipulate widgets directly; browser extensions can run arbitrary JavaScript in context. The infrastructure supports not just observation, but direct, programmatic intervention—a full closed-loop system for AI or user-driven automation.
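
To make the "read" side concrete: this is not from the chatlog, just a minimal sketch assuming the pyatspi Python bindings and a running AT-SPI2 bus (and that the applications have accessibility enabled), but dumping the live widget tree as structured data is only a few lines:

```python
import pyatspi  # Python bindings for AT-SPI2, the Linux accessibility bus

def dump_tree(acc, depth=0):
    # Every accessible object exposes a role name, a (possibly empty) name,
    # and its children, so the whole widget hierarchy can be walked as data.
    try:
        print("  " * depth + f"{acc.getRoleName()}: {acc.name!r}")
    except Exception:
        return  # defunct or unreadable objects are common; just skip them
    for child in acc:
        dump_tree(child, depth + 1)

# Desktop 0 is the root of the tree; its children are the running applications.
desktop = pyatspi.Registry.getDesktop(0)
for app in desktop:
    dump_tree(app)
```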
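
The "write" side looks similar. Again a sketch rather than a tested recipe; the role string ("push button") and the assumption that action 0 is the click are guesses that vary by toolkit:

```python
import pyatspi

def find_widget(acc, role, name):
    # Depth-first search for a widget by AT-SPI role name and accessible name.
    if acc.getRoleName() == role and acc.name == name:
        return acc
    for child in acc:
        found = find_widget(child, role, name)
        if found is not None:
            return found
    return None

def click_button(app_name, button_label):
    # Locate a button in a running app and invoke its default action directly,
    # with no pixel coordinates or screenshots involved.
    desktop = pyatspi.Registry.getDesktop(0)
    for app in desktop:
        if app.name == app_name:
            button = find_widget(app, "push button", button_label)
            if button is not None:
                action = button.queryAction()  # Action interface: named, invokable actions
                action.doAction(0)             # index 0 is typically "click"/"press"
                return True
    return False

# Lower-level synthetic input also goes through the registry, e.g.:
#   pyatspi.Registry.generateMouseEvent(x, y, "b1c")  # left-click at screen (x, y)
```

A real agent would obviously need more robust matching and error handling, but the point stands: the structured layer supports both observation and direct intervention.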
