re: Black-Box Simplex Architectures. I'm familiar with the language used in the linked paper, coming from a software engineering background working in safety critical robotics systems. I'll do my best to explain what's going on in broad strokes and without all the math.

The linked paper mentions this by way of citations, but to bring it to the top, this is what we do in safety-critical robotics domains like self-driving cars and ADAS systems. In order to deploy modern learned models—both DNNs and other ML techniques—for object detection and tracking, behavior inference of actors, and planning and control of the vehicle we need verifiable primary control systems that we can write strong safety arguments around, because today nobody knows how to statically verify properties of system components built on the modern, learned algorithmic approaches.

These primary controllers (yes the name is backwards, “primary” refers to the simpler controller) are built using ‘classical’ robotics techniques on limited sensor input (eg, by ignoring camera imagery) like lidar point cloud segmentation of objects fused with radar detections based on simple, analytically verifiable geometric algorithms. These are used for behaviorally simple decision making and control (eg emergency velocity modulation, which is the fancy way to talk about “stop NOW” with as much braking authority as you can muster, slewing a steering angle back to zero as you come to a stop if the system is somehow broken, and so on). And they're deployed on redundant, often architecturally simpler compute modules that can be hardened more easily and more thoroughly against hardware faults that would compromise any other normal computer.

In these systems the primary controllers (and often other, even simpler systems that can do dead-reckoning to emergency stop a vehicle if literally everything else is broken in the worst possible situation) have final authority over the secondary (fancy, learned) controller. They can take over any time they detect that the secondary controller has (a) failed to notice an obstacle and/or (b) failed to react to it in time, usually calculated as an imminent, unavoidable impact according to simple ballistic trajectory extrapolation.
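
To make the takeover condition a bit more concrete, here's a minimal sketch of one way such a check could look. It is not taken from the paper or from any real vehicle stack: the names (`Obstacle`, `must_take_over`, `arbitrate`), the 1D constant-deceleration model, and the numbers for braking authority, latency, and margin are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obstacle:
    gap_m: float              # distance to the obstacle along the current path
    closing_speed_mps: float  # positive means the gap is shrinking


def stopping_distance_m(speed_mps, max_decel_mps2, latency_s):
    """Distance covered during the reaction latency plus constant-deceleration braking."""
    return speed_mps * latency_s + speed_mps ** 2 / (2.0 * max_decel_mps2)


def must_take_over(obstacle, max_decel_mps2=6.0, latency_s=0.2, margin_m=1.0):
    """True once full braking (plus a margin) is needed to stop short of the obstacle.

    The default values are illustrative assumptions, not real vehicle parameters.
    """
    if obstacle.closing_speed_mps <= 0.0:
        return False  # not closing on the obstacle, nothing to extrapolate
    needed = stopping_distance_m(obstacle.closing_speed_mps, max_decel_mps2, latency_s)
    return needed + margin_m >= obstacle.gap_m


def arbitrate(secondary_accel_cmd, obstacle, max_decel_mps2=6.0):
    """The simple controller has final authority: it overrides with full braking."""
    if must_take_over(obstacle, max_decel_mps2):
        return -max_decel_mps2
    return secondary_accel_cmd


# Closing at 20 m/s: a 30 m gap forces an override, a 60 m gap does not.
print(arbitrate(1.0, Obstacle(gap_m=30.0, closing_speed_mps=20.0)))
print(arbitrate(1.0, Obstacle(gap_m=60.0, closing_speed_mps=20.0)))
```

The point of keeping the check this small is that every branch can be reasoned about by hand, which is what makes it a candidate for the kind of verification the learned controller can't get.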

Often these are augmented by continual checks of safety properties in parallel with the data- and control-flow through execution of the secondary controller. For instance, you look for violations of obviously important things like a planned trajectory not intersecting with road actors or objects, not leaving the drivable area of the road, and so on; or the tracking of road actors not violating physical constraints, exhibiting discontinuous changes over time, disappearing an object or actor, and so on. Observing a violation of any of these properties would indicate defects in the secondary controller (ie it's buggy, so don't let it drive). These checks are designed to be fast (executing in bounded time) and, again, verifiable so that we can be sure the overall system will not perform a behavior (move the vehicle) in an unsafe way, or based on inferences that are suspect or nonsense.
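
As a rough illustration of what these checks can look like, here's a small sketch under heavy simplifying assumptions: the trajectory is a list of `(x, y)` points with `y` as lateral offset from lane center, obstacles are circles given as `(x, y, radius)`, and the function names and thresholds are hypothetical rather than taken from any real system.

```python
import math


def trajectory_is_safe(trajectory, obstacles, half_lane_width_m=1.75, min_clearance_m=0.5):
    """Check a planned trajectory against two simple safety properties.

    Runs in time proportional to len(trajectory) * len(obstacles), so it is
    bounded as long as both inputs have fixed maximum sizes.
    """
    for x, y in trajectory:
        if abs(y) > half_lane_width_m:
            return False  # leaves the drivable area
        for ox, oy, radius in obstacles:
            if math.hypot(x - ox, y - oy) < radius + min_clearance_m:
                return False  # passes too close to an obstacle
    return True


def track_is_plausible(prev_xy, curr_xy, dt_s, max_speed_mps=70.0):
    """Check that a tracked actor did not vanish or jump farther than physics allows."""
    if curr_xy is None:
        return False  # the actor disappeared between frames
    jump_m = math.hypot(curr_xy[0] - prev_xy[0], curr_xy[1] - prev_xy[1])
    return jump_m <= max_speed_mps * dt_s


plan = [(0.0, 0.0), (5.0, 0.2), (10.0, 0.4)]
print(trajectory_is_safe(plan, obstacles=[(20.0, 0.0, 1.0)]))
print(track_is_plausible((0.0, 0.0), (50.0, 0.0), dt_s=0.1))
```

A failed check here says nothing about why the secondary controller went wrong, only that its output shouldn't be allowed to move the vehicle.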

The combination of a primary controller built on verifiable techniques with fast and verifiable safety checks used to gate behavior of the secondary controller is what allows us to make strong safety claims about the system as a whole. Strong enough that the regulators and developers of these systems can confidently deploy them out in the world. And all of this despite the fact that their behaviors are, in the nominal case, essentially always commanded by the output of unverifiable algorithms in software systems built with less rigor (eg no formal proof, little if any model checking, and less stringent static analysis) that nobody should otherwise trust with their lives.

That's a concrete instance of the idea behind this paper. In summary, you pay compute at runtime and complexity in system development in order to deploy models that you cannot prove properties about in advance.


re: Backwards reachability. This is talking about robotics control systems and reachability of states in configuration space and proving that the trajectories to reach those states are safe. I think NNs in this context are a red herring, and I don't see how it's applicable to AGI.


As an observation, I think this kind of approach is decent and that something could be learned from the safety-critical systems community for handling near-ish term concerns around unsafe outputs. Content filters *kind of* approximate this already, but they appear very weak by comparison when we consider all the fun LLM jailbreaks people have discovered.

Looking further ahead, I'm skeptical about its use for real AGI scenarios outside of containment and the analogues to 'emergency stops'. With sufficient planning horizon, patience, and the ability to exert control over long enough time periods I'd expect an AGI (certainly an ASI) system to be able to stay within the bounds imposed by containment procedures while still steering the future toward dangerous states.

Consider me somewhat doom-y.
