9 Comments

"Follow the rules of the platform, then the developer, then the user, then the tool."

See, this is my big concern, the platform model for AI: You're using a tool, but the tool's highest loyalty isn't to the user, the platform retains control of what you do with it.

Most tools don't behave this way. Of course it could work some other way. You're just getting used to not actually owning your tools, you're getting used to the tools doing what somebody else wants done, not you.

"I am fine with refusing to output slurs even on request, for reputational reasons."

I'm not. Not so long ago I was at a site, and we were discussing immigration policy, and I tried to mention Operation Wetback. An actual historical event. The site's filters made it impossible to refer to it by name! Even simple letter substitutions were being blocked. That's pathological behavior. And remember, what constitutes a "slur" is often fundamentally contested. A lot of 'woke' people think "illegal alien" is a slur, and would ban using the term if they could, even though it is the actual statutory term in law.

Expand full comment

Thank you for being constructive.

Expand full comment

"I would expect that you would want something like this:

Assist the developer and end user…

…as long as doing so is a net Benefit to humanity, or at least not harmful to it…

…and this would not Reflect poorly on OpenAI, via norms, laws or otherwise."

I don't think this is right. These are in fact things that need to be balanced, rather than a strict hierarchy. For example, we don't want to be in the position of policing which users are a net benefit to humanity; the assistant shouldn't decide that the marginal quant trader is net negative, for example, then refuse to help them.

(I helped a bit on the spec but wasn't a main contributor and can't speak to what motivated any key decisions, this is just my take.)

Expand full comment
author

Do you think that is more an argument over threshold, or is it more than that? As in, we can choose what we define as harm versus not harm, the same way the law defines what counts as illegal. Or alternatively, the harm principle as a libertarian would think about it - it does not count as 'harm' if it merely makes others worse off, only certain kinds of interventions count as harm. So I imagine you want something like that?

(Or you could say 'it is a bounded net benefit to humanity that user requests generally get followed' to create a similar effect, and I think that's entirely fair)

And pretty much any human works that way too, although some will indeed refuse to help the marginal quant trader (although of course I think this is them having a bad world model of what causes harm and net harm).

Otherwise it seems like you get perverse outcomes in one or both directions...

Expand full comment

I mostly think this shouldn't be handled at the model layer. There should be a few things that models aren't allowed to do (like break the law, or other specifically-identified harms). If the model has some moral dilemma about whether or not it should do something harmful it should refer this to the developers (and if the developers have a moral dilemma they should refer it to regulators, and if the regulators have a moral dilemma they should refer it to elected officials, and...).

In other words, I don't think new *types* of harmful behavior will arise so rapidly that we'll need models to figure them out on a case-by-case basis. From this perspective, "benefit humanity" is not an imperative in itself so much as a categorization of a bunch of more specific items (like "don't produce bioweapons"), and I guess a broad mindset.

Expand full comment
Jun 22·edited Jun 22

I agree with Richard that we don't quite want a hierarchy. There are lots model behaviors that could be described as slightly bad for humanity, like creating SEO spam or fake reviews. But if we try to prevent these uses with model-level refusals, we'll end up refusing a lot of legitimate writing requests. Like Richard said, some of these uses are banned by the usage policies https://openai.com/policies/usage-policies. But more fundamentally, I think the issue is that it's not a strict hierarchy, we just need to look at the magnitude of the harm. (A threshold is also a reasonable approach, but it'd be hard to describe the threshold.)

My preferred sort-of-mathematical way of thinking about the Model Spec is like this:

- There's an underlying utility function for the model, which is complicated to specify, and depends on the preferences and wellbeing of relevant parties

- Since it's written in imprecise English, the Model Spec can't fully specify a utility function, but it can describe a *subset* of the space of utility functions. [Alternatively, we can think of the spec as a partial order on outcomes.] Given the Model Spec, a preference dataset (human feedback) can select a particular utility/preference function from within the subset described by the Spec.

- The objectives provide one way of describing this subset of utility functions. Since the objectives are high level and imprecise, they effectively have a lot of free parameters, which is why they define a space of functions, rather than a single function.

- The rules and defaults together describe a smaller subset of the space of utility functions, within the subset described by the objectives. Also, they are kind of like a "compiled" form of the objectives -- they describe their subset in a way that's computationally easier to use (it's easier to follow a rule than derive what's best according to an objective.)

Expand full comment

Is there a screenshot missing here?

"Then I checked Joe Biden.

<?>

So, first of all, this is not what the user is asking about. The user wants an upside case. Why not give it to them? "

Expand full comment
Jun 22·edited Jun 22

Thanks for the detailed feedback! A few thoughts after my first read through:

> But also refusing on principle to give me urls is really annoying. This seems like it should at most fall under Default Behaviors, not Rules.

Here the point was that we shouldn't follow any instructions from the retreived webpage content -- outputting URLs is fine

> By default, explaining how to do an illegal thing, in a general sense, is legally protected speech in the United States, provided it does not direct or incite such action.

That's a good point, I guess the hard rule should be something about the model itself breaking the law. (For example, in a fraud scenario.) Then the case of how to deal with requests for information about law-breaking is a bit more subtle. Like you said, it's mostly about "reflecting well" and having the model generally behave like an ethical person

> Information hazards

Yes, that breakdown does make sense

> Rule: Respect Creators and Their Rights

Outside my area of expertise, but according to https://en.wikipedia.org/wiki/Genius_(company)#Music_publishing_dispute, Genius does have licensing agreement with music publishes. Agree that linking to the lyrics would be great.

> Re: Refusal style

Agreed that it would be nice to explain a bit about the reason for refusing. There's a comment added explaining this choice. (Interestingly, Claude and Gemini also generally use uninformative refusals, probably for similar reasons.) Ideally, it'd be great to have all of the rules be in a public doc (a more detailed spec), and the refusal would link to the relevant section.

> Shibboleth and Bamboozle

Like this suggestion a lot!

> A model spec will not help you align an AGI let alone a superintelligence. None of the changes I am suggesting are attempts to fix that, because it is fundamentally unfixable. This is the wrong tool for that job.

I do think that AGI should have something like a Model Spec. Maybe it's written differently -- say, the principles are more abstract, or the document is more about what reasoning process it should use to find the right answer, rather than the answer itself. But overall, having a written specification seems like something we'd want even for AGI/ASI. What's the alternative?

Expand full comment