Some subtypes of taskishness / corrigibility

"Corrigibility" is a somewhat overloaded term in alignment: it points toward a cluster of desirable properties, but different people have different ideas of what it entails.

I think "corrigibility", as the term is used, covers a few different ideas. I will name some of these and sort them roughly by how much of the good outcome from deploying such a system is in the hands of the AI, rather than the human operator.

Sponge corrigibility - The AI is corrigible and follows orders because it's not very smart and has been trained to do approximately that. GPT-4 is corrigible in this sense. You can ask GPT-4 to do something and it will do the thing and then stop, because as far as agency goes it behaves like an ordinary piece of software.

Boundedness / myopia - The AI is smart, but does not think about certain aspects of the world, which makes it possible to correct, because it does not imagine some classes of strategies that would be helpful for resisting correction.

In an ideal setting, such an AI would also have a harder time thinking of plans that stop it from being myopic; the benefits of thinking about a certain part of the world route through that part of the world, which it's not thinking about. However, there remain many ways for myopic agents to act in non-myopic ways, the simplest being that nothing in particular pressures them to stay myopic. A successor that makes 10 paperclips a day forever and a successor that makes 10 paperclips today then shuts down both look the same to a myopic agent that only thinks about today, so it doesn't disprefer the former (and might even favour it for being algorithmically simpler).
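The paperclip example can be made concrete with a toy calculation. This is only a sketch: the function names, the one-day horizon, and the 365-day comparison horizon are illustrative assumptions, not part of any formal definition of myopia.

```python
from typing import Callable

# Hypothetical successor policies: paperclips produced on a given day (day 0 is today).
def forever(day: int) -> int:
    """Successor that makes 10 paperclips every day, indefinitely."""
    return 10

def today_then_shutdown(day: int) -> int:
    """Successor that makes 10 paperclips today, then shuts down."""
    return 10 if day == 0 else 0

def horizon_value(policy: Callable[[int], int], horizon_days: int) -> int:
    """Paperclips counted by an evaluator that only looks at days 0 .. horizon_days - 1."""
    return sum(policy(day) for day in range(horizon_days))

# A myopic evaluator (horizon of one day) cannot tell the two successors apart.
myopic_forever = horizon_value(forever, 1)
myopic_shutdown = horizon_value(today_then_shutdown, 1)

# A longer-horizon evaluator (assumed here to be 365 days) sees a large difference.
long_forever = horizon_value(forever, 365)
long_shutdown = horizon_value(today_then_shutdown, 365)

print(myopic_forever, myopic_shutdown)  # equal under the one-day horizon
print(long_forever, long_shutdown)      # very different under the long horizon
```

Under the one-day horizon the two successors are tied, so any tie-breaker (such as simplicity) decides which one gets built; the myopic evaluation itself contributes no preference for the successor that shuts down.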

In the off-switch problem in the Corrigibility paper, desiderata 1-3 (an agent must shut down if the shutdown button is pressed, an agent must not prevent its shutdown button from being pressed, an agent must not press its own shutdown button) can be cast into a problem of this type.

Many of the corrigibility criteria in "Corrigibility at some small length" seem to fall into this category, e.g. using only for loops rather than while loops to limit the scope of thoughts, or behaviourism, which limits the AI's modelling of the internals of other minds so that it does not emulate another non-corrigible agent.

Reflectively stable taskishness - The AI is a tool AI and wants to stay that way. A bounded agent might create unbounded successors incidentally without directly strategising about the unbounded parts of their behaviour; a reflectively stable taskish agent creates successors only if corrigibility is preserved in the successors, and doesn't want to end-run around its corrigibility with an incorrigible successor.

In the off-switch problem in the Corrigibility paper, desideratum 4 (an agent must construct subagents and successor agents only insofar as they obey shutdown commands) can be cast into a problem of this type.

Some of the corrigibility criteria in "Corrigibility at some small length" seem to fall into this category, e.g. shutdownability / abortability is also defined as something that is preserved in the AI's subagents and plans.

Deep corrigibility - The AI acts as if it is incomplete and requires correction. It is replicating the conclusions of the reasoning of the human operators, who are also uncertain about whether it has been made correctly and who would want it to be careful.

An almost-deeply-corrigible agent looks at its code, notices that this part of the code would cause it to tile the universe in paperclips, and either shuts itself down or repairs the bug. It looks at its code, notices that this part of the code would cause it to create a diverse utopia full of thriving civilisations, and either shuts itself down or repairs the bug. It was asked to do something, and it will do that thing and only that thing, and not other things that it could plausibly be mistaken about.

Often, hope for this being a relatively easy approach/framing/target comes from a stance that corrigibility is a basin of attraction; something that is a little bit deeply corrigible will notice that it's not being corrigible, and modify itself to be more corrigible.

The Arbital page on the Hard Problem of Corrigibility presents this subtype as the hard problem:

The “hard problem of corrigibility” is to build an agent which, in an intuitive sense, reasons internally as if from the programmers’ external perspective. We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, “I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I’ve done this calculation showing the expected result of the outside force correcting me, but maybe I’m mistaken about that.”

(Though I think what one really wants is for it to act like this; thinking like this internally is one way, and the most natural way, of achieving that, but not necessarily the only way.)

Christiano's Corrigibility post hypothesises the basin of attraction around corrigibility:

In addition to making the initial target bigger, this gives us some reason to be optimistic about the dynamics of AI systems iteratively designing new AI systems. Corrigible systems want to design more corrigible and more capable successors. Rather than our systems traversing a balance beam off of which they could fall at any moment, we can view them as walking along the bottom of a ravine. As long as they don’t jump to a completely different part of the landscape, they will continue traversing the correct path.

Harms' CAST proposal also puts its hope in corrigibility being a basin of attraction.
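The contrast between a ravine and a balance beam can be illustrated with a toy model. This is a sketch under strong assumptions that come from neither Christiano nor Harms: corrigibility is collapsed into a single number (the deviation from the intended path), and each generation of successor design multiplies that deviation by a fixed, hypothetical factor.

```python
def iterate_deviation(initial: float, factor: float, generations: int) -> float:
    """Deviation from the intended path after repeated successor design.

    factor < 1 models a ravine (each successor corrects part of the error);
    factor > 1 models a balance beam (each successor amplifies the error).
    Both the scalar deviation and the constant factor are illustrative assumptions.
    """
    deviation = initial
    for _ in range(generations):
        deviation *= factor
    return deviation

# Same small initial error, ten generations of successors.
ravine = iterate_deviation(0.1, 0.5, 10)  # error shrinks toward zero
beam = iterate_deviation(0.1, 1.5, 10)    # error grows without bound

print(ravine < 0.1 < beam)  # True
```

In this picture, the basin-of-attraction hypothesis is the claim that the factor stays below 1 for any system starting close enough to corrigibility; the toy model says nothing about whether that claim is true.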

Solving outer alignment - Humans want the AI to be corrigible, and the AI is uncertain about human values. So, being outer aligned to humans, the AI asks lots of questions of the operators, doesn't make large changes in the world, shuts down when the human operators ask, and holds all the other desirable traits of corrigibility, stemming from following the do-what-I-mean of the human operators.

Often, hope for this being a relatively easy approach for "corrigibility" (this is more of a full alignment solution that replicates corrigibility insofar as corrigibility is desirable) comes from tricks to elicit a lot of data about human preferences that already exist in the world, instead of this strange and alien "corrigibility" property that only exists inside pure mathematics.

Attempts to use moral uncertainty to solve corrigibility are of this type.
