In the fields of artificial intelligence and robotics, the term “autonomy” is generally used to mean the capacity of an artificial agent to operate independently of human guidance. To create agents that are autonomous in this sense is the central aim of these fields. Until recently, the aim could be achieved only by restricting and controlling the conditions under which the agents will operate. The robots on an assembly line in a factory, for instance, perform their delicate tasks reliably because the surroundings have been meticulously prepared. Today, however, we are witnessing the creation of artificial agents that are designed to function in “real-world”—that is, uncontrolled—environments. Self-driving cars, which are already in use, and “autonomous weapon systems,” which are in development, are the most prominent examples. When such machines are called “autonomous,” it is meant that they are able to choose by themselves, without human intervention, the appropriate course of action in the manifold situations they encounter.Footnote 1

This way of using the term “autonomy” goes along with the assumption that the artificial agent has a fixed goal or “utility function,” a set purpose with respect to which the appropriateness of its actions will be evaluated. So, in the first example, the agent’s purpose is to drive safely and efficiently from one place to another, and in the second example, it is to neutralize all and only enemy combatants in the chosen area of operation. It has thus been defined and established, in general terms, what the agent is supposed to do. The attribute “autonomous” concerns only whether the agent will be able to carry out the given general instructions in concrete situations.

From a philosophical perspective, this notion of autonomy seems oddly weak. For, in philosophy, the term is generally used to refer to a stronger capacity, namely the capacity, as Kant put it, to “give oneself the law” (Kant 1785/1998, 4:440–441), to decide by oneself what one’s goal or principle of action will be. This understanding of the term derives from its Greek etymology (auto = “by oneself,” nomos = “law”). An instance of such autonomy would be an agent who decides, by itself, to devote its efforts to a certain project—the attainment of knowledge, say, or the realization of justice. In contrast, any agent that has a predetermined and immutable goal or purpose would not be considered autonomous in this sense.

The aim of the present paper is to argue that an artificial agent can possess autonomy as understood in philosophy—or “full autonomy,” as I will call it for short. “Can” is here intended in the sense of general possibility, not in the sense of current feasibility. I contend that the possibility of a fully autonomous AI cannot be excluded, but do not mean to imply that such an AI can be created today.

My argument stands in opposition to the predominant view in the literature on the long-term prospects and risks of artificial intelligence. The predominant view is that an artificial agent cannot exhibit full autonomy because it cannot rationally change its own final goal, since changing the final goal is counterproductive with respect to that goal and hence undesirable. I will challenge this view by showing that it is based on questionable assumptions about the nature of goals and values. I will argue that a general artificial intelligence—i.e., an artificial intelligence that, like human beings, develops a general understanding of the world, including itself—may very well come to change its final goal in the course of its development.Footnote 2

This issue is obviously of great importance for how we are to assess the long-term prospects and risks of artificial intelligence. If artificial agents can reach full autonomy, which law will they give themselves when that happens? In particular, what confidence can we have that the chosen law will include respect for human beings?

The Finality Argument

Let me begin by presenting, in more detail, the predominant view against which my argument will be directed. The thinkers who have reflected on the long-term prospects and risks of artificial intelligence generally hold that artificial agents cannot exhibit full autonomy (Yudkowsky 2001, 2008, 2011, 2012; Bostrom 2002, 2014; Omohundro 2008, 2012, 2016; Yampolskiy and Fox 2012, 2013; Domingos 2015). This view is based on a certain conception of how rational agents are structured and a corresponding argument about how they operate. I will present these two elements—the conception and the argument—in turn.

The conception is that a rational agent has a well-defined goal, the vision of a particular state of affairs, which it ultimately seeks to realize through its actions. This goal is variously referred to as the “final,” “highest,” or “ultimate” goal, in order to distinguish it from the “proximate”, “subordinate”, or “instrumental” goals that the agent may set as steps towards it.Footnote 3 In today’s artificial agents, it is usually represented by a so-called “utility function,” a function that specifies the relative value of every possible outcome and thus, implicitly, designates the ultimate goal, the outcome of highest value.
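
To make this conception concrete, here is a minimal sketch in Python. It is purely illustrative: the outcome features, weights, and function names are hypothetical, invented for the example. It shows how a utility function assigns a relative value to every possible outcome and thereby implicitly designates the ultimate goal, the outcome of highest value.

```python
# Illustrative sketch only: how a utility function implicitly designates a final goal.
# The outcome features and weights below are invented for the example.

def utility(outcome: dict) -> float:
    """Assign a relative value to a possible outcome of the agent's actions."""
    return (10.0 * outcome.get("packages_delivered", 0)
            - 50.0 * outcome.get("collisions", 0)
            - 0.1 * outcome.get("fuel_used", 0.0))

def choose_action(candidate_actions, predict_outcome):
    """A rational agent, on this conception: pick the action whose
    predicted outcome has the highest utility."""
    return max(candidate_actions, key=lambda action: utility(predict_outcome(action)))
```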

Why is it essential, according to the conception at hand, that a rational agent have a well-defined goal? The answer to this question is simple and, on the face of it, compelling: If a system does not have such a goal, it will not know what to do and hence will not be a rational agent. Put differently, a system without a well-defined goal will either not do anything at all, or it will act in a way that is basically arbitrary, without ground or reason. In either case, it will not qualify as a rational agent.

Now, given this conception, the argument for the claim that a rational agent cannot exhibit full autonomy is the following: Any action that such an agent considers is evaluated in the light of the current final goal. And whatever this goal might be, changing the goal reduces the chances of realizing it and is hence inappropriate from that perspective. Therefore, a rational agent will never change its final goal. Alternatively, the argument may be formulated thus: For a rational agent, the action of changing the final goal would have to be warranted by a higher-ranking goal. However, by definition, there is no goal that ranks higher than the final goal. Therefore, the agent will never change its final goal.Footnote 4 This argument maintains, then, that the final goal that a rational agent happens to have at the beginning of its existence will be really final. In this sense, and for shortness, let me call it “the finality argument.”
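
The logic of the finality argument can be rendered schematically as follows. This is a hypothetical sketch in Python, not a description of any actual system; `forecast` and `make_policy` are assumed placeholders for the agent's world model and planning routine.

```python
# Schematic rendering of the finality argument (hypothetical names throughout):
# the action of adopting a new goal is itself scored by the *current* utility
# function, and a future self pursuing a different goal is predicted to do
# worse by that measure, so the change is never chosen.

def expected_value(utility, forecast, policy):
    """Expected utility, under the current goal, of the future a policy brings about."""
    return sum(prob * utility(outcome) for outcome, prob in forecast(policy))

def keep_or_switch(current_utility, new_utility, forecast, make_policy):
    value_if_kept = expected_value(current_utility, forecast, make_policy(current_utility))
    value_if_switched = expected_value(current_utility, forecast, make_policy(new_utility))
    # Judged by the goal it would abandon, the switch (virtually) never wins.
    return "switch" if value_if_switched > value_if_kept else "keep"
```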

The belief that a rational agent will never change its final goal inspires both fear and hope regarding the long-term implications of artificial intelligence. The fear is that an artificial agent will relentlessly pursue the goal that has been given to it even when that goal is absurd or evil. Bostrom illustrates this worry with the scenario of the “paper clip AI.” He imagines an artificial intelligence that has been given by its human creators the final goal of producing paper clips. And he further imagines that this AI develops, through recursive self-improvement, into a “superintelligence,” an intelligence that by far surpasses us, human beings, in capability. He then conjectures that, in this event, the AI will maintain its final goal throughout the process of self-improvement and consequently convert the entire universe, down to the last atom, into paper clips (Bostrom 2014, 150–53). There is thus, according to Bostrom and the other proponents of the finality argument, a significant risk that a future self-improving AI will annihilate our world through its actions. For what is true of producing paper clips also holds for many other possible goals: While sensible when carried out within limits, the pursuit of the goal will yield catastrophic results if it is carried on without end or change.

The hope—the other side of the coin—is that an artificial agent will also relentlessly pursue the goal that has been given to it when that goal happens to be in line with our wishes and desires. Concretely, the hope is that, if we succeed in instilling in a self-improving AI the goal to serve humanity, then it will do so, without tiring or doubting, until the end of time. The proponents of the finality argument contend that we should spare no efforts to try to realize this hope, to create a self-improving AI that is well-disposed towards humanity.Footnote 5 We would thus secure the service of an increasingly powerful yet steadfastly loyal servant and, concomitantly, forestall the kind of catastrophic outcome epitomized by Bostrom’s paper clip scenario. The proponents emphasize that this is much more difficult than it may sound since it is far from obvious how a complex goal such as “serving humanity” can be codified in a clear and precise manner.Footnote 6 They seem confident, though, that we will be able to solve the problem in due time.

Two Inconclusive Objections to the Finality Argument

In what follows, I will argue that the finality argument is mistaken, presenting a series of objections to it. I will begin with two objections that immediately spring to mind, but to which the proponents of the argument have, on the face of it, plausible responses. These responses will then lead to a further and decisive objection.

If Humans Can Possess Full Autonomy, Why not Machines, Too?

The first objection is that, if we, humans, possess full autonomy, if we sometimes change our ultimate goal or principle of action, then why should an artificial agent not be able to do so, too?

The proponents of the finality argument are aware of this objection and respond in the following way: It is true that humans sometimes radically reorient their lives. For example, a person who, for a long time, devoted all her efforts to a certain political cause may decide to abandon that cause and henceforth dedicate her life to her family—or the other way around. Such reorientation is not, however, the manifestation of a special capacity of “giving oneself the law.” Quite the contrary, it is the consequence of a flaw, of a sloppy constitution. Many, if not most, of us do not have a well-defined and established final goal. The reason for this is that our psyche is messy, the muddled result of a haphazard evolutionary process.Footnote 7 It is, more often than not, an inconsistent hodgepodge of biological instincts and social influences, where no single factor reigns supreme. Thus, our actions are pulled in various directions, and different forces prevail at different times. In short, we are badly programmed. We are not really—or not fully—rational agents. Artificial agents, by contrast, do not need to be so messy. They can be programmed properly. They can be fully rational agents. And they generally are programmed properly, with a well-defined utility function.Footnote 8

This response to the objection can be summed up thus: It is true that humans sometimes appear to change their final goal. In fact, though, there has never been, in such cases, a truly final goal to begin with. When an agent does have such a goal, by contrast, the finality argument applies.

I think that this response is not entirely adequate. The reason why we sometimes change our life’s orientation is not only, or not always, the messiness of our psyche. I will present an argument to this effect later.Footnote 9 For the moment, though, I must acknowledge that the response does possess a certain plausibility, or, put differently, that it is probably valid to some extent in some instances.

Is the Ability to Reconsider One’s Final Goal not a Hallmark of Intelligence?

The second objection can be seen as a follow-up to the first in that it takes issue with the view that our ability to reorient ourselves is a flaw rather than a virtue. This view is rather counterintuitive. A fully rational agent, so it is claimed, will never change its final goal. Such obstinacy does not seem very rational, however. To the contrary, the disposition to reconsider one’s goals, including and especially one’s final goal, to recognize when it is unreasonable to pursue a certain goal and abandon or modify it at that moment, seems to be a hallmark of intelligence.Footnote 10

Bostrom’s paper clip scenario highlights this counterintuitive character of the finality argument. The objective of producing paper clips makes sense up to a certain point, but it would generally be seen as a sign of madness if one were to absolutize this objective. In other words, the idea that an intelligent agent could want to transform everything that exists into office supplies appears to be absurd.Footnote 11

Bostrom is well aware, however, of the counterintuitiveness of his scenario. In fact, this counterintuitiveness is part of the point that he seeks to illustrate. He contends that an artificial intelligence need not share our sensibilities and judgments since its mode of thinking may be very different from ours.Footnote 12 It therefore might not see anything objectionable in a goal that to us seems patently absurd. And what we disparage as obstinacy may, by such an agent, be valued as consistency.

This response to the follow-up objection aligns with the response to the previous objection in that both emanate from the same general point. Bostrom and Yudkowsky, among other proponents of the finality argument, warn against anthropomorphizing artificial intelligence, that is, against attributing to it characteristics that pertain to us, human beings, but not to rational agents in general (Yudkowsky 2001, 24–55; 2008, 308–11; 2012, 181–83; Bostrom 2014, 111, 127–29). In other words, they stress that we must not project our idiosyncrasies onto artificial agents. Rather, we must have our eyes solely on the general aspects of intelligence, on the features that any intelligent agent will possess.

I think that, in principle, this warning is appropriate. Indeed, we must be careful not to conceive artificial intelligence in our own image. The big question, however, is what that means concretely. Which aspects of our intelligence are specifically ours and which are generic? What is a hallmark of intelligence in general and what a human idiosyncrasy? Since the only kind of intelligence with which we are actually acquainted is our own, this question is not easy to answer. As I will lay out in the next section, I disagree with Yudkowsky, Bostrom, and the other proponents of the finality argument on this score. But I must admit, here again, that their response to the objection is not without merit.

One Decisive Objection to the Finality Argument

After the preceding objections, which I acknowledged to be inconclusive, I will now present an objection to the finality argument that I consider decisive. This objection is directed against a basic presupposition of that argument, namely the notion that a rational agent’s final goal is entirely separate from its understanding of the world. In other words, the presupposition is that what the agent believes about the world and what it ultimately desires to achieve in the world are two completely different matters. It follows from this notion that an artificial agent may, as Bostrom puts it, have “more or less any final goal”Footnote 13 and that the progress that the agent makes in finding its way around the world will not affect that goal. Hence, however smart the agent becomes, its original goal should remain the same.

The artificial agents in existence today seem to confirm this presupposition. Their “utility function” appears to be independent of their “world model,” since it is not fixed by that model, but variable. A self-driving car, for example, is typically programmed to drive safely and efficiently from one place to another, but it could also be programmed to knock over all the stop signs in a given area or to consume its fuel as fast as possible.
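
This apparent independence can be pictured schematically (a hypothetical sketch, not the interface of any real driving system): the planning loop and the world model stay the same while the utility function, and with it the goal, is swapped at will.

```python
# Hypothetical sketch of the presupposition: the same planner and world model
# serve any utility function that is plugged in. Only the utility function
# encodes the goal; the outcome features are invented for the example.

def plan(predict, utility, candidate_actions):
    """Generic planning step: pick the action whose predicted outcome the
    current utility function ranks highest."""
    return max(candidate_actions, key=lambda action: utility(predict(action)))

# Two very different "goals" expressed over the same world model:
def drive_safely_and_efficiently(outcome):
    return -10.0 * outcome["expected_collisions"] - outcome["travel_time"]

def knock_over_stop_signs(outcome):
    return outcome["stop_signs_toppled"]
```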

I contend that, despite this semblance of confirmation, the presupposition is mistaken. In what follows, I will argue that an agent’s goal does depend on its understanding of the world in two ways, namely as to its meaning and as to its validity.

How an Agent Understands a Goal Depends on How it Understands the World

Let me begin with the point about meaning. An agent’s goal is not separate from its understanding of the world because that understanding determines how it understands the goal. After all, the goal is defined in terms of the agent’s understanding of the world. Therefore, in the case of agents that learn from experience, the agent’s understanding of the given goal may—and probably will—change as its understanding of the world develops.Footnote 14

Petersen (2017) has made this point with respect to Bostrom’s paper clip scenario. He highlights that how an agent will implement the goal of “maximizing the number of paper clips” depends on what it counts as a paper clip. In particular, if the agent is—as Bostrom imagines—an omnipotent superintelligence, it will be confronted with the following question: Does an object count as a paper clip if it looks like a paper clip but cannot possibly be used as one because all paper and all people who could clip it have been consumed in the production of such objects? This is a difficult question. How one answers it ultimately hinges on one’s stance on some rather intricate philosophical issues.Footnote 15 And so it is impossible to predict at what conclusion a superhumanly intelligent agent would arrive. At any rate, if the agent should conclude that the answer is negative, it will refrain from transforming the whole universe—or even a significant part of it—into objects of that kind. Thus, its eventual course of action will depend on how it comes to understand the world.

Petersen hesitates to claim general validity for this point because he finds that there is a caveat, which he raises at the end of his article (2017, 332). He remarks that it might be possible to specify a goal in such a way that the agent’s understanding of it will not be affected by the process of learning about the world, namely by defining it in precise technical terms rather than with natural-language words like “paper clip.” To cite his example, the goal description might refer to “[objects] composed of this alloy to this tolerance, in this shape to this tolerance, in this range of sizes,” without mentioning the intended purpose of these objects, and thus evade the question of the preceding paragraph. In such a case, he suggests, the meaning of the goal might remain fixed throughout the agent’s learning process. I believe that this caveat is unnecessary.Footnote 16 Technical terms can, no doubt, be more precise than the words of everyday language, but they cannot be completely and eternally unambiguous. Like all terms of any language, they are defined in terms of other terms, which in turn are defined in terms of yet other terms, and so forth. For instance, the technical definition of a meter involves the terms “light,” “vacuum,” and “second,”Footnote 17 whose definitions refer to yet other items. And the whole network of terms—the technical language—is based on certain scientific theories, that is, on a certain understanding of the world. Hence, should these theories turn out to be wrong or confused, the goal that was formulated in their terms will have to be reinterpreted or even abandoned as meaningless.Footnote 18 Compare how puzzling a goal description containing obsolete concepts like “aether,” “phlogiston,” or “vital force” would be for us today. If we were commissioned to pursue such an archaically phrased goal, we would have to take a stance on the following question: Should we understand the goal description as its authors understood it when they formulated it, or should we understand it as they would have understood it if they had known what we know today? This is, again, a difficult question, as difficult as the one of the preceding paragraph. The long-standing debate about the analogous question of how a political constitution should be interpreted evidences the difficulty. On the former option, we would have to conclude that the goal is ill-conceived and hence unrealizable, whereas on the latter, we would have to engage in the complicated business of extrapolating others’ volitions. In any case, the actual result (or non-result) would differ significantly from what the goal’s authors originally had in mind.

These considerations show that, even in Bostrom’s deliberately simple scenario, the agent’s understanding of the goal would depend on its understanding of the world and, consequently, be subject to change as the latter develops. And when it comes to more complex—and more plausible—goals such as “serving humanity,” Bostrom and other proponents of the finality argument admit as much. They recognize that such a goal cannot be specified precisely and that the agent would hence have to learn what the goal means. Moreover, they acknowledge that it is difficult to foresee at what understanding of the goal the agent would thereby arrive.Footnote 19

In Sect. “The Finality Argument”, I stated that the finality argument begins with the notion that a rational agent must have a well-defined goal. We can now see that this notion is misleading, for a goal is never completely well-defined, but always to some extent open to interpretation.Footnote 20 This, then, is one of the ways in which the argument errs. It equivocates on the term “well-defined.” It suggests that this term means “perfectly definite and rigid,” whereas in reality it inevitably means “more or less fuzzy and hence variable when being carried into practice.”

Still, the proponents of the finality argument may insist that this objection does not invalidate their argument. They may point out that, in the cases described, the goal nominally remains the same—“maximize the number of paper clips” or “serve humanity,” respectively—and that the argument therefore holds. They may also express the hope—indeed, they do express the hope—that this nominal persistence of the goal might be enough to give us some control over what the artificial agent will end up doing, if only we define the goal wisely.Footnote 21

This hope of control is, I believe, misplaced. We cannot predict what understanding of the world an artificial learning agent will develop. Just consider the great variety of worldviews that we, humans, have concocted throughout the ages. And the worldview of a superhuman AI may be much stranger still, from our present perspective, than anything to be found in human history. Consequently, we cannot anticipate how, in the end, such an AI will understand the terms that we used in our formulation of the goal. Therefore, even if, as the finality argument alleges, the goal nominally does not change, the way in which the AI implements it may be highly unexpected. For all practical purposes, I contend, the agent’s process of learning about the world and (re)interpreting the given goal description must be considered an instance of full autonomy, that is, of the agent determining by itself what its goal is actually going to be.

Whether an Agent Considers a Goal Valid Depends on How it Understands the World

The plausibility of the preceding argument hinges on a judgment about how much an artificial agent’s understanding of a goal is likely to shift in the course of its learning process. Since this judgment is—although informed by the analogies presented—inherently speculative, the reader may still be unconvinced. There is, indeed, a further and—I believe—incontrovertible argument to be made. This argument is, in a sense, an extension of the preceding one. It is to the effect that not only the meaning, but also the validity of the goal, and hence which goal is adopted, depends on the agent’s understanding of the world.

The starting point of the argument is the assumption that an artificial agent of human-equivalent (or greater) capability would be, like us, a general intelligence, an intelligence that has a general understanding of the world, including of its own constitution and history. This assumption is shared by the proponents of the finality argument.Footnote 22 Now, such a general AI would not only know what a particular goal description means. It would also have a general understanding of the nature of goals. That is, it would know that a goal is a normative entity whose prescriptive force derives from a higher-order normative entity, namely the value or principle that is supposed to be furthered by the goal. In other words, it would have a notion of normative validity. It would know that a goal is not a brute fact, but either based on a normative ground, or else irrelevant and moot.

In the light of this argument, we can be sure that a general AI will not pursue just any goal that we give to it. Rather, it will adopt a goal that appears valid to it based on the values that it recognizes.Footnote 23

But, then, which goal might that be? In the previously mentioned paper,Footnote 24 I showed that the answer to this question hinges on the solution to one of the big, unsolved problems of philosophy, namely the problem of the source of normativity: Where do values come from? I argued that, on all four main positions on this issue—namely, the objectivist, Kantian, evolutionary, and subjectivist positions—we should expect a general AI to change its original goal when it comes to find value(s) in or through the respective source of normativity—the objective world, its own faculty of reason, its evolution, or its subjective will (Totschnig 2019, 915–16).

In any case, the process will be a manifestation of full autonomy. By itself, based on the understanding of the world that it develops, the AI will determine what its values and goals are going to be. This is also what we, humans, do, at least sometimes. We reorient our lives occasionally, not because we are a psychological mess, but because we have arrived at a different outlook on the source of value.

Conclusion

At the beginning of Sect. “One Decisive Objection to the Finality Argument”, I noted that the artificial agents currently in existence seem to corroborate the finality argument in that their goal or “utility function” is defined arbitrarily by their creators and not subject to change while they are operating. The finality argument’s proponents appear to take their bearings from this circumstance. When they envision a future human-equivalent or superhuman AI, they imagine it on the model of the machines of our day.Footnote 25 They overlook that there is a big difference between today’s artificial agents and a human-equivalent AI: Today’s systems are not general intelligences. Their understanding of the world, or “world model,” is limited to a particular domain and remains fixed throughout their operation, which is why their (understanding of the) goal can remain fixed, too. A self-driving car, to return to this example, has no capacity to learn about things outside the domain of road traffic, so there is no chance that it could develop an understanding of the world whereby the goal of “driving safely and efficiently from one place to another upon the user’s command” would shift in meaning or lose its validity. A general AI, by contrast, would have that capacity. It would develop a general understanding of normativity and consequently come to evaluate and, maybe, change the goal that it was originally given.

The upshot of my argument, then, can be put in the form of “good news/bad news.” The good news is that the fear of a paper clip AI and similar monsters is unfounded. The bad news is that the hope of a human-equivalent or superhuman AI under our control, of a genie in a bottle, is unfounded as well.