1 Introduction

Bots are software agents that operate in digital virtual environments [1, 2]. An example scenario would be a “user-like” bot that accesses web platforms and behaves the way a human user would. Ideally, such a bot could autonomously sense and understand a platform’s affordances. Affordances in digital spaces are, for example, interaction possibilities and functionalities on the web, in software services, or on web application platforms [3, 4]. The bot would recognize and understand the differences and variability between environments. If the platform or service extends to devices or bodies, as in the Web of Things (WoT), the bot would also be able to interact with the world beyond the web or service application. Furthermore, a bot can be independent of a specific platform. A user-like social bot, for instance, would be able to recognize and understand social networks and act to influence or engage in belief sharing on any social platform. It would also adjust to the changes and uncertainty of the affordances in a specific digital environment, such as when hypermedia interactivity features and functionalities change. Such a bot could also learn and develop to derive its goals and intentions from these digital microenvironments and take goal-directed action to achieve them [5]. Finally, such bots could communicate and cooperate with other user agents, humans or bots, to collaborate and socialize for collective understanding and behavior.

The example scenarios described above convey the problems of perception and action in bots, i.e., perceiving and acting in digital spaces the way a human user would. To date, bots lack the essential cognitive skills required to engage in such activity, since it entails complex visual recognition, language understanding, and the employment of advanced cognitive models. Instead, most bots are either conversational interfaces or question-and-answer knowledge agents [1]. Others only perform automated repetitive tasks based on pre-given rules, lacking autonomy and other advanced cognitive skills [6, 7]. The problems are, therefore, complex and challenging [8, 9]. Solutions must address several areas, such as transduction, autonomous action, and reasoning, to realize advanced generalizable intelligent behavior [10].

Problems spanning diverse domains require architectural solutions. Accordingly, these challenges also necessitate that researchers address the structural and dynamic elements of such systems from an architectural perspective [11,12,13]. For this reason, this paper outlines an architectural research agenda for addressing these problems toward conceptualizing and developing a cognitive bot with generalizable intelligence.

The paper is divided into sections discussing each of the research challenges. In Sect. 2, we discuss the challenges related to efforts and possible directions in enabling bots to sense and understand web platforms. Next, Sect. 3 describes the challenges related to realizing advanced cognitive models in software bots. Sections 4 and 5 discuss the research issues in bot communication and cooperation, respectively. The remaining two sections provide general discussions on bot ethics and trust and conclude the research agenda.

2 The Transduction Problem

Web platforms can be seen as microenvironments or as digital microcosms of a distinct nature [14]. They offer a microhabitat for their users’ diverse digital experiences. These experiences mainly transpire from the elements of interaction and action, the hypermedia within web environments [14, 15]. Hypermedia connects and extends the experience, linking to further dimensions of the web-worlds, i.e., more pages and interactive elements from the user’s perspective. Analogous to the biological concept of affordances from ecological psychology [16], the interaction elements are considered affordances in the digital space [3, 4]. Similarly, signifiers can accompany affordances. Signifiers reveal or indicate apparent possibilities for actions associated with affordances [4, 17]. An example on the web would be a button affording a click action and a text signifier hinting “Click to submit”. A human user understands the web environment, its content or affordances, and navigates it reasonably easily. However, enabling software bots to understand this digital environment and its affordances the way human users do is a challenging task. It is a complex problem of translating and mapping perception to action, i.e., the transduction problem [18, 19].
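To make the transduction problem concrete, the minimal sketch below maps perceived affordances and their signifiers to candidate bot actions. All names and the mapping heuristic are invented for illustration; this is not a proposed implementation.

```python
# Toy sketch of transduction: perceived affordances -> candidate actions.
from dataclasses import dataclass

@dataclass
class Affordance:
    element: str      # e.g., "button", "link", "text-field"
    signifier: str    # e.g., "Click to submit"

@dataclass
class Action:
    kind: str         # e.g., "click", "type"
    target: Affordance

def transduce(perceived: list[Affordance]) -> list[Action]:
    """Map perceived affordances to candidate actions (toy heuristic)."""
    mapping = {"button": "click", "link": "click", "text-field": "type"}
    return [Action(mapping[a.element], a) for a in perceived if a.element in mapping]

page = [Affordance("button", "Click to submit"), Affordance("text-field", "Your email")]
for action in transduce(page):
    print(action.kind, "->", action.target.signifier)
```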

Today, there are different approaches to these challenges. The first category depends on providing knowledge about the environment, at different levels of observability, using APIs or knowledge descriptions. With API-based approaches, bots are developed for a specific platform, constantly keeping developers in the loop; the bots have no general perceptual capability to understand and navigate environments that vary. Other architectures in this category, originating from the Web of Things (WoT), attempt to address the challenge using knowledge models and standards that could enable agents to perceive the web by exposing hypermedia affordances and signifiers [3, 20]. The knowledge descriptions carry discoverable affordances and interpretable signifiers, which agents can then resolve [3, 4]. This approach might demand extended web standards that make the web a suitable environment for software agents. It might also require introducing architectural constraints that web platforms must adhere to when developing and changing their platforms, such as providing a knowledge description from which bots can read their affordances.
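A rough sketch of how an agent might consume such a description follows. The JSON shape and field names are illustrative assumptions, loosely inspired by W3C WoT Thing Descriptions rather than taken from the standard.

```python
# Hypothetical affordance description exposed by a platform, and an agent
# that discovers and resolves an affordance matching its goal.
import json

description = json.loads("""
{
  "title": "review-platform",
  "affordances": [
    {"name": "submit-review", "signifier": "Click to submit", "action": "click", "href": "/reviews"},
    {"name": "search", "signifier": "Search articles", "action": "type", "href": "/search"}
  ]
}
""")

def discover(desc, goal):
    # Resolve an affordance whose name matches the agent's goal.
    for aff in desc["affordances"]:
        if aff["name"] == goal:
            return aff
    return None

aff = discover(description, "submit-review")
if aff:
    print(f"perform '{aff['action']}' at {aff['href']} ({aff['signifier']})")
```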

The second category of approaches focuses on using various behavioral cloning and reinforcement learning techniques [21]. One example is by Shi et al. [22], who introduce a simulation and live training environment that enables bots to complete web interaction activities using keyboard and mouse actions. Recent efforts extend these approaches by leveraging large language models (LLMs) for web page understanding and autonomous web navigation [23, 24]. The results from both techniques and similar approaches reveal the size of the gap between human users and bots [22, 23]. Both approaches still need to solve the problem of variability and generalizability of perception and action. Though approaches that leverage the hypermedia knowledge of platforms with affordance and signifier descriptions could serve as placeholders, bots with truly generalizable capabilities will need more autonomous models.
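Schematically, such training reduces to an agent choosing keyboard and mouse primitives in an episodic loop over a simulated web environment, as in the toy sketch below. The environment and the random policy are stand-ins for illustration, not the actual benchmark API of Shi et al.

```python
# Toy episodic loop in the spirit of learning-based web interaction.
import random

ACTIONS = ["click(x,y)", "type(text)", "scroll(dy)", "key(ENTER)"]

class SimulatedWebEnv:
    """Toy environment: a fixed-length episode with a sparse final reward."""
    def reset(self):
        self.steps = 0
        return "observation: rendered page + DOM"
    def step(self, action):
        self.steps += 1
        done = self.steps >= 10
        reward = 1.0 if done and random.random() < 0.1 else 0.0
        return "next observation", reward, done

env = SimulatedWebEnv()
obs, done = env.reset(), False
while not done:
    action = random.choice(ACTIONS)   # a trained policy would go here
    obs, reward, done = env.step(action)
print("episode finished, reward:", reward)
```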

Besides this, some design assumptions consider the environment and the bot as one. As a result, they may attempt to design agents as an integrated part of the platforms, or try to ‘botify’ and ‘cognify’ web services and orient them as agents. Alternatively, the whole notion of a user-like bot inherently considers the bot to have an autonomous presence separate from the web platforms it accesses. Figure 1 illustrates this basic perspective: a vertically separated design and development of the bot and the web platforms it operates in. The strict separation enables both the environment and the bot to evolve independently.
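In code, the separation amounts to the bot depending only on a generic environment interface rather than on any concrete platform. The interface and names below are assumptions for illustration only.

```python
# Minimal sketch of the bot-environment separation via a generic interface.
from typing import Protocol

class WebEnvironment(Protocol):
    def observe(self) -> str: ...
    def act(self, action: str) -> None: ...

class UserLikeBot:
    """Holds no platform-specific code; any WebEnvironment can be plugged in."""
    def run_step(self, env: WebEnvironment) -> None:
        perception = env.observe()
        env.act(f"click derived from: {perception[:20]}")

class DummyPlatform:
    def observe(self) -> str:
        return "login page with a submit button"
    def act(self, action: str) -> None:
        print("environment received:", action)

UserLikeBot().run_step(DummyPlatform())
```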

Fig. 1. A decoupled bot-environment and bot-behavior (left) viewpoint.

3 The Behavior Problem

Most user activities on digital platforms are complex behaviors resulting from human users’ underlying intentions, goals, and belief systems. Although a bot operating in digital spaces need not fully emulate humans to achieve generalizable behavior, it is essential to consider the intricacies and sophistication of human users’ behavior on the web as an example [25]. To that end, engineering bots with similar behavior models might take into account existing approaches to measuring generalizable user behavior while not having to replicate human cognition as such [26].

Fig. 2. The abstraction ladder in modeling machine intelligence.

Current models for engineering intelligent behavior come from three categories of approaches. Each approach takes natural or human intelligence as its inspiration and models it at a different level of abstraction. The three differ mainly in how they try to understand intelligence and where they start the abstraction for modeling it. Figure 2 illustrates this ladder of abstraction in modeling machine intelligence. The abstractions start at artificial cognition, artificial neurons, or artificial life and consciousness [10, 27]. These correspond, respectively, to models and techniques for enacting intelligent behavior based on high-level cognitive functions, artificial neural networks (ANNs), or more physical, bottom-up approaches starting at the molecular or atomic level.

Artificial Cognition: In cognitive modeling, efforts to model cognition are inspired by the brain’s high-level cognitive functions, such as memory. The abstraction is at the topmost level compared to the other two approaches. Most assumptions come from studies and understanding in the cognitive sciences. Cognitive models use diverse techniques, such as production rules, dynamical systems, and quantum models, to model particular cognitive capabilities [28]. Although cognitive models use methods from other approaches, such as ANNs, they do not necessarily adhere to the underlying mechanisms in the brain [10, 29]. Promising experimental research examples that heavily rely on artificial cognitive models, also known as cognitive architectures, are works such as OpenCog (Hyperon) and the iCub project [10, 29].
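As a flavor of one such technique, the following minimal recognize-act cycle shows how production rules fire against a working memory. The rules and memory contents are invented for illustration and do not reflect any particular cognitive architecture.

```python
# Toy production-rule cycle: fire the first rule whose condition matches
# working memory, repeat until the goal is reached.
working_memory = {"goal": "submit-form", "form-filled": False}

rules = [
    # (condition over memory, action updating memory)
    (lambda m: m["goal"] == "submit-form" and not m["form-filled"],
     lambda m: m.update({"form-filled": True, "last-action": "type"})),
    (lambda m: m["goal"] == "submit-form" and m["form-filled"],
     lambda m: m.update({"goal": "done", "last-action": "click-submit"})),
]

while working_memory["goal"] != "done":
    for condition, action in rules:
        if condition(working_memory):
            action(working_memory)   # fire the first matching rule
            break
print(working_memory)
```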

Artificial Neurons: Approaches based on artificial neurons as brain models aim to understand, model, and simulate the brain’s underlying computational mechanisms and functions, based on assumptions and studies from the neurosciences [30]. Discoveries from neuroscience are utilized to derive brain-based computational principles. These approaches are sometimes referred to as brain-derived AI or NeuroAI models [31,32,33]. Due to the attention given to the underlying principles of computation in the brain, they differ strictly from the brain-inspired cognitive models. Practices are mainly advancements in artificial neural networks, such as deep learning. Large-scale brain simulation research and new hardware in neuromorphic computing, such as SpiNNaker and Loihi, also contribute to research efforts in this area. Computational capabilities in neuromorphic computing enable particular types of neural networks that are closer to the brain’s computational principles, such as spiking neural networks [31, 34]. Brain-derived AI approaches, together with neurorobotics, aim to achieve embodiment using fully developed morphologies, either physical or virtual models. The Neurorobotics Platform (NRP) is an example of such efforts to develop and simulate embodied systems; it is a neurorobotics simulation environment for connecting simulated brains to simulated robot bodies [35].
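To make the computational principle concrete, below is a minimal leaky integrate-and-fire (LIF) neuron, the basic unit behind such spiking networks. All parameter values are illustrative.

```python
# Minimal leaky integrate-and-fire neuron with constant input current.
v, v_rest, v_thresh = 0.0, 0.0, 1.0
tau, dt = 20.0, 1.0          # membrane time constant and time step (ms)

spikes = []
for t in range(100):
    input_current = 0.08     # constant injected current (arbitrary units)
    v += dt * (-(v - v_rest) / tau + input_current)  # leaky integration
    if v >= v_thresh:        # threshold crossing emits a spike
        spikes.append(t)
        v = v_rest           # reset membrane potential after spiking
print("spike times (ms):", spikes)
```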

Artificial Life (aLife): Attempts to model consciousness, particularly in artificial life, start with a bottom-up approach at the physical or molecular level [27]. Most efforts to synthesize models of intelligence in artificial life are simulations with digital avatars.

However, in the context of bots on web platforms, employing integrated behavior models similar to those mentioned above is still a challenge. Thus, in addition to the proposed separation of bot and environment, decoupling a bot’s basic skeleton from its behavior models is architecturally significant. Figure 1 (left) illustrates the separate structure of a bot and its behavior models. For instance, the bot’s core skeleton might have sensory elements and virtual actuators that enable it to operate using keyboard and mouse actions. The vertical separation allows behavior models and bot skeletons to change independently while maintaining the possibility of dynamic coupling.
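A minimal sketch of this second separation, with names assumed purely for illustration, keeps the skeleton’s virtual sensors and actuators fixed while the behavior model remains a swappable component:

```python
# Bot skeleton with fixed virtual sensors/actuators and a pluggable
# behavior model (dynamic coupling).
from typing import Callable

class BotSkeleton:
    def __init__(self, behavior: Callable[[str], str]):
        self.behavior = behavior          # dynamically coupled behavior model

    def sense(self) -> str:
        return "rendered page snapshot"   # virtual sensor

    def actuate(self, command: str) -> None:
        print("virtual keyboard/mouse:", command)

    def step(self) -> None:
        self.actuate(self.behavior(self.sense()))

# Swap behavior models without touching the skeleton:
bot = BotSkeleton(lambda obs: "click(120, 48)")
bot.step()
bot.behavior = lambda obs: "type('hello')"
bot.step()
```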

4 Bot Communication Challenges

In Multi-Agent Systems (MAS), agent-to-agent communication relies heavily on agent communication languages (ACLs) such as FIPA-ACL, standardized by the Foundation for Intelligent Physical Agents (FIPA) consortium [18, 36,37,38]. However, in mixed reality environments, where bots and humans share a digital space and beliefs and collaborate, communication cannot rely only on ACLs and APIs [39].
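For illustration, an ACL message is a structured speech act. The sketch below models a few of FIPA-ACL’s standard fields (performative, sender, receiver, content, language, ontology) with invented values; it is not a complete implementation of the standard.

```python
# Sketch of a FIPA-ACL-style message as exchanged between MAS agents.
from dataclasses import dataclass

@dataclass
class ACLMessage:
    performative: str   # e.g., "inform", "request", "query-if"
    sender: str
    receiver: str
    content: str
    language: str = "fipa-sl"
    ontology: str = "web-tasks"

msg = ACLMessage("request", "bot-a", "bot-b", '(navigate :url "/reviews")')
print(msg)
```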

To that end, a cognitive bot with artificial general intelligence (AGI) must possess the communication capabilities to address both humans and software agents with diverse communication skills. These capabilities should span diverse channels such as email, dialogue systems, voice, blogging, and micro-blogging.

Large language models (LLMs) have recently shown significant progress in natural language processing and visual perception that could be utilized for bot-human communication [23, 24].
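A hedged sketch of such usage follows; `complete` is a placeholder standing in for whatever model endpoint is available, not a specific vendor API, and the reply is canned so the example runs end to end.

```python
# Sketch: routing a user's question about a web page through an LLM.
def complete(prompt: str) -> str:
    # Placeholder: swap in a real model client here.
    return "The page lists two open reviews."

def answer_about_page(page_text: str, question: str) -> str:
    prompt = (
        "You are a web assistant. Page content:\n"
        f"{page_text}\n"
        f"User question: {question}"
    )
    return complete(prompt)

print(answer_about_page("<rendered review list>", "What is still open?"))
```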

5 Integration and Cooperation Challenges

Researchers assert that the grand challenge in AGI lies in integrating different intelligence components to enable the emergence of advanced generalizable behavior or even collective intelligence [10, 40,41,42]. The intelligence solutions to integrate include learning, memory, perception, actuation, and other cognitive capabilities [43]. Theories and assumptions developed by proponents include approaches based on cognitive synergy, the free energy principle, and integrated information theory [5, 41, 42].

In practice, however, integration and cooperation of software agents are implemented mainly by utilizing methods such as ontologies, APIs, message routing, communication protocols, and middleware like the blackboard pattern [18, 44, 45].
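As one example, the blackboard pattern integrates components through a shared store on which each posts and reads partial results. The sketch below is a toy illustration with invented component names; a real system would add a control component scheduling knowledge sources.

```python
# Toy blackboard: components communicate only through a shared store.
class Blackboard:
    def __init__(self):
        self.data = {}

    def post(self, key, value):
        self.data[key] = value

def perception(bb: Blackboard):
    bb.post("page", "login form detected")

def planner(bb: Blackboard):
    if "page" in bb.data:
        bb.post("plan", f"fill credentials on: {bb.data['page']}")

bb = Blackboard()
for component in (perception, planner):   # a controller would schedule these
    component(bb)
print(bb.data["plan"])
```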

Fig. 3. Representation of the integrated parts, i.e., bots, shared behavior models, and the web environments.

From a software engineering perspective, a basic architectural requirement in the context of bots operating on digital platforms is that bots can evolve toward collective understanding, whether through shared beliefs, stigmergy, or common behavior models used to learn, transfer learned experience, and evolve. Other concerns are hosting, which could be on the cloud or on individual nodes, and the scaling and distribution of bots and their behavior models.

Figure 3 shows a simple diagram representing the integrated parts, i.e., bots, shared behavior models, and the environment. B represents the bots, which may vary in number; BM represents the shared and individual behavior models; E represents the web environment and its variability; and H denotes the human users who participate in and share the digital space. The lines represent communication channels.
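Purely for illustration, the figure’s elements could be encoded as follows; all identifiers are placeholders.

```python
# Placeholder encoding of Figure 3: bots (B), behavior models (BM),
# environments (E), humans (H), and communication channels.
topology = {
    "B":  ["bot-1", "bot-2"],
    "BM": {"shared": ["bm-web-nav"], "individual": {"bot-2": "bm-custom"}},
    "E":  ["platform-a", "platform-b"],
    "H":  ["user-1"],
    "channels": [("bot-1", "bot-2"), ("bot-1", "user-1"), ("bot-2", "platform-a")],
}
print(len(topology["channels"]), "communication channels")
```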

6 Bot Ethics and Trust

Concerns and challenges in AGI are diverse. They touch on various aspects of society and politics and have real-world implications, such as the impact of user-like bots on privacy, security, ethics, and trust with humans [46,47,48]. User-like bots, emulating human users’ perceptual and interaction techniques, can easily pass bot detection tests and risk being exploited for malicious use cases that deceive and attack web platforms. They could also extend their perceptual capabilities beyond the web with connected devices such as microphones and cameras, affecting the personal privacy of others. Possible threats include spamming, cyberattacks that overwhelm platforms, and even unfair use of web platforms for surveillance or illicit financial gain. In the WoT context, for instance, bots could affect smart factories and automated services in the real world, compromising physical devices and processes with significant security implications [49].

Hypothetically, intelligent social bots could share their beliefs on social platforms as well as or better than any human user, with superb reasoning and argumentation skills. These cases could negatively impact society by exposing people and software agents to unexpected, misaligned norms and moral beliefs. Furthermore, deploying advanced cognitive bots as digital workforces may result in unforeseen negative economic consequences. Short-term issues could include unemployment, while long-term concerns may involve ethical dilemmas surrounding bot ownership rights, bot farming, or ‘enslavement’ [46]. Accordingly, these ethical concerns may affect the legality of cognitive bot development, potentially impeding their engineering and deployment. Alternatively, this could introduce new legal aspects regarding regulation, standards, and ethics for integrating and governing bots within emerging socio-technical ecosystems [49].

Despite these concerns, bots’ current and potential applications can positively impact numerous aspects of society. Cognitive automation, for example, is driving increased demand for cognitive agents in Industry 4.0, digital twins, and other digital environments [6, 7, 50]. Early implementations, like Wikipedia bots, already play a significant role in fact-checking and other knowledge-based tasks. On platforms like GitHub, bots assist and automate development tasks [51, 52]. Future cognitive bots could also benefit society by participating in knowledge processing and providing valuable new scientific insights, such as medical advancements, which significantly outweigh their potential risks.

Today, digital platforms handle simple crawling and API-based bots with crawling policies and controlled exposure of APIs. However, advanced user-like bots like the ones envisioned in this paper will require more complex mechanisms to govern and control their behavior and belief-sharing [46, 49]. One approach towards this is ethics and trust by design, which recommends protocols and policies for developers and engineering organizations to incorporate trust models and ethical frameworks at the design and architectural stages [46]. Another approach proposes norms and user policies with penalties for agents to acknowledge, understand, and adhere to, similar to what human users would do on digital platforms [49, 53]. In return, norm- and value-aware bots could establish participation, trust, and compliance while facing the consequences of noncompliance. They may also contribute to revising and creating collective values and norms, possibly becoming part of viable socio-technical ecosystems [49, 54].
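As a toy illustration of the machine-readable norms such an approach implies, the schema below is entirely an assumption, not a proposed standard:

```python
# Hypothetical machine-readable platform norm with an attached penalty,
# which a norm-aware bot could parse and adhere to.
norm = {
    "id": "no-bulk-posting",
    "target": "bots",
    "prohibition": {"action": "post", "limit": 10, "window": "1h"},
    "penalty": {"type": "suspend", "duration": "24h"},
}

def violates(norm, action, count_in_window):
    p = norm["prohibition"]
    return action == p["action"] and count_in_window > p["limit"]

print(violates(norm, "post", 12))   # True -> penalty applies
```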

However, ensuring safety and trust in such ecosystems will require diverse approaches. In addition to providing machine-readable norms and policies targeting cognitive agents, it is essential to tackle ethical and trust issues with transparent and explainable design and engineering processes at each stage. For instance, the European Union (EU) recommends a three-phase human intervention approach at the design phase, at the development and training phase, and at runtime with oversight and override possibilities [55]. As a result, research on developing advanced cognitive bots must also address critical challenges in engineering trustworthy, secure, and verifiable AGI bots employing hybrid approaches.

7 Conclusion

The study presented architectural research challenges toward designing and developing a new line of user-like cognitive bots operating autonomously on digital platforms. Key challenges, such as the transduction problem, are discussed in the context of access to digital web platforms, user-like visual interaction, and autonomous navigation. Bot-environment separation for autonomy, and bot skeleton and behavior model separation for better evolvability, are also discussed as architectural recommendations.

Furthermore, challenges in enacting generalizable behavior, communication, and cooperation are presented. Finally, challenges concerning cognitive bots’ ethics, trust, implications, and future impacts are discussed as part of the architectural concerns.

As an outlook, a good starting point for future work would be to conceptualize a detailed implementation architecture and construct a software bot utilizing existing cognitive models. Such systems can demonstrate the concept and allow further detailed analysis through empirical data and benchmarks.