1 Introduction

The recent availability of personal devices with built-in video cameras, such as smartphones, high-resolution photo cameras, action cameras, and other wearables, has enabled users to capture, store, and share their everyday life in the form of pictures and videos on social web platforms. Every minute, 500 hours of video are uploaded to YouTube, and over one billion hours of YouTube video are watched each day [123]. The advent of wearable video cameras that can be attached to clothes, such as Google Clips [38] or Narrative Clip 2 [86], as well as of smartglasses with built-in video cameras, such as Spectacles by Snapchat [124], has amplified the phenomenon known as “lifelogging,” a new form of personal big data [39]. Consequently, the practice of recording and sharing one’s life, visual reality, and experiences on video has created the need for advances in video streaming technology to support such user behavior [113, 133], but has also raised new topics in need of scientific investigation, which the community has started to address, such as video literacy, privacy, and the impact on bystanders of the use of video recording devices in public places [54, 55, 57, 58, 59].

Mobile and wearable devices are also being increasingly used for Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR) applications [70], in which the physical world intertwines with the virtual [9, 10, 15] to create novel interactive experiences for users. Also, applications of Mediated Reality [67] delivered by smart eyewear devices enable their users to perceive new facets of physical reality, such as physical phenomena taking place outside the visible spectrum [68], while applications in Alternate Reality (AltR) [19] use multimedia technology to engage users in new experiences, such as “living” other people’s stories and interacting in different places, times, situations, and contexts. At the same time, a new type of ubiquitous mixed reality has emerged from the fusion of networked sensor and actuator infrastructure with shared online virtual worlds: Cross-Reality [94].

We believe that it is only a matter of time before lifelogging and cross-reality technology, used in conjunction, will provide practical means for users to express themselves in new ways, share their everyday life activities, and connect to remote audiences in creative ways for broadcasting and consuming AltR realities. In this work, we explore such aspects at the intersection of lifelogging, cross-reality, and multimedia alternate reality technology, and we focus on the specific topic of broadcasting personal visual realities to a remote audience. Figure 1 shows an illustration of the concept: the visual reality corresponding to the user’s field of view is captured by the video camera embedded in a smart device, such as a pair of smartglasses, processed by lifelogging [39], life abstraction [4], augmentation [15], and/or mediation [67], and streamed to a video server, from where it is broadcast to a remote audience. The audience may consist of friends, family members, co-workers, peers, or any followers [111, 134] recognized and authorized by the broadcaster. The members of the audience may experience the broadcast as a video stream, audio narration, or even haptic feedback connected to specific events that occur in the life of the broadcaster. The palette of design possibilities for systems implementing broadcasting and consumption of personal visual realities supported by lifelogging devices, cross-reality ubiquitous sensor and actuator infrastructure, and AltR modalities is large, with many opportunities for innovation. In this context, it is important to analyze, structure, and discuss design possibilities for such applications. Our practical contributions are:

  1. We introduce the “Alternate Reality Broadcast-Time” matrix (AltR-BT) in the form of a conceptual space to characterize the possibilities for sharing personal visual realities with remote audiences. In this space, we identify the broadcast (what is shared) and the time of broadcasting (when sharing occurs) as two important dimensions for broadcasting personal visual realities to third-party viewers for systems positioned at the intersection of lifelogging, AltR, and cross-reality technology.

  2. We describe three prototypes that implement design options from the AltR-BT space: (1) video streaming from Wi-Fi camera glasses using established protocols, e.g., HLS and DASH; (2) video streaming of mediated and augmented vision implemented with the HoloLens Head-Mounted Display (HMD) and YouTube Live; and (3) broadcasting of life abstraction in the form of concepts automatically detected from the video captured by camera glasses, which are rendered visually, aurally, and as vibrotactile feedback to the members of the remote audience.

As technology advances in terms of cross-reality infrastructure of ubiquitous sensors and actuators, new concepts emerge for AltR realities, and lifelogging devices become prevalent, we are about to see more applications at the intersection of these areas, creating new opportunities for self-expression and new ways for users to connect to remote audiences. Our AltR-BT conceptual space will help guide such developments by informing researchers and practitioners about the design possibilities for broadcasting and consumption of personal visual realities.

Fig. 1

A diagram overview of broadcasting personal visual realities to remote audiences. Notes: In this illustration, streaming and broadcasting include, but are not limited to, video content. For example, concepts can be automatically extracted from first-person video and narrated to an audience wearing smart earbuds; see our broadcast-time matrix in Fig. 3

2 Related work

We relate to prior work in computer-generated, mediated, and cross-reality, overview established protocols for video streaming over the Internet, and discuss prior work on smartglasses with embedded video cameras and video streaming applications, including the lifelogging phenomenon.

2.1 Computer-generated and computer-mediated realities

Prior work has addressed many facets and practical aspects of the ways in which physical reality can be captured, processed, and rendered in modified forms and formats to users. These computer-generated, computer-mediated, and computer-enhanced realities include Virtual Reality [51], Augmented Reality [9, 10], Mixed Reality [77, 78], Mediated Reality [67], Multimediated Reality [68], Alternate Reality [19], and Cross-Reality [94], to name a few of the most common forms of hybrid physical-virtual realities that are relevant to our scope of investigation.

We are interested in such hybrid realities from the point of view of the visual perception they create for their users, in relation to which we discuss the opportunity of broadcasting personal visual realities to remote audiences. To this end, we overview definitions of the various forms of computer realities by focusing on visual perception. For example, Pausch et al. [96] adopted a visual perspective for defining VR as “any system that allows the user to look in all directions and updates the user’s viewpoint by passively tracking head motion” (p. 13), while Mann et al. [68] provided a shorter definition referring to systems that replace the real world with a virtual one, and introduced a classification system for many types of realities, including mediated, augmediated, and multimediated. Azuma [9] introduced a practical definition of AR by enumerating the characteristics of systems that combine the real and the virtual, are interactive in real time, and operate in 3-D. Regarding MR, definitions are still divided among experts [130], although Milgram et al.’s [77, 78] distinction between Augmented Reality and Augmented Virtuality (AV) as instances of Mixed Reality in the Reality-Virtuality Continuum seems to be the preferred one, i.e., “display systems [that] provide users with the opportunity to move back and forth between real world and virtual world scenes” [77] (p. 19) or, respectively, a “mix of real and virtual worlds in various proportions,” according to Mann et al. [68] (p. 1). Speicher et al. [130] concluded that MR “can be many things and its understanding is always based on one’s context” (p. 12).

Mediated Reality (X-Y.R) goes a step further with respect to VR, AR, and MR by mixing, blending, and modifying reality [68], such as by enabling users to filter out undesirable aspects of the visual world. A special application of X-Y.R is “diminished reality” [67], where objects from the background are removed or replaced so that visual attention can focus on a subset of objects of interest; see Zokai et al. [149] for examples. In this regard, Mori et al. [81] described diminished reality as a “set of methodologies for concealing, eliminating, and seeing through objects in a perceived environment in real time to diminish the reality” (p. 1). When mediation and augmentation occur simultaneously, the result is an augmented mediated reality or, for short, Augmediated Reality [50, 68], which Scourboutakos et al. [115] characterized as “an experience where by the means of a system technology, people are able to seamlessly modify (mediate) their perception of reality, while it is also being augmented” (p. 752).

2.2 Cross-Reality and Alternate Reality

Paradiso and Landay [94] introduced Cross-Reality as the union between ubiquitous sensing and actuator technology and shared online worlds. Within Cross-Reality, “sensor networks can tunnel dense real-world information into virtual worlds” and “interaction of virtual participants can incarnate into the physical world through a plenitude of diverse displays and actuators” [94] (p. 14). Although Paradiso and Landay did not provide an acronym for Cross-Reality, Mann et al. [68] referred to it using XR-like notations, while also providing context and discussing the history of the “X” in “XR.” According to Mann et al., the “X” has been used to refer to extrapolation, interpolation, and extension of the physical world. To prevent confusion between these uses of the acronym, Mann et al. [68] provided three definitions for XR realities that received distinct names: Type 1a (or XR1a), Type 1b (XR1b), and Type 2 (XR2) realities, respectively. In Type 1 worlds, “X” is a mathematical variable that denotes either extrapolation past the physical reality (a Type 1a or XR1a reality) or interpolation in Milgram’s [77, 78] sense of the reality-virtuality continuum (a Type 1b or XR1b reality). In Type 2 worlds, “X” denotes Paradiso and Landay’s [94] “cross” prefix from “cross-reality.” Another use of the XR acronym has been the umbrella term “Extended Reality” to cumulatively denote VR, AR, and MR realities. In fact, the community has been using XR to denote both Cross-Reality [26, 48, 84, 85, 118, 126] and Extended Reality [20, 69, 71, 91, 143]. To avoid such confusion, in this work we use the acronym XR2, following the taxonomy and notations of Mann et al. [68], to refer to the cross-reality concept of Paradiso and Landay [94], which is relevant to the scope of our investigation.

Another type of reality relevant for our work is Alternate Reality (AltR). According to Chambel et al. [19], an alternate reality denotes “different spaces, times or situations that can be entered thanks to multimedia contents and systems, that coexist with our current reality, and are sometimes so vivid and engaging that we feel we are living in them ... immersive experiences that may involve the user in a different or augmented world, as an alternate reality.” Thus, AltR goes beyond the focus of VR, AR, and MR by addressing new forms of media and delivering new kinds of immersive experiences to users. Consequently, of all the various forms of combining the real and the virtual, we relate most to XR2 and AltR, since our scope addresses broadcasting personal visual realities, which implicitly requires the support of ubiquitous computing technology (e.g., networked infrastructure, communications protocols, video streaming technology, wearable devices) and multimedia techniques, including interaction techniques, to deliver to a remote audience the visual experience of the broadcaster. Figure 2 illustrates the scope of our work as a Venn diagram adapted from Mann et al. [68]. While XR2, a subset of MR [68], is a combination of virtual worlds and ubiquitous infrastructure, AltR uses multimedia technology to engage users in experiences of other realities corresponding to different places, times, situations, and contexts [19]. Thus, the intersection of lifelogging, AltR, and XR2 delineates the scope of our investigation.

Fig. 2

Venn diagram showing the scope of our work at the intersection of Cross-Reality [94] and Multimedia Alternate Reality [19]. Note: the taxonomy of various types of realities, including Virtual Reality (VR), Augmented Reality (AR), Cross-Reality (XR2), Mixed Reality (MR), and Mediated Reality (X-Y.R), uses the representation formalism from the diagram of Mann et al. [68] (p. 13), to which we added Multimedia Alternate Reality (AltR)

2.3 An overview of video streaming protocols

We relate to video streaming protocols since streaming represents the minimal functionality expected from systems that implement broadcasting of personal visual realities to remote viewers. According to Cisco’s Annual Internet Report for 2018-2023 [22], video streaming is forecast to account for 82% of Internet traffic by 2022, of which 22% will represent UHD IP video. The increased availability of personal video camera devices and the growing adoption of lifelogging practices [39] will lead to even more creation and consumption of video-based content.

Many protocols are available for streaming content, including video; see [113, 133] for overviews. Common protocols include the Real-time Transport Protocol (RTP) [49], the Real-Time Streaming Protocol (RTSP) [103], the User Datagram Protocol (UDP) [100], and the Hypertext Transfer Protocol (HTTP) [104]. For instance, RTP [49] is a push-based protocol for video and audio streaming over IP-based networks, used in communication and entertainment systems that involve streaming media, such as telephony, video teleconferencing applications, television services, and web-based push-to-talk features. RTSP [103] is employed in communication systems that require real-time sessions between end users or between streaming servers and clients. HTTP [104] can be employed to transfer various types of content, including text, images, graphics, audio, and video.

Some communication protocols have been specifically devised for video in order to optimize particular aspects of the delivery chain. An example is HTTP Adaptive Streaming (HAS) [119], which supports most of the video traffic on the Internet due to its reliable transmission, ability to reuse existing cache infrastructure, and firewall traversal. HAS is employed by Over-The-Top Players (OTTP), such as Netflix and YouTube. To improve the performance of adaptive streaming over HTTP, several rate adaptation algorithms have been introduced [64, 65, 79, 136, 147], and the literature has equally focused on improving specific aspects of HAS, such as adaptive bit rate selection [112], buffer-based policies [63], and QoE models [11, 12, 119]. Other protocol examples include Microsoft Smooth Streaming (MSS) [120], Apple HTTP Live Streaming (HLS) [93], and HTTP Dynamic Streaming (HDS) [2]. For instance, HLS was designed to dynamically adapt to network conditions by optimizing playback at the client level. The typical length of media segments in HLS is 10 seconds, which determines its latency. To improve on this aspect, the Low-Latency extension of HLS [102] was proposed. Also, the MPEG group introduced Dynamic Adaptive Streaming over HTTP (MPEG-DASH) to unify existing solutions (e.g., MSS, HLS, HDS) by defining the format of media segments and the Media Presentation Description (MPD).
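To illustrate how a web client consumes such segment-based streams, the following minimal browser-side sketch uses the open-source hls.js player (which we also employ in Sect. 4); the stream URL is a placeholder, and the snippet is an illustrative example rather than part of our prototypes:

// Minimal HLS playback sketch; the stream URL is a placeholder.
import Hls from 'hls.js';

const video = document.querySelector('video');
const streamUrl = 'https://example.com/live/stream.m3u8';

if (Hls.isSupported()) {
  const hls = new Hls();
  hls.loadSource(streamUrl);     // fetches the playlist, then its media segments
  hls.attachMedia(video);
  hls.on(Hls.Events.MANIFEST_PARSED, () => video.play());
} else if (video.canPlayType('application/vnd.apple.mpegurl')) {
  video.src = streamUrl;         // e.g., Safari plays HLS natively
  video.play();
}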

2.4 Smart eyewear devices with video streaming capabilities

A large number of commercially available smartglasses feature video streaming functionality [7, 30, 66]. Applications for smartglasses address a wide range of uses, including improving the quality of life of people with visual impairments [42, 131, 135], delivering information about one’s surroundings [110], streaming the wearer’s first-person video to remote viewers [3, 72, 105], consuming media projected on see-through lenses [52, 89, 90], and lifelogging [4, 5]. For example, Stearns et al. [131] introduced a system that used the HoloLens HMD to render video captured by a smartphone, enabling users with low vision to see a magnified version of their visual reality. Tanuwidjaja et al. [135] developed a smartglasses-based system for color substitution to assist colorblind people in distinguishing color information. Rio and McCullough [72, 105] discussed telementoring and teleproctoring applications. Other systems have employed video streaming to provide access to remote events rendered via HMDs [90], to view favorite parts of videos recorded in 360\(^\circ\) [89], for end-to-end VR streaming [52], and to deliver super-multi-view 3-D content from an autostereoscopic device to a web browser [117].

2.5 Lifelogging

Lifelogging is the practice of using devices with embedded sensors, such as wearable video cameras [4, 16, 44] or activity and location trackers [75, 148], to record aspects of everyday life [40]. As wearable devices have become increasingly affordable and available, lifelogging has become a phenomenon [35], with many people adopting this technology as a life practice. Whereas in 1996 JenniCam [41] streamed student Jennifer Kaye Ringley’s daily activities at a rate of one snapshot every three minutes, today’s high-speed action cameras can capture 1080p images at the impressive speed of 240 fps [41]. Several commercial products are available for lifelogging enthusiasts, among them Narrative Clip 2 [86], MeCam [74], and SnapCam [125]. Lifelogging prototypes based on wearable video cameras [16, 28, 44] are well represented in the scientific literature, which has targeted specific applications, such as food-logging [56], vehicular lifelogging [5, 73], thing-logging for the Internet-of-Things [36], logs of computer usage [43] and sleep patterns [75], and applications that monitor and report aspects of the quality of life [148]. Among these applications, lifelogging has been used for people with cognitive disabilities [6, 14, 132]; e.g., Berry and Stix [14, 132] discussed two applications for people with Alzheimer’s that enabled users to relive recent visual experiences with the help of the data stored in the lifelog, and Al Mahmud et al. [6] proposed an image capturing device for people with aphasia that captured photographs and added tags automatically.

2.6 Privacy aspects about the use of video cameras in public places

With the increasing availability of mobile and wearable devices with built-in video cameras that feature video recording and streaming functionality, ensuring the privacy of bystanders represents an important ethical aspect of using such devices in public places. Many studies have examined concerns regarding public video recording with wearable video cameras [21, 27, 31, 45, 46, 57, 58, 59, 101, 122]. Some studies have focused on the social acceptability of smartglasses [27, 59], while others analyzed sensitive lifelog data [45] or documented the behavior of lifeloggers [46, 58]. Also, the legal and ethical implications of video streaming from public places have been addressed [31, 101], including practical solutions for bystanders in the form of privacy mediation [27, 45, 57]. For example, Koelle et al. [57] proposed a set of free-hand gesture commands to enable bystanders to signal their preferences about being recorded or not by users of wearable video cameras. By using a gesture vocabulary that video camera devices could detect and understand, passers-by could explicitly express their consent to or disapproval of video recording. Some systems, such as Life-Tags [4], considered privacy aspects as part of their design requirements and replaced video recording with concepts (tags) extracted from videos.

3 The Alternate Reality broadcast-time matrix

We introduce in this section a conceptual space with two axes, broadcast and time, in the form of a discrete two-dimensional matrix, in which we identify multiple opportunities to design systems for broadcasting users’ personal visual realities to remote audiences. We draw inspiration from Johansen’s 2 \(\times\) 2 time-space groupware matrix [53] for group interactions, which depicts all possible combinations of location (where communication between individuals takes place) and time (when communication takes place). For example, among the four categories permitted by the time-space matrix [53], Johansen discussed face-to-face interactions that occur in the same place and at the same time, as well as communication and coordination between the members of a group that can be implemented remotely and asynchronously, i.e., in a different place and at a different time. From Johansen’s matrix, we borrow the “time” axis (i.e., when the broadcast of the personal visual reality takes place), which we extend to cover practical aspects regarding streaming content in relation to the latency of video streaming protocols [141]. Also, since broadcasting to a remote audience implicitly specifies distinct locations, i.e., the locations of the broadcaster and of the members of the audience, we replace Johansen’s “space” axis with “broadcast” to distinguish between various types of content related to the visual reality that is shared. With these two axes, broadcast and time, our matrix specifies what part of one’s personal visual reality is shared and when sharing occurs in terms of the synchronicity between the broadcasters and their audiences. By specifying all the possible combinations between the categories of these two axes, our matrix generates a two-dimensional conceptual space, the Alternate Reality Broadcast-Time space (AltR-BT), in which life broadcasting systems can be positioned, characterized, and compared, and in which ideas regarding future versions of such systems can be generated by following a systematic approach.

Figure 3 shows a visual illustration of the AltR-BT conceptual space for broadcasting personal visual realities to remote audiences, in which 40 combinations of broadcast \(\times\) time have been identified. However, a larger number of possibilities exists in this space, and systems may implement more than one category from each axis, e.g., there are \(\binom{10}{2} = 45\) ways to select two categories from the broadcast axis that, when combined with each category from the time axis, generate 180 new design possibilities; also, there are \(\binom{10}{3} = 120\) ways to select three broadcast categories, leading to 480 designs by combining them with each time category, and so on. Next, we discuss each axis in detail.
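The counts above follow directly from the binomial coefficient; the short node.js sketch below (a plain script with no external packages, provided for illustration only) reproduces them:

// Counting design combinations in the AltR-BT space.
function binomial(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;  // the running product stays an integer
  }
  return result;
}

const broadcastCategories = 10;
const timeCategories = 4;

console.log(broadcastCategories * timeCategories);  // 40 single-category combinations
console.log(binomial(10, 2) * timeCategories);      // 45 x 4 = 180 designs with two broadcast categories
console.log(binomial(10, 3) * timeCategories);      // 120 x 4 = 480 designs with three broadcast categories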

Fig. 3

The Alternate Reality Broadcast-Time matrix, a conceptual space to inform design and implementation of systems for broadcasting personal visual realities. Note: in Sect. 4, we discuss a few demonstrative prototypes, highlighted in this space

3.1 The time axis

By drawing from the time axis of Johansen’s [53] matrix, we distinguish between synchronous and asynchronous communication between the broadcaster and their remote audience. On this axis, we consider different application requirements for the interactivity between the broadcaster and the audience, and identify four categories, as follows (see Fig. 3):

  1. High-level interactivity expected. This category includes applications for which two-way live interactivity between the broadcaster and the audience is critical [107]. For example, applications for video conferencing, collaborative work, and remote labs, where synchronization is key in order for the task to be accomplished effectively, fall in this category. Under the requirements of a highly interactive life broadcasting system, a practical technical implication is a very low latency for the broadcast, delivered with a high-performing streaming protocol. One possible implementation at the moment of this writing is the Web Real-Time Communication protocol (WebRTC) [106], regularly employed for two-way web conferencing and telepresence applications; see the sketch after this list.

  2. Medium-level interactivity expected. This category contains applications with less stringent requirements on the interactivity expected between the broadcaster and their audience. Strict synchronization is not required for the life broadcasting system to operate effectively and for tasks to be accomplished. Examples of applications include user-generated content live streams, game streaming, and e-sports. Existing video streaming protocols that could be employed to support this category of applications include Low-Latency HLS [102] (latency of at least about two seconds), Low-Latency CMAF for DASH [109] (at least two seconds), and RTSP/RTP (at least one second); see [141] for details.

  3. Low-level interactivity expected. With this category, we enter the asynchronous part of the time axis. It covers applications that do not necessarily expect interactions between the broadcaster and the audience, or for which interactions need not be synchronized with the broadcast, such as streaming news, sport events, or one-way streams of reality shows and live events to large audiences. Examples of video streaming protocols currently in use that are suitable to support applications from this category include Apple HLS [93] and MPEG-DASH [82], with latencies of around eighteen seconds, according to reports from [141].

  4. No interactivity expected between the broadcaster and the audience. For these applications, the content is archived and broadcast only on demand. YouTube video streaming of content that was previously uploaded by creators represents a relevant example. In this case, the audience can query the archive to locate video recordings that match specific keywords, descriptions, or time intervals. Streaming Netflix content also falls into this category. Searching through the lifelog [39] for recollecting, reminiscing, retrieving information, reflecting, and remembering is another example.
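As an illustration of the first category, the following browser-side sketch shows how a broadcaster could publish a low-latency camera stream with WebRTC; the application-specific signaling channel is omitted, and the STUN server is an illustrative choice rather than a requirement:

// Minimal WebRTC publishing sketch for high-interactivity broadcasts.
async function startBroadcast() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  const pc = new RTCPeerConnection({ iceServers: [{ urls: 'stun:stun.l.google.com:19302' }] });
  stream.getTracks().forEach(track => pc.addTrack(track, stream));  // publish camera and microphone
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  // The offer/answer exchange with the remote viewer happens over a separate
  // signaling channel (e.g., WebSockets), which is omitted here.
  return pc;
}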

From the first to the last category on the time axis, the interaction between the broadcaster and their remote audience becomes less constrained, from high-level interactivity to no interactivity expected at all. To exemplify possible implementations of life broadcasting systems along the time axis, we have enumerated in this section several streaming protocols that could be employed at the moment of this writing to meet various application requirements. However, our time axis should be viewed from the higher-level perspective of specifying interactions that need to take place either synchronously or asynchronously, rather than from the perspective of the latency delivered by one streaming protocol or another. Moreover, these requirements are inherent to human cognition and do not change as streaming protocol technology evolves.

3.2 The broadcast axis

The time axis specifies how the time zones of the broadcasters and their audiences overlap during the broadcast, with implications regarding the interactivity permitted between the broadcaster and the audience. In the following, we look at the video content that is broadcast in relation to one’s personal visual reality, for which we consider various forms to transmit it and make it available to the remote audience. We distinguish two major categories on the broadcast axis, video and non-video, and ten subcategories (see Fig. 3 for a visual illustration), as follows:

  1. Original, unmodified video. The video that is broadcast is not processed in any way and, thus, the audience receives the original version of the broadcaster’s visual reality as captured by the video acquisition device. This category contains applications of all types of video cameras: fixed, wearable, or embedded in various devices. For example, wearable video cameras can be attached to the body to record and stream video from a first-person perspective. Commercial products for lifelogging, such as Narrative [86], Google Clips [38], MeCam [74], and SnapCam [125], and videos captured with photo cameras and smartphones are also part of this category [1, 18, 60]. First-person, eye-level video also falls into this category, but is distinct from the previous examples in that the video camera is worn at eye level so that video is recorded and streamed from the eye height of the wearer [108]; an example is the Life-Tags prototype of Aiordăchioae and Vatavu [4]. The importance of a correct perspective and point of view for lifelogging systems has been explicitly stated in the scientific literature, see Gurrin et al. [39] (pp. 20, 31, 39), but most commercial products for lifelogging [38, 74, 86, 125] have not yet implemented such recommendations. Other examples of applications from this category use video cameras integrated into other devices that follow the user around, such as personal drones [83], or that are employed by the user, such as cars [99].

  2. Mediated video. The video that is broadcast may be processed, either locally or at the video streaming server, so that the audience receives a mediated view of the personal visual reality of the broadcaster. According to Mann [67], mediation offers a wide range of options, from processing video using computer vision algorithms (e.g., contrast enhancement, color correction, etc.) to diminished reality (where objects from the background are faded out so that visual attention can focus on a subset of objects of interest). For example, the full visual reality of the broadcaster may be accessible to a selected group of members from the audience, while the rest have access to a diminished version only, from which sensitive and personal content is removed. For instance, the “Audience Silhouettes” [137] prototype broadcasts 3-D body silhouettes of TV viewers to the members of their social group, but removes the background and any personal traits (e.g., face, clothes, etc.), leaving just their body silhouettes for non-verbal communication.

  3. Augmented video. Different from mediation [67, 68], augmentation refers to adding new information, in the form of computer-generated content, on top of the video capture of the physical reality. Examples include highlighting faces in video streams, displaying the names and descriptions of identified persons, showing details of specific objects, and so on. AR applications [15] fall into this category.

  4. Virtual-world video. In this category, the audience is presented with a completely virtual version of the broadcaster’s visual reality. Milgram et al. [77] discussed transitions between different mixtures of the physical and the virtual leading to completely virtual worlds, and Vatavu et al. [139] presented the virtual world/conventional TV and virtual world/virtual TV examples for AR television, where TV viewers are immersed into virtual environments, from where they watch television either in the form of a video stream of their physical TV set or on a virtual TV inside the VR environment.

  5. Broadcast summaries. Represent individual snapshots that are broadcast instead of actual video. The advantage of this approach is reduced network bandwidth and flexibility in terms of the events that trigger streaming. For example, Google Clips [38] automatically starts recording when it detects a familiar face or a pet, and video surveillance systems start recording/streaming when motion is detected, etc.

  6. Life abstractions. Represent summaries of the broadcaster’s visual reality over a specified period of time. The concept of life abstraction was introduced by the Life-Tags prototype of Aiordăchioae and Vatavu [4] to present users with short summaries of their lifelog in the form of word clouds of concepts automatically extracted from video.

  7. Visual news tickers. Represent a particular form of life abstraction in which the concepts automatically detected from video are broadcast to the audience and displayed in the form of a narration. DeepDiary [32] is an example of such a system that employs automatic image captioning techniques to produce textual representations of video lifelogging data. Such narrations can be rendered as text to the audience as if they were a “news ticker” reporting on the life of the broadcasters.

  8. Audio news tickers. Render the content of the previous category using text-to-speech techniques. For example, Soundscape [76] is an application that provides information about one’s surroundings, the “Sotto voce” [8] system enables sharing of audio information, and audio AR systems [13, 142] fall into this category as well.

  9. Haptic news tickers. Convert content to vibrotactile feedback delivered on the body of members from the audience, such as on the wrist (e.g., by a smartwatch), the arm (by a smart armband), and so on. Tactons [17], digital vibrons [138], Flex-n-Feel [121], and haptic AR systems [61] are examples that relate to this category.

  10. Multimodal news tickers. Represent multiple possible combinations of the previously mentioned modalities to broadcast life to the remote audience. For example, the audience may experience both visual and haptic news tickers, where the latter accentuate particular events.

From the first to the fourth category of the video part of the broadcast axis, the relationship between the video content that is broadcast and the corresponding visual reality of the broadcaster becomes weaker. From the original, unmodified version of the video captured by the acquisition device to the virtual reconstruction and representation of the broadcaster’s visual reality, the audience is presented with different versions of what the original visual experience of the broadcaster has been. In this work, we are primarily interested in broadcasting visual life, but video is not the only modality to attain this goal. Consequently, the non-video part of the broadcast axis specifies other modalities by means of which the visual experience of the broadcaster can be delivered to their audience, from textual descriptions of visual life to video rendered using haptic feedback. The broadcast axis thus reveals a variety of modalities in which personal visual realities can be communicated to the members of the broadcaster’s audience. These modalities go beyond mere video streaming toward richer forms of self-expression for the broadcaster and consumption opportunities for the audience. Other categories could be added to the broadcast axis, such as categories that address the gustatory or olfactory senses [88] of the audience. However, since research regarding these modalities is still scarce compared to the large body of work on rendering visual, audio, and haptic feedback, we opted to limit, for now, the number of categories on the broadcast axis to keep it manageable from a practical point of view. Future work can incorporate such new categories to address new applications of broadcasting personal visual realities.

4 Demonstrative prototypes

We describe in this section three prototypes that demonstrate various design possibilities from our AltR-BT conceptual space for sharing personal visual realities with remote audiences. Each prototype builds on prior work and extends it by implementing various AltR-BT categories.

4.1 Broadcasting first-person, eye-level video

Aiordăchioae [3] presented a simple yet effective solution for sharing first-person, eye-level video, captured from a pair of camera glasses, with remote viewers. Aiordăchioae’s technical solution employed HTTP, WebSockets, JavaScript, and node.js technology to transfer individual snapshots from the camera glasses to clients running in the web browsers of the remote viewers’ desktop PCs, smartphones, and tablets; see [3] for details regarding the software architecture and implementation. Using concepts and categories from the AltR-BT space, we can characterize Aiordăchioae’s system as the design combination {original, unmodified video} \(\times\) {low-level interactivity expected}. Thus, this prototype reflects the intersection between XR2 and AltR, where the remote audience can experience the broadcaster’s visual reality, but little interactivity is required.
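For context, a snapshot relay in the spirit of [3] can be sketched as follows; this is our own minimal reconstruction rather than the original implementation, and the HTTP route, port, and the ws npm package are assumptions:

// Sketch: snapshots received over HTTP POST are pushed to browser clients over WebSockets.
const http = require('http');
const WebSocket = require('ws');

const wss = new WebSocket.Server({ noServer: true });

const server = http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/snapshot') {
    const chunks = [];
    req.on('data', chunk => chunks.push(chunk));
    req.on('end', () => {
      const jpeg = Buffer.concat(chunks);
      // Forward the latest snapshot to every connected viewer.
      wss.clients.forEach(client => {
        if (client.readyState === WebSocket.OPEN) client.send(jpeg);
      });
      res.end('ok');
    });
  } else {
    res.statusCode = 404;
    res.end();
  }
});

server.on('upgrade', (req, socket, head) => {
  wss.handleUpgrade(req, socket, head, ws => wss.emit('connection', ws, req));
});

server.listen(8080);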

In the following, we show how the prototype from [3] can be extended by considering more categories from the time axis of the AltR-BT space. Specifically, we re-implemented the technical solution from [3], but replaced the request-response communications for the transfer of individual snapshots from the camera glasses with established video streaming protocols, which were discussed in Sect. 2. We employed the same model of video camera glasses [30], which feature a full-HD micro video camera with a 90\(^\circ\) field of view and Wi-Fi operation; see Fig. 4, top-left image for a snapshot of the user (broadcaster) wearing the glasses. The software implementation employed the node-media-server [80] npm package for node.js, which provides server-based implementations of the RTMP, HTTP-FLV, WS-FLV, HLS, and DASH video streaming protocols; see a description of these protocols in Sect. 2. The server-side application receives an incoming video stream from the camera glasses in the Advanced Streaming Format (ASF) [34], which is converted to RTMP [95] using the FFmpeg [33] module for node.js. This conversion enabled us to use node-media-server to publish live streams delivering HTTP-FLV, WS-FLV, HLS, and DASH video. To access these live streams from a web browser, we implemented web clients using dash.js [25], hls.js [140], and flv.js [47]; see Fig. 4, top-right image for an illustration. On a smartphone client (Samsung Galaxy J6, Octa-Core 1.6 GHz Cortex-A53 CPU, 3 GB RAM, Android 9, Chrome), we received video with an average latency of 6 seconds for RTMP, 23 seconds for HLS, 42 seconds for DASH, and 8 seconds for HTTP-FLV and WS-FLV. Figure 4, bottom illustrates the categories from the AltR-BT space implemented for this prototype.
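The server-side setup can be summarized by the sketch below; the ports, media root, and segmentation flags are illustrative assumptions, and the FFmpeg conversion of the glasses' ASF feed into the published RTMP stream is configured as described above:

// Sketch: node-media-server republishing an incoming RTMP stream as HLS and DASH.
const NodeMediaServer = require('node-media-server');

const config = {
  rtmp: { port: 1935, chunk_size: 60000, gop_cache: true, ping: 30, ping_timeout: 60 },
  http: { port: 8000, mediaroot: './media', allow_origin: '*' },
  trans: {
    ffmpeg: '/usr/bin/ffmpeg',   // path to the FFmpeg binary used for repackaging
    tasks: [
      {
        app: 'live',             // streams are published to rtmp://<host>/live/<stream-key>
        hls: true,
        hlsFlags: '[hls_time=2:hls_list_size=3:hls_flags=delete_segments]',
        dash: true,
        dashFlags: '[f=dash:window_size=3:extra_window_size=5]'
      }
    ]
  }
};

new NodeMediaServer(config).run();

The web clients built with hls.js, dash.js, and flv.js then point to the playlists and FLV endpoints exposed by this server.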

Fig. 4

First-person, eye-level video streaming using camera glasses. Top-left: the broadcaster wearing a pair of camera glasses. Top-right: snapshot of a smartphone rendering DASH and HLS video streams. Bottom: visual illustration of the AltR-BT concepts involved

4.2 Broadcasting third-person perspectives using a personal drone

We extended the prototype from the previous section to work with video captured from a personal drone, providing a third-person perspective. To this end, we used the Parrot Mambo Fly drone, to which we attached a 720p HD video camera; see Fig. 5. We kept all the technical details of the software implementation (node.js, HLS, DASH, HTTP-FLV, WS-FLV) and, instead of connecting to the pair of glasses, we connected to the drone’s video camera. The result is life broadcasting of original video captured and delivered from a third-person, aerial perspective.

Fig. 5

Third-person perspective video streaming with a personal drone (top-left image) to a laptop (top-right). Bottom: visual illustration of the AltR-BT concepts involved

4.3 Broadcasting first-person, eye-level mediated and augmented video

To further explore the design categories of the broadcast axis, we revisited an application for Microsoft HoloLens from Pamparău and Vatavu [92] that implemented mediated and augmented vision; see Fig. 6, top-left. This prototype implements several image processing algorithms, such as color correction, edge highlighting, contrast and brightness adjustment, and face detection, which were informed by prior work on vision rehabilitation and augmentation [29, 62, 97, 98, 135, 144, 145, 146]. The processing techniques, called visual filters in [92], are applied to the video frames captured by the camera embedded in the HoloLens HMD. The result is aligned with the visual reality perceived via the see-through lenses; see Fig. 6, top-right and middle images and Pamparău and Vatavu [92] for more details. From the perspective of our scope of investigation, this prototype is at the intersection of XR2 and AltR, enabling the remote audience to experience not only an original version of the broadcaster’s visual reality, but also its mediation and augmentation. We implemented broadcasting of the user’s mediated and augmented vision with the OBS software [129] and a YouTube channel for live streaming. This solution enables a remote audience to access the augmented and mediated video broadcast from the web browser of any device, e.g., desktop PC, laptop, smartphone, smartwatch, etc. The bottom part of Fig. 6 illustrates the categories from the AltR-BT space, i.e., {original video, mediated video, augmented video} \(\times\) {medium-level interactivity expected}.

Fig. 6

The HoloLens HMD prototype implementing augmented and mediated vision. Top left: the broadcaster wears HoloLens that delivers a contrast-adjusted and color-corrected version of the visual reality. Top right and middle: the YouTube live video streaming enabling other viewers to access the mediated visual reality of the broadcaster. Bottom: visual illustration of the AltR-BT concepts implemented by this prototype

4.4 Broadcasting life abstractions

Aiordăchioae and Vatavu [4] introduced Life-Tags, a smartglasses-based system designed to automatically capture snapshots and to abstract them in the form of word clouds of visual concepts (tags). Life-Tags was the first wearable system implementing life abstraction in the context of lifelogging applications. However, Life-Tags was designed to operate mainly by saving the snapshots and associated tag clouds for the user’s benefit and less for sharing them with a remote audience. According to our broadcast-time matrix, Life-Tags implements {life abstraction} \(\times\) {no interactivity required (archived content)} and is found at the intersection between lifelogging technology and XR2.

Fig. 7

The extended version of Life-Tags [4]. Top: the broadcaster wearing camera glasses, an example of a tag cloud, and a member of the audience wearing the Myo smart armband. Bottom: visual illustration of the AltR-BT concepts implemented by this prototype

In the following, we show how Life-Tags [4] can be extended for live streaming of tag clouds and for other output modalities from the broadcast axis. We re-implemented the Life-Tags pipeline described in [4] as follows: snapshots automatically captured from micro video camera glasses are stored temporarily on the user’s smartphone, from where they are offloaded to permanent storage, such as a desktop PC, and a third-party processing service [23] is used to generate descriptive tags, which form word clouds [24]; see Fig. 7, top-right for an example. We implemented a node.js server-side application using the Socket.IO library [127] for real-time, bidirectional, event-based communications, and represented tag clouds as JSON objects; see the sketch below. For the client side, we employed the Android Socket.IO client library [128] to implement the communication with the server. On the client side, multiple options are available to render the tags: visual display of tag clouds, a visual news ticker where tags are scrolled into view, audio rendering of the tags via a cloud text-to-speech API [37], and haptic feedback. Regarding the latter, our choice for a haptic rendering device was the Myo smart armband [87]; see Fig. 7, middle image. Myo is a lightweight (93 grams) armband, expandable between 19 and 34 centimeters, that provides haptic feedback in the form of short, medium, and long vibrations. We implemented associations of specific concepts, e.g., “travel” or “working,” with specific vibration patterns, e.g., medium pulse, pause of 500 ms, short pulse, pause of 500 ms, long pulse; see examples in the literature [138] and previous results regarding user perception of vibrotactile feedback at forearm level [114]. Figure 7, bottom illustrates the categories from the AltR-BT space, i.e., {life abstraction, visual news ticker, audio news ticker, haptic news ticker} \(\times\) {no interactivity expected}.
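A minimal sketch of the server-side broadcasting step is shown below; the event name, port, and the concept-to-vibration mapping are illustrative assumptions rather than the exact implementation:

// Sketch: pushing tag clouds to connected clients with Socket.IO.
const { Server } = require('socket.io');

const io = new Server(3000, { cors: { origin: '*' } });

// Hypothetical mapping from detected concepts to Myo vibration patterns.
const hapticPatterns = {
  travel:  ['medium', 'pause500ms', 'short', 'pause500ms', 'long'],
  working: ['short', 'pause500ms', 'long']
};

function broadcastTagCloud(tags) {   // tags: [{ word: 'travel', weight: 0.8 }, ...]
  io.emit('tagcloud', {
    timestamp: Date.now(),
    tags,
    haptics: tags.map(t => hapticPatterns[t.word]).filter(Boolean)
  });
}

module.exports = { broadcastTagCloud };

On the client side, the Android Socket.IO library subscribes to the same event and dispatches the payload to the visual, audio, or haptic renderer.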

5 Conclusion and future work

We introduced in this paper the broadcast-time matrix as a conceptual space to guide researchers and practitioners interested in implementing life broadcasting to remote audiences. Beyond video broadcasting, however, our conceptual space enumerates many design options and covers a wide palette of possibilities to broadcast and consume personal visual realities, from mediated video to life abstractions and delivery of haptic feedback connected to specific events tagged in the visual life of the broadcaster. The broadcast-time matrix represents an opportunity to explore new designs of interactive systems located at the intersection of lifelogging, alternate reality, and cross-reality technology, as illustrated in our Venn diagram from Fig. 2. The implications for lifelogging concern new ways to store, stream, and query information collected about one’s visual life for the purposes of recollecting, reminiscing, retrieving information, reflecting, and remembering, i.e., the five R’s of memory access with lifelogging systems [116]. For alternate reality, where users connect to other people’s life events that take place in remote locations and even at different times, our broadcast-time matrix enables a systematization of the possibilities to deliver such experiences from the practical perspective of available modalities, from mediated video streams to news tickers conveying just the key aspects of the broadcaster’s visual life. For cross-reality, our broadcast-time matrix puts together requirements about ubiquitous networked infrastructure (e.g., synchronous communications) with characteristics of virtual experiences (e.g., mediated video and virtually reconstructed worlds) so that practitioners can better inform the technical designs of their prototypes.

Future work will look at extending the broadcast-time space with new categories, e.g., other modalities on the broadcast axis to deliver gustatory and olfactory experiences [88] to the remote audience, but also at adaptations to specific application domains, such as new television environments [137, 139]. More practical applications are also envisaged, incorporating new sensors to acquire various aspects of visual life, addressing new contexts of use, and targeting specific user categories. We hope that our contributions, at the intersection of lifelogging, alternate reality, and cross-reality, will draw the attention of the scientific community interested in multimedia research toward the many opportunities enabled by putting together concepts from these areas of investigation.