
1 Introduction

The soundscape is the combination of audio components heard at a given moment in time [52]. We interact with the soundscape involuntarily and continually, as – unlike visual interaction – we cannot ‘close our ears’ to block an audible input. With this in mind, uncontrolled audio environments have been shown to impact individual wellbeing, substantially heightening stress and causing a long-term decline in workplace efficiency [31].

With regard to these topics, research efforts are growing in the fields of affective acoustic ecology [15] and general sound recognition [12], and smart-device applications for aiding sleep and meditative states are now much more common (Footnote 1). However, such apps do not yet personalise the audio in a real-time manner.

Herein, we present background and initial findings to support the development of a system which creates a personalised interaction with sound. Previous studies have shown that multiple modalities can be used non-intrusively to gain an understanding of a user’s current state [3], and that deep generative approaches are able to generate affective data [4].

This contribution is structured as follows: first, we conduct a brief literature review of related work under the topics of sound and stress reduction, Human Computer Interaction (HCI) and affective audio, and computational audio generation approaches. We then perform a preliminary perception study based on a rudimentary approach for soundscape augmentation, and discuss the results. Following this, we propose a state-of-the-art method for the application of affective soundscape audio augmentation. Finally, we conclude on our results and summarise our outlook for the next steps in this research area.

2 Related Work

2.1 Sound for Wellbeing

The taxonomy of sound extends across many branches, from environmental sounds to speech. Within the field of sound healing, many sound sources – both acoustic and synthetic – have been shown to offer a variety of wellbeing benefits, including stress reduction [50]. Conversely, previous studies have suggested that excessive sound levels can affect the hospital working environment, with long-term implications for nursing staff [40].

Acoustic-based tools are commonly used by healing practitioners, e.g., tuning forks at 128 Hz for relieving tissue-based abnormalities [19], or ritual communal drumming, which has been shown to improve wellbeing in young people [59]. Another sound-based practice aimed at the reduction of stress is Transcendental Meditation (TM) [27], which partially utilises the spoken mantra (Footnote 2). Additionally, through the integration of both sound and breathing techniques, TM has been shown to reduce stress both physiologically and psychologically, quantitatively shown as decreased average theta (\(\theta\)) and increased alpha (\(\alpha\)) activity when monitored via Electroencephalography (EEG) [18].

As well as such vocal mantras, practitioners of TM integrate a series of acoustic instruments, such as the Tibetan or crystal singing bowls. These bowls have a long history of use in meditation [24] and are played with a continuous oscillation around the circumference of the bowl, resulting in a full overtone sound which sustains a prolonged resonance. The Tibetan singing bowl has been applied to many stress-reducing scenarios – including as an aid for school teachers [9] – and has been shown to increase feelings of spirituality, in turn relieving symptoms of stress, including tension [20]. The Tibetan singing bowl has also been integrated into a variety of stress-reduction-targeted mHealth apps [22].

With regard to synthetic-based sound tools, a variety of studies have shown stress reduction results in various environments [31]. Synthetic music within a hospital has been shown to have a strong impact on a patient’s experience [61]. Similarly, the acoustic environment of a workplace benefits from artificial acoustic design [30], and through the integration of synthetically designed audio environments the workplace experience also improves [25]. Synthetic audio generation has also been investigated in the realm of therapeutic applications, specifically exploring how synthetic sound might influence listeners’ experiences in psychological areas such as creativity or self-perception [44].

There is also much research focused on how listeners perceive music, how emotions are evoked by music, and which psychological mechanisms cause these emotions [26]. For example, music is often used to enhance the emotional impact of movies [7]. Unlike most other stimuli that evoke emotions, such as encounters with dangerous animals, threats, or facial expressions, music has no obvious intrinsic survival value [34]. Blood et al. presented a novel approach to the study of music and emotion, using positron emission tomography to measure cerebral correlates of affective and perceptual responses to musical dissonance [8].

2.2 HCI and the Use of Audio for Wellbeing Applications

Over the past two decades, researchers have increasingly realised the importance of recognising the emotional aspects which occur during human-computer interaction (HCI) [11]. For example, in many HCI scenarios a computer-aided tutoring system is highly desirable, and a response based on the emotional or cognitive state of the human user may improve the user experience [55]. During interaction, humans provide emotion-based cues through physical gestures, facial expressions, and the voice [23]. Nowadays, affective recognition systems are mainly developed through two key qualitative steps: understanding the emotional response, and adapting the development based on the user experience.

One method for non-intrusively understanding a user’s experience is through the voice, and there is an abundance of HCI applications specifically in the realm of speech recognition, e.g., voice dialling [43]. Automated speech recognition systems are also integrated into language learning paradigms to improve pronunciation [51]. As well as this, voice-based user interfaces are becoming ubiquitously available, being embedded both into everyday mobility via smartphones and into the life of the home via assistant devices [46].

As well as the voice, there has been an increased interest in the impact of in-game audio. Paterson et al. developed an audio design with a complex and immersive soundscape, which is emotionally engaging and supports the game narrative [45]. Similarly, Roden et al. proposed a framework for interactive narrative-based, audio-only adventure games [47], and Sliwinski et al. explored the development of an audio-visual game to induce wellbeing and mindfulness [56]. Likewise, Rogers et al. discussed games which are considered relaxing and encouraged research directions for exploring the role of game audio specifically in improving player wellbeing via stress reduction [48].

Thus, there is much research exploring the potential use of audio for wellbeing. Rogers et al. identified the effect of music in games as a preventative measure against stress in everyday life through the facilitation of relaxation [48]. In relation to health specifically, Williamson et al. explored first-time mothers’ breast-feeding difficulties through the use of audio-diaries [58], and Mirelman used audio-biofeedback for improving the balance of Parkinson’s patients [38]. Additionally, Dijk et al. presented the concept of auditory–tactile stimulation for health and wellbeing, in which carefully selected audio–tactile stimuli alter a person’s bodily, mental, and emotional state [13].

2.3 Audio Generation

Although the scope of this study is focused largely on the generation of complex soundscapes, audio generation can refer to an array of audio-based fields, from speech synthesis to instrument modelling. In this regard, many of the methods mentioned are found across all such domains, but are not limited to them. Conventional computational methods for audio generation include a variety of digital signal processing approaches, such as Hidden Markov Models [53] or cellular automata [10]. These methods are still applied today; however, the current state of the art in audio generation refers to a division within machine learning in which systems are largely data-driven [57].

An earlier deep approach for generating audio was DeepMind’s WaveNet [57]. WaveNet is an autoregressive generator, and is an audio adaptation of PixelCNN [42], modelling features of raw audio represented as 8-bit audio files with 256 possible values. During the training process, the model predicts values of the waveform (audio signals with a temporal resolution of at least 16,000 samples per second, i.e., 16 kHz) at each step, comparing them to the true value using cross-entropy as a loss function. In this way, the WaveNet architecture applies a multi-class classification over 256 classes [35]. As a means of decreasing the computational expense that may be associated with such a classification task, WaveNet applies stacked dilated causal convolutions, greatly widening the receptive field with few layers and without any substantial loss in resolution [60].
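To make the above mechanism concrete, the following is a minimal, illustrative PyTorch sketch of a WaveNet-style stack of dilated causal convolutions with a 256-class output – not the original WaveNet implementation, and omitting its gated activations and skip connections; all layer sizes and the toy data are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that is causal: the output at time t only sees inputs <= t."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad so no future samples leak in
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):
        x = F.pad(x, (self.pad, 0))              # pad only on the left (the past)
        return self.conv(x)

class TinyWaveNet(nn.Module):
    """Simplified WaveNet-style stack: dilations double per layer, widening the receptive field."""
    def __init__(self, channels=32, layers=8, classes=256):
        super().__init__()
        self.input_conv = CausalConv1d(classes, channels, kernel_size=2, dilation=1)
        self.stack = nn.ModuleList(
            [CausalConv1d(channels, channels, kernel_size=2, dilation=2 ** i)
             for i in range(layers)]
        )
        self.output_conv = nn.Conv1d(channels, classes, kernel_size=1)

    def forward(self, x_onehot):
        h = self.input_conv(x_onehot)
        for layer in self.stack:
            h = h + torch.relu(layer(h))          # residual connection (simplified)
        return self.output_conv(h)                # logits over 256 quantisation levels per step

# Toy training step: predict the next 8-bit sample, cross-entropy over 256 classes.
model = TinyWaveNet()
samples = torch.randint(0, 256, (1, 16000))                        # one second of 8-bit audio at 16 kHz
x = F.one_hot(samples[:, :-1], 256).float().transpose(1, 2)        # one-hot input, shape (1, 256, T)
target = samples[:, 1:]                                            # next-sample targets
loss = F.cross_entropy(model(x), target)
loss.backward()
```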

Although WaveNet has been showcased primarily in the speech synthesis domain, its applications are broad. The original architecture showed promise for high-fidelity music generation, with comparable human perception results [57]. A more recent adaptation of the WaveNet framework is the NSynth (Neural Synthesizer) autoencoder, specifically tailored towards the synthesis of musical notes [16].

Another neural network approach, motivated by WaveNet, is SampleRNN [37]. This model is an unconditional, end-to-end neural audio generation architecture that uses autoregressive multilayer perceptrons and a Recurrent Neural Network (RNN) in a hierarchical structure to capture temporal variance over long audio signal durations. Despite showing competitive human perception results against WaveNet, SampleRNN suffers from unrealistic computation times, and the perception results are not shown to be significant, rather indicating tendencies [37]. However, the advantages of time-dependent RNNs would make them suitable for offline soundscape generation.

First proposed by Goodfellow et al. in 2014 [21], Generative Adversarial Networks (GANs) have found recent popularity within the data generation domain and are arguably becoming a fundamental approach for this type of task. GANs are a pair of unsupervised networks – a generator and a discriminator – which compete against each other, with the generator producing new instances of data until the discriminator can no longer reliably tell them apart from real samples; for audio, this essentially means generating new audio samples based on raw audio signals.
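As an illustration of this adversarial principle (and not of any specific published audio GAN), the following minimal PyTorch sketch trains a toy generator and discriminator on fixed-length waveform snippets; the architectures, snippet length, batch size, and learning rates are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Toy GAN for short raw-audio snippets: the generator maps noise to a fixed-length
# waveform, the discriminator scores real vs. generated, and both train adversarially.
snippet_len, noise_dim, batch = 1024, 64, 16

G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                  nn.Linear(256, snippet_len), nn.Tanh())        # waveform values in [-1, 1]
D = nn.Sequential(nn.Linear(snippet_len, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))                             # real/fake logit

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_batch = torch.randn(batch, snippet_len).clamp(-1, 1)        # placeholder for real audio snippets

# Discriminator step: push real samples towards 1, generated samples towards 0.
fake_batch = G(torch.randn(batch, noise_dim)).detach()
d_loss = bce(D(real_batch), torch.ones(batch, 1)) + bce(D(fake_batch), torch.zeros(batch, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator into scoring generated audio as real.
g_loss = bce(D(G(torch.randn(batch, noise_dim))), torch.ones(batch, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```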

As well as being applied to the task of unsupervised representation learning from audio spectrograms [1], GANs aimed specifically at audio generation were first introduced in 2018, with WaveGAN and SpecGAN [14]. Approaches typically applied in the vision domain were explored by extracting spectrogram images and comparing the networks’ ability to generate audible spectrogram instances. This was followed by the Conditional WaveGAN [33], which focused specifically on waveform generation through a concatenation-based conditioning approach. Despite WaveGAN showing strong results for what are described as human-audible samples, post-processing for noise reduction, as well as careful optimisation due to training instability, was required.

3 First Step Soundscape Augmentation Perception Study

To evaluate the efficacy of augmenting the original soundscape to alter emotional perception, we conducted a short listening test with 10 individuals (Footnote 3). Listeners evaluated the arousal and valence dimensions of emotion [49] for each audio file (listening in a randomised order, twice before giving their score) on a 5-point Likert scale (0 = low arousal/valence, 4 = high arousal/valence). All listeners used headphones for this study.

Fig. 1. Spectrogram representation of 30 s from a Tibetan singing bowl recording (left), and a generated pure sine tone (right). Although similar in fundamental frequency, it should be noted that the overtones (as well as reverberation) observable in the spectrogram representation of the singing bowl may play a strong part in altering a listener’s affective state.

3.1 Preliminary Acoustic Analysis

As mentioned previously in Sect. 2.1, the singing bowl is a common acoustic instrument used by healing practitioners (including in Transcendental Meditation) for improving states of wellbeing. With this in mind, we have chosen its most similar synthetic signal – a sine wave – for this first-step augmentation approach. A sine wave (also known as a sinusoid) is a continuous periodic oscillation. As a function of time (t), a sine wave can be expressed as:

$$y(t) = A\sin(2\pi f t + \varphi)$$

where A refers to the amplitude (the peak deviation from zero), f is the frequency, i.e., the number of oscillations (cycles) occurring per unit of t, and \(\varphi\) is the phase, i.e., the point in the oscillation cycle at t = 0.
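As a concrete illustration, a sine tone for such an augmentation can be synthesised directly from this formula; the sample rate, duration, amplitude, frequency, and output file name below are illustrative assumptions rather than the exact values used in the study.

```python
import numpy as np
from scipy.io import wavfile

sr = 16000                     # sample rate in Hz (illustrative)
duration = 7.0                 # seconds
A, f, phi = 0.5, 110.0, 0.0    # amplitude, frequency (A2), phase

t = np.arange(int(sr * duration)) / sr         # discrete time axis
y = A * np.sin(2 * np.pi * f * t + phi)        # y(t) = A sin(2*pi*f*t + phi)

wavfile.write("sine_110Hz.wav", sr, (y * 32767).astype(np.int16))  # write as 16-bit PCM
```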

We performed an initial acoustic analysis of multiple recordings of the singing bowl, taken from the Acoustics Sounds for Wellbeing Dataset [6], and compared these to sine waves of matching frequency. Findings show that characteristics of the audio are similar (cf. Fig. 1 for a spectrogram representation). For example, both are continuous single-frequency oscillations, and when monitoring pitch continuously, the standard deviation came to 24.9 Hz and 23.4 Hz for the Tibetan bowl and the sine wave, respectively. However, it is worth noting that aspects of the singing bowl such as resonance (and even human intervention) may play a deeper part in the improvement of wellbeing, and these are not intrinsically replicated by a single sine wave generator.
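Such a pitch-stability comparison could, for instance, be reproduced along the following lines, assuming librosa’s pYIN-based F0 tracking; the file names are hypothetical placeholders.

```python
import numpy as np
import librosa

def pitch_std(path, sr=16000):
    """Track F0 over time and return its standard deviation in Hz (voiced frames only)."""
    y, sr = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    return float(np.nanstd(f0[voiced]))

# Hypothetical file names, for illustration only.
print("Singing bowl:", pitch_std("tibetan_bowl.wav"))
print("Sine tone:   ", pitch_std("sine_tone.wav"))
```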

3.2 Audio Generation Approach

To summarise the rudimentary audio generation approach applied in this initial study: we utilised the Emo Soundscapes Database [17] and extracted chroma features from 56 audio files (28 with lower ratings of arousal and valence, and 28 with higher ratings of arousal and valence). The audio files fell within the classes of ‘Mechanical’, ‘Human’, and ‘Nature’. We then sonified the corresponding chromatic notes (A–G#) as sine waves, overlaying these onto the original soundscapes.

To achieve this, we developed the first iteration of WellSounds (Footnote 4). In this ‘chromatic approach’, we extract a 12-dimensional chroma feature set from each trimmed (7 s) audio file (prior to normalisation), at a given time-step based on the duration of the audio file. Features are then assigned to the corresponding sine wave frequency (e.g., 65.4 Hz = C2, and 110.0 Hz = A2) and combined to make polyphonic (naive) chord combinations. The segments of audio are then concatenated to make a continuous ‘augmentation’ of the original audio file. The resulting synthetic sine wave augmentation is then mixed onto the original audio file (following a number of post-processing steps, including equalisation and compression). A spectrogram representation of the WellSounds augmentation can be seen in Fig. 2 (Footnote 5).
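As an illustrative approximation of this chromatic approach (not the exact WellSounds implementation), the following sketch extracts chroma features with librosa, sonifies the strongest bins as naive sine chords, and overlays the result onto the original soundscape; the hop length, number of chord notes, octave, gain, and file names are assumptions.

```python
import numpy as np
import librosa
from scipy.io import wavfile

def chromatic_augmentation(path, sr=16000, hop=8192, gain=0.3):
    """Rudimentary sketch: per analysis frame, sonify the strongest chroma bins
    as sine tones and overlay the result onto the original soundscape."""
    y, sr = librosa.load(path, sr=sr, duration=7.0)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)   # shape (12, n_frames)

    # Map the 12 chroma bins (C, C#, ..., B) to sine frequencies in octave 2.
    freqs = librosa.note_to_hz([n + "2" for n in
        ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]])

    aug = np.zeros_like(y)
    for frame in range(chroma.shape[1]):
        start, end = frame * hop, min((frame + 1) * hop, len(y))
        if start >= len(y):
            break
        t = np.arange(end - start) / sr
        # Naive polyphony: take the three strongest chroma bins as a chord.
        for bin_idx in np.argsort(chroma[:, frame])[-3:]:
            aug[start:end] += np.sin(2 * np.pi * freqs[bin_idx] * t)

    mix = y + gain * aug / np.max(np.abs(aug))        # overlay onto the original audio
    mix /= np.max(np.abs(mix))                        # simple peak normalisation
    wavfile.write("augmented.wav", sr, (mix * 32767).astype(np.int16))

chromatic_augmentation("nature_soundscape.wav")       # hypothetical input file
```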

Fig. 2. Spectrogram representation of the original audio and the augmented audio for each of the three classes – Mechanical, Human, and Nature. Through post-processing of the original soundscapes, it can be seen that the noise energy is also reduced in the augmented soundscape, which is particularly prominent in the Nature example.

3.3 Perception Study Results

Results from the study (based on the three classes) are shown in Table 1. To evaluate whether the difference between the augmented and the original soundscape is significant, we conducted a two-tailed t-test, rejecting the null hypothesis at a significance level of p < 0.05.
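For illustration, such a comparison could be computed per class as follows, here assuming a paired two-tailed t-test (since the same listeners rated both versions) and using placeholder ratings rather than the study’s actual data.

```python
import numpy as np
from scipy import stats

# Illustrative placeholder ratings (0-4 Likert) from 10 listeners for one class.
valence_original  = np.array([2, 3, 2, 1, 3, 2, 2, 3, 1, 2])
valence_augmented = np.array([1, 2, 2, 1, 2, 1, 2, 2, 1, 1])

# Two-tailed paired t-test; reject the null hypothesis of equal means when p < 0.05.
t_stat, p_value = stats.ttest_rel(valence_original, valence_augmented)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant = {p_value < 0.05}")
```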

When observing the results on a class basis (cf. left of Table 1), of note, we see a change in emotion perception across all classes, and particularly for the ‘Nature’ class a significant difference is shown between the augmented and original data (p = 0.001 and 0.04 for valence and arousal, respectively). Although not necessarily a positive effect of the augmented soundscapes, this does show promise for the ability of such an augmentation approach to alter states of wellbeing. Additionally, from Table 1, we see that the standard deviation between listeners is quite wide, and therefore further studies with a larger group of listeners may give a more reliable trend.

When looking at Table 1 (right) – where audio files have been grouped based on their original Emo Soundscapes DB emotion rating (i.e., high valence/arousal and low valence/arousal) – we see that, although consistently different from the original source, the high-emotion groups remain rated higher than the low-emotion groups. This suggests that trends in the audio files which are inherent to the emotion are left unchanged; however, this assumption requires further study.

Given this naive approach, further adaptation and audio choices based on emotional content may yield further improvements in affective change. It is also worth noting that the audio applied here is extremely rudimentary, and further digital signal processing techniques, along with the use of more typically ‘pleasing’ audio, would be of value to explore.

Table 1. Results from the perception study, evaluating the perception of 10 listeners on a Likert scale of 0–4 for (V)alence and (A)rousal of the (Ori)ginal and (Aug)mented versions of the soundscape. In the left table, results are presented based on the three soundscape classes (Mechanical, Nature, Human). In the right table, results are grouped by the original ratings of (high) and (low) emotional dimensions of valence and arousal from the EmoSoundscape DB. Mean (\(\mu\)) and standard deviation (±) are reported across all listeners. * indicates a significant difference between the (Ori)ginal and (Aug)mented soundscape.

4 A Deeper Approach for Soundscape Augmentation

Based on our initial findings, in this section we briefly outline a methodology for soundscape augmentation which is based on an individual’s current state, and which will be applied in further studies by the authors on this topic. An overview of this system is given in Fig. 3. Being a predominantly audio-based approach, we aim to utilise methodologies from the field of Speech Emotion Recognition (SER) [54], as well as Generative Voice Conversion [28]. First in this section, we outline the feature extraction method for understanding an individual’s state. Following this, an offline system is defined, in which the user would set a duration of listening in a quiet space. We also propose an online system which ‘augments’ the natural soundscape in real time, through the sonification of audio features generated based on the emotional understanding of the user.

Fig. 3. Overview of the proposed affective soundscape generation system for wellbeing, via multimodal monitoring.

4.1 Feature Extraction and Emotional Prediction

From the user’s input, a fusion of features known to the affective computing community (e.g., MFCCs and spectral features) [29] can be extracted from multiple modes (including the voice, as well as the current soundscape). Of note, in recent work we have found a correlation between biological signals – including the hormone cortisol – and speech features during a stressful situation [3], suggesting that handcrafted features may be useful in this context for understanding a user’s state of lower wellbeing. As well as this, if appropriate for the user’s device, biological features can also be utilised for understanding states of lower wellbeing [2].
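A minimal sketch of such handcrafted feature extraction, assuming librosa and an illustrative choice of MFCC and spectral descriptors summarised by their mean and standard deviation, might look as follows; the recording path is hypothetical.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000):
    """Extract a simple MFCC + spectral feature vector from an audio recording,
    summarised over time by mean and standard deviation (illustrative choice)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)              # (13, n_frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)        # (1, n_frames)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)          # (1, n_frames)
    frames = np.vstack([mfcc, centroid, rolloff])
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])  # fixed-length vector

features = extract_features("user_voice_sample.wav")   # hypothetical recording
print(features.shape)                                   # (30,) = 2 x (13 + 1 + 1)
```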

Utilising deep, pre-trained neural networks, the tailored feature sets can then be classified for their emotionality (e.g., level of arousal and valence) [32]. The resulting predictions are then used to define the current state of the user, acting as a condition for the audio generation.

4.2 Offline Audio Generation

For the offline generation – perhaps in the scenario where a listener aims to reduce their affective state for a short-term period – a pre-existing synthetic emotional source could be used. In this case, a dataset of synthetic audio could be applied, such as the richly annotated EmoSynth database [5]. From this, one-minute emotional samples can be created based on their emotional values; typically, this equates to aspects in the audio such as high arousal corresponding to higher pitch and low arousal to lower pitch, with valence being a somewhat more complex aspect of emotion in terms of acoustic representation. As a means of obtaining varied (i.e., novel for each user interaction) audio outputs with more fine-grained differences, a generative adversarial strategy can be applied, such as StarGAN [28]. In this scenario, a network can be trained on a selection of emotional classes. Then, depending on the predicted emotional state of the individual, a synthetic soundscape (the source) is generated based on the target (user-defined) emotion, e.g., if the user is in a state of high arousal, a low-arousal soundscape is generated for a (user-defined) period of listening, as sketched below.
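A minimal sketch of this ‘inverse target’ logic follows; the normalisation ranges, the valence heuristic, and the commented-out generator call are hypothetical placeholders rather than a defined interface.

```python
from dataclasses import dataclass

@dataclass
class EmotionState:
    arousal: float   # normalised to [0, 1] (assumption)
    valence: float   # normalised to [0, 1] (assumption)

def target_state(predicted: EmotionState) -> EmotionState:
    """Offline strategy sketch: steer towards the opposite arousal level
    (e.g. a highly aroused user receives a low-arousal soundscape),
    while nudging valence towards the positive end (illustrative heuristic)."""
    return EmotionState(arousal=1.0 - predicted.arousal,
                        valence=max(predicted.valence, 0.7))

predicted = EmotionState(arousal=0.85, valence=0.35)    # e.g., a stressed user
target = target_state(predicted)
print(target)                                           # low-arousal, positive-valence target

# In the proposed system, `target` would condition the generator, e.g. a StarGAN-style
# model trained on emotional classes, or select pre-annotated low-arousal samples from
# a database such as EmoSynth (hypothetical placeholder call below).
# soundscape = generator.generate(condition=target, duration_s=60)
```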

4.3 Online Generation Including Feature Sonification

For longer interaction periods, possibly even continual ones (i.e., for implementation in a chaotic working environment), we propose a method in which the offline audio generation is combined with a sonification of features from the natural, ongoing soundscape. To summarise this process: features such as chroma, energy, and F0 can be extracted from the incoming soundscape signal, and reasoning is applied to sonify the chroma and pitch-based features based on the energy of the signal at a given time-point. As well as this, the natural rhythm of the soundscape can be extracted and, as an option, applied to the resulting real-time generation. Rhythm is included, as a consistent rhythm has been shown to have positive effects on wellbeing, producing a calming effect [36]. These two sonification approaches (feature-based and rhythm-based) are then applied to the previously described offline generation process, and the user is able to balance the level of each.
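As a rough sketch of the energy-gated sonification described above (processing the stream block by block as a stand-in for true real-time operation, and omitting the optional rhythm extraction), the following assumes librosa, with illustrative block length, energy threshold, octave choice, and input file name.

```python
import numpy as np
import librosa

def sonify_block(block, sr=16000, energy_threshold=0.01):
    """Sonify one incoming soundscape block: if its energy is high enough,
    return a sine tone at the frequency of the block's dominant chroma bin."""
    rms = float(np.sqrt(np.mean(block ** 2)))
    if rms < energy_threshold:                       # energy gate: stay silent for quiet blocks
        return np.zeros_like(block)
    chroma = librosa.feature.chroma_stft(y=block, sr=sr).mean(axis=1)
    notes = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    freq = librosa.note_to_hz(notes[int(np.argmax(chroma))] + "3")
    t = np.arange(len(block)) / sr
    return rms * np.sin(2 * np.pi * freq * t)        # scale the tone by the block's energy

# Block-wise processing of a stream (here simulated from a file, for illustration).
stream, sr = librosa.load("office_soundscape.wav", sr=16000)      # hypothetical input
block_len = sr // 2                                                # 0.5 s blocks
augmented = np.concatenate([
    block + sonify_block(block, sr)
    for block in np.array_split(stream, max(1, len(stream) // block_len))
])
```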

5 Conclusion and Outlook

In this contribution, we presented preliminary user studies on the effect of augmenting natural soundscapes, as well as proposing a ‘next-step’ methodology for a personalised version of such a system. A series of perception studies [39], including those by the authors [5], support the initial assumption that specific combinations of audio can alter states of individual wellbeing – and the initial results in this contribution show similar trends. These findings thus support further development of the work described herein.

When monitoring states of poor wellbeing, many emotional states are linked to this; prior work by the authors has focused on public-facing speech as a marker of stress [3]. Findings have shown that, through a combination of conventional acoustic features and machine learning algorithms, biological signals including skin conductance, blood volume pulse, and cortisol can be predicted during such states of lower wellbeing. Based on this, it would be of great interest to develop a multimodal system; however, with audio monitoring being non-intrusive and low in resource demands, it may alone be the optimal modality.

With regard to audio generation, a deep autoregressive generative model such as WaveNet [41] has shown promise for generating affective data [4], and, through the use of a generative adversarial network, the authors are currently experimenting with emotional data in a conversion paradigm, i.e., converting from one emotion to another, e.g., happy to sad. Integrating such a generation method here may allow for more variety in the generated output; however, a naive training approach based on single emotions also shows promise for the desired outcome.