1 Introduction

Speech processing in the context of assisting people with various impairments has been researched for decades. The oldest and simplest devices, hearing aids, were just used to amplify the speech signal, and in this way they helped hearing impaired people. Next, when speech processing technologies advanced, speech synthesis and speech recognition started to be applied for visually impaired people. Along with the progress of speech, image and video processing, new, even more advanced systems were proposed, which were used in therapy for autism or dementia, for example.

In this work we propose a system that uses speech synthesis, accompanied by facial animation, in the therapy of auditory verbal hallucinations. The system was inspired by initial research by Huckvale et al. [13], who proposed an avatar with voice conversion for that purpose. In contrast to their approach, we propose the use of synthetic speech. Thanks to this change, the therapist can accompany the patient during a therapeutic session and the therapy can take place in a single room.

This paper is organized as follows: first, in Sect. 2, we present the theoretical background, briefly describing the problem of voice hallucinations, the treatment of psychological disorders using audio-visual techniques and the use of speech synthesis in the context of assistive technologies. Next, in Sect. 3, we describe our proposed solution. Section 4 presents the results of initial experiments using the proposed therapy. Finally, Sect. 5 summarizes the article.

2 Theoretical Background

2.1 Auditory Verbal Hallucinations

In psychopathology, hallucinations are the primary perception disorder in which qualitative changes occur [15]. They are false sensory perceptions, accompanied by a sensation of reality [3], which occur without the involvement of any external stimulus. According to the diagnostic criteria of the International Classification of Diseases (ICD-10), auditory verbal hallucinations are the most commonly encountered and, at the same time, the most characteristic symptoms of schizophrenia, a disease of the central nervous system in which abnormal neurotransmission in the dopamine and serotonin systems is observed. The hallucinated voices interact with the patient, comment on their behavior and provoke them to various actions.

Medications used to treat the symptoms of schizophrenia (antipsychotic drugs) act, among other things, by blocking dopamine D2 receptors. This reduces the intensity of productive symptoms, which include auditory hallucinations [21]. Auditory verbal hallucinations, often called “voices,” occur in 75% of people diagnosed with schizophrenia. Moreover, they are also symptoms of other disorders, such as borderline personality disorder, posttraumatic stress disorder, epilepsy, Parkinson’s disease, as well as dissociative, psychotic and affective disorders. They can also be observed in people without any clinical diagnosis [16].

Clinical practice often shows that medical treatment alone does not sufficiently help patients with voice hallucinations. Half of the hallucinations progress into a chronic form and persist for several months or even years despite pharmacological treatment. This symptom often leads to psychiatric hospitalization as well as to the patient’s withdrawal from social roles. Such a situation calls for new therapeutic solutions. One of them is cognitive behavioral therapy, which helps patients deepen their understanding of the symptoms. It is included in the Schizophrenia PORT Treatment Recommendation [5] and acknowledged as one of the most effective therapeutic approaches.

Auditory hallucinations are often accompanied by difficult emotions and by non-adaptive behavior. In the cognitive model of auditory hallucinations [6], distressing emotions and actions result not from the content of the hallucinations but from the meaning and beliefs that the person attaches to them. In this view, working on the relationship between the patient and the symptom is important. In the proposed method, using an avatar within cognitive behavioral therapy makes it possible to bring out and modify this relationship.

2.2 Treating Psychological Disorders Using Audio-Visual Techniques

Audio-visual techniques for treating psychological disorders, including voice hallucinations, have been tested in the past. One such method was described in 2013 by Huckvale and his colleagues from University College London [13]. The authors developed and evaluated a therapy based on an audio-visual dialog system. In this method, the patient was first asked to create an avatar by choosing a face resembling the visual hallucinations in their mind and by modifying the voice timbre to resemble their voice hallucinations. Next, during a therapeutic session, the avatar was driven by the therapist’s voice and the patient held a conversation with it. Voice conversion was used to change the timbre of the therapist’s voice so that it sounded more similar to the voice in the patient’s hallucinations. A visual speech synthesis (VSS) system with a real-time lip synchronization algorithm was used. During the session the therapist was located in another room and was able to talk with the patient either in their natural voice by means of a video conference or, depending on a switch position, in their converted voice through the avatar.

The pilot studies confirmed that the application helped patients to control their hallucinations in real life after a series of short sessions. A clinical trial was started in 2012 and it is still in progress [7].

There are also other examples of successfully using audio-visual techniques in assisting patients with psychological disorders. One of them is the therapy of patients suffering from depression [10, 17], and other examples are described for patients with different kinds of phobias [4, 20].

2.3 Synthetic Speech in Assistive Context

Speech synthesis, apart from classical applications in man-machine interfaces, has been known to assist people with various disabilities. One such application is using synthetic speech to support patients who lost their ability to speak, either due to surgery (e.g., a laryngectomy) or neurological damage (e.g., caused by a stroke). Such solutions using text-to-speech (TTS) technology are often called Voice Output Communication Aids (VOCAs).

There has been research on creating VOCAs with personalized voices. Such an approach was proposed for individuals with dysarthria [8]. The authors proposed adapting a statistical model of dysarthric speech extracted from an individual’s voice, using the HTS toolkit. During the study, personalized synthetic voices for two participants with dysarthria were built and evaluated. Participants assessed the technique as promising and convincing.

TTS technology undoubtedly provides enormous support to visually impaired people. There are many free and commercial reading assistants that read aloud specified content, e.g., e-mails, books, messages, news articles, bus schedules, or temperature and weight readings. An overview of TTS-based software and hardware for visually impaired people can be found in [11].

So far, however, we have found no information about using synthetic speech in therapy for psychological disorders. Our work proposes the use of this technique for the therapy of auditory verbal hallucinations.

3 Proposed Approach

In the proposed approach we decided to use synthetic speech instead of converted speech, as described in [13]. There were several reasons for that change:

Fig. 1. Two faces used by the avatar.

  • we wanted the therapist to sit next to the patient and assist them during interactions with the avatar, so that the therapist can control the relationship between the patient and their symptom. This was hardly possible when the therapist was in another room;

  • we did not want to rely on the oral skills of the therapist. When using synthetic speech the vocal effect is fully controlled by the system;

  • voice conversion alters voice timbre and pitch to some extent; however, it usually has no impact on duration or on other speaking habits by which a speaker can be recognized. Therefore, there is a risk that the patient will associate the avatar’s voice with the therapist, which would be highly undesirable;

  • using converted speech required two separate rooms, which can sometimes pose logistical problems.

Based on these considerations, we set up a proof-of-concept implementation and ran initial experiments with patients. The system was developed for the Polish language, as the patients were native speakers of Polish.

3.1 Proof-of-Concept Setup

We proposed that the therapeutic session take place in a single room, so that the therapist was able to talk with the patient face to face. The patient sat in front of the screen on which the avatar was displayed and watched the animations, which were discreetly controlled by the therapist. The therapy session was divided into two phases:

  1. In the first (offline) phase, the therapist prepared an individualized set of prompts for a given patient (i.e., with the content of the hallucinations) and the video files with animations were generated.

  2. In the second phase, the patient, accompanied by the therapist, interacted with the avatar, which played back the required animation in a way that was controlled by the therapist.

3.2 Speech and Video Generation

For our proof-of-concept implementation we used a VSS engine developed for Polish [14] with two faces (Fig. 1). The system was based on the XFace toolkit [2], an open source tool for the development of 3D talking heads implemented by FBK-irst in Trento, Italy, which supported animations based on both MPEG-4 muscle deformation and keyframe interpolation. In our approach the VSS engine was driven by synthetic speech sampled at 16 kHz. We decided to use a number of different commercial TTS systems (unit selection-based or HTS-based) to get a natural-sounding voice that would be fully intelligible to the patients. As input, XFace required a sequence of visemes with timestamps provided for each lip movement, in the form of a .pho file. The overall system design is presented in Fig. 2.

Fig. 2. Avatar system design.

3.3 Speech and Video Synchronization

The drawback of the commercial TTS software was the lack of phoneme timestamps, which are required to control the face movements. To overcome this problem we used another TTS engine, the freely available eSpeak [1], which generates poor quality speech but provides timestamps. Next, we used the dynamic time warping (DTW) algorithm [19] to align the speech signals generated by the commercial and eSpeak TTS systems. To improve this alignment we normalized both speech signals using a time-frequency automatic gain control (TF-AGC) algorithm implemented by D. Ellis [9].
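The alignment step can be illustrated with a minimal sketch of classic DTW on two one-dimensional feature sequences. This is not the implementation used in the study: the function name dtw_path, the absolute-difference local distance and the three-way step pattern are assumptions made for illustration only.

```python
from math import inf


def dtw_path(a, b):
    """Align two 1-D feature sequences with classic dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    running from (0, 0) to (len(a) - 1, len(b) - 1).
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance (assumed here: absolute difference).
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match
                                 cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1])      # b advances alone

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Toy example: the second sequence is a time-stretched copy of the first.
total, path = dtw_path([0, 1, 2, 1, 0], [0, 0, 1, 2, 2, 1, 0])
print(total, path)
```

In the toy example the second sequence repeats some values of the first, so the path pairs each repeated frame with the same frame of the shorter sequence, which is the behavior needed when two TTS engines speak the same text at different local rates.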

The DTW algorithm was applied to sequences of root mean square (RMS) energy features calculated on 50 ms frames of the normalized signals. The resulting alignment was used to map the timestamps generated by eSpeak onto the speech generated by the commercial software. Next, phoneme-to-viseme mapping was performed, according to the description in [14], using an inventory of 12 visemes. Finally, the XFace animation was generated.
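A minimal sketch of the feature extraction and timestamp transfer is given below. It assumes non-overlapping frames (the text does not state whether frames overlapped), integer timestamps in milliseconds, and a warping path given as (eSpeak frame, commercial frame) pairs. The names rms_frames and map_time_ms and the example path are hypothetical.

```python
from math import sqrt

SAMPLE_RATE = 16000  # Hz, the sampling rate stated for the synthetic speech
FRAME_MS = 50        # frame length stated in the text


def rms_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """RMS energy of consecutive non-overlapping frames (overlap is an assumption)."""
    size = sample_rate * frame_ms // 1000  # 800 samples at 16 kHz
    return [
        sqrt(sum(x * x for x in samples[k:k + size]) / size)
        for k in range(0, len(samples) - size + 1, size)
    ]


def map_time_ms(t_ms, path, frame_ms=FRAME_MS):
    """Map a timestamp in the reference signal onto the target signal.

    `path` is a DTW warping path of (reference_frame, target_frame) pairs.
    Assumption: the first target frame matched to the reference frame is
    used, so the result has frame-level (50 ms) resolution.
    """
    ref_frame = min(t_ms // frame_ms, path[-1][0])
    matches = [j for i, j in path if i == ref_frame]
    return min(matches) * frame_ms


# Three frames of a synthetic signal: amplitude 1, silence, amplitude 2.
signal = [1.0] * 800 + [0.0] * 800 + [-2.0] * 800
print(rms_frames(signal))

# Hypothetical warping path and reference timestamps (in ms).
path = [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4), (3, 5), (4, 6)]
print([map_time_ms(t, path) for t in [0, 50, 100, 200]])
```

Working in integer milliseconds avoids floating-point rounding when a timestamp falls exactly on a frame boundary.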

The animation together with the synchronized voice was recorded using screen recording software called recordmydesktop, which is available on the Linux platform. When needed, the video file was edited using a video editor. Only minor video modifications were applied, e.g., the image was rescaled if it was known that the patient’s auditory hallucination was accompanied by a slim or a corpulent face.

4 Results Assessment and Discussion

To assess the initial results of the therapy of auditory hallucinations we asked six patients, who had either finished their therapy or were in its final stage, to fill in a questionnaire. Similar to other studies (e.g., [18]), we used a 5-point Likert scale [12], which is widely used in the assessment of patient satisfaction. We presented the patients with 13 statements about the quality of the speech and animation, how helpful the therapy was, etc. The patients were asked to decide whether they strongly disagreed with the statement (scored as 1), disagreed (scored as 2), had no opinion (scored as 3), agreed (scored as 4), or strongly agreed (scored as 5). The full list of statements with the average results is displayed in Fig. 3.
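The scoring scheme can be sketched as follows. The helper name likert_mean is hypothetical and the responses are made-up values for illustration, not data from the study.

```python
def likert_mean(responses):
    """Mean of 5-point Likert responses (1 = strongly disagree, 5 = strongly agree)."""
    if not responses:
        raise ValueError("no responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("responses must be integers from 1 to 5")
    return sum(responses) / len(responses)


# Made-up answers of six respondents to a single statement.
example = [5, 4, 4, 3, 5, 3]
print(round(likert_mean(example), 1))  # 24 / 6 = 4.0
```

With six respondents, each per-statement average is a multiple of 1/6, so reported values are such averages rounded to one decimal place.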

Fig. 3. Results of the initial evaluation of the avatar therapy by patients.

A similar set of statements was presented to the therapist; during this initial stage of the experiment the therapy was conducted by a single psychiatrist. This set was extended to include statements about the therapist’s experience as a user of the system: how user-friendly the tool was, whether additional functionalities would be needed, etc.

The results of the patient survey showed that the therapy using synthetic speech was accepted by all the patients taking part in the experiment. All of them found that the therapy helped them (score: 4.7); moreover, five out of six patients stated that they liked talking to the avatar (score: 4.2). All of the patients said that the speech generated by the avatar was intelligible. A high score (4.7) for the statement “Avatar’s utterances are natural and fluent” indicates that the quality of the TTS systems used was high. The synchronization of the animation was also rated highly and judged as “fluent,” with a score of 4.7. In the comments the patients stressed that the therapist’s presence next to them during the sessions with the avatar was very important.

In contrast, the score for how well the avatar’s face fit the hallucinations was lower, at 3.8. This result was not surprising, since no advanced face modification was used apart from basic operations such as rescaling. The answers from the therapist confirmed that the proposed technique was highly effective in the treatment of auditory hallucinations and that, if developed into a fully functional system, it would be a valuable tool.

5 Conclusions and Future Work

The proposed avatar solution using synthetic speech turned out to be highly promising. All of the patients who used the solution stated that the avatar-based therapy was helpful and that after the therapy the auditory hallucinations were either less severe or the patients had learned how to cope with them better. One of the patients wrote in the survey: “Thanks to the avatar I understood better my hallucinations and now most of them are gone.” They also commented that the therapist’s presence during the sessions with the avatar was highly supportive.

All of the patients accepted the quality of the synthetic speech, found it easy to understand and judged it “fluent and natural.” We suspect that the somewhat “non-natural” origin of the synthetic speech suited the “non-naturalness” of the voice hallucinations well. To the best of our knowledge, this work is the first to show the use of synthetic speech in the therapy of psychological disorders.

In future work, following the patients’ and the therapist’s suggestions, we plan to develop a fully functional system that would offer the therapist full control over the system (including the possibility to quickly generate new sentences), higher quality animations and, possibly, a mobile version for patients.