Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 General Overview of Speech Communication

1.1 Quality of Experience in the Context of Speech Communication

As defined by Hardy [7], a voice [speech] service corresponds to a voice interaction through a telecommunication system. Footnote 1 Two types of speech services exist, namely (1) speech communication services, which imply a conversation between a talker and a listener (or several listeners in case of teleconferencing systems) in a near “real-time” manner, and (2) streaming services (e.g. recorded messages stored on a device). These services replace the air path between two interlocutors having a face-to-face conversation. Since the success of any service depends on its QoE, the quality assessment of the corresponding speech communication system or speech processing application is required for both the developers and the telecommunication providers.

1.2 Factors of Speech Communication QoE

Even if the quality of the transmitted speech is a factor determining the QoE of speech communication systems, user’s satisfaction encloses many different aspects. According to the theoretical framework of QoE introduced in Chap. 4, the physical factors influencing the QoE are grouped into three categories: human influence factors, context influence factors and the system influence factors. The “human influence factors” here correspond to the talker’s difficulties to produce an acoustic message (e.g. aphonia) and the listener’s difficulties to understand this message (e.g. hearing impairments). Since humans can use speech services in very diverse situations, especially with the massive introduction of mobile terminals, the last category, “context influence factors”, covers many heterogeneous environments (in terms of time and place). The “system influence factors” include all technical characteristics, physical equipment and computer programs, of the speech service. Section 12.2 describes mainly both subcategories, “network related system” and “device-related system” of the more general system influence factor category.

1.3 Features of Speech Communication QoE

The perceived quality of telephone systems has been studied for many decades [5, 9, 28, 40]. In these studies, auditory tests have been carried out where subjects had to judge the perceived quality of transmitted speech. It resulted from these studies that speech quality, like other perceptual magnitudes, is by nature a “multidimensional” object. Researchers introduced many quality features of speech signals: intelligibility, clearness, brightness, loudness, naturalness, nearness, spaciousness, etc. For instance, a good intelligibility of the transmitted and/or processed speech is a prerequisite for a maximum quality rating of the speech service. However, a perfect intelligibility of the talker’s message at the listener’s side is not sufficient to achieve high quality. For instance, the transmitted bandwidth can be restricted to the usual telephone bandwidth, while the intelligibility remains almost perfect.

According to Möller et al. [25], the QoE space of a speech communication service covers aspects of both speech perception and service usage. In Chap. 5, the QoE features were classified in terms of four levels from perception to service usage. In the field of speech communication, the first level of quality features called “level of direct perception” corresponds to the perception by the ear of the acoustic wave and the transmission of the resulting auditory information to the central nervous system. In a conversational situation, when two conversation partners interact, the QoE of the speech service is influenced by several other features classified in terms of the “level of interaction”. For instance, this level includes the naturalness of the interaction between two interlocutors during a phone call. The third level of QoE features, the “level of the usage instance”, includes all features related to the physical and social environment at the talker’s and listener’s side. For instance, the background noise or the room reverberation at the listener’s side has an influence on the listening effort and thus on the QoE of the whole speech communication system [24]. Another example is the advantage of mobility with cordless terminals and mobile telephony. The last category of QoE features, called “level of service”, covers aspects like stability over the entire duration of the communication, call set-up duration or interruptions of the connection. This organization of quality features in four layers shows that quality features are related to both instantaneous and multi-episodic experiences of the service. All of these features play a role in the long-term acceptability of the service and the averaging process is relatively complex (see Chap. 10).

Since many speech quality features exist in the literature, several authors developed perceptual spaces based on few orthogonal quality features referred to as “speech quality dimensions”. The following section summarizes the speech quality spaces proposed in the literature. Wälterman et al. [41] combined two auditory methods to derive a speech quality space composed of the three following dimensions:

  • Discontinuity: this dimension reacts to degradation in the time domain, i.e. an unpredictable variation over time of the signal.

  • Noisiness: this dimension is affected by the amount of unwanted information added to the speech message (either noise or a second talker).

  • Coloration: this dimension can be affected by the two following elements: (1) a deviation from a reference timbre (e.g. dark or bright) and (2) a bandwidth restriction.

These three dimensions are of the type vector model. In other words, the origin of the space defines the highest quality and the space is defined by positive values only.

However, all speech stimuli employed in Wältermann et al. [41] were adjusted to a fix listening level of 79 dB SPL. Consequently, Côté [2] proposed to include a fourth quality feature to the perceptual space; loudness. Indeed, loudness is considered as the main feature of speech services QoE [5]. A loudness impairment is introduced in the case of non-optimal listening level, that is, an attenuation or an amplification introduced by the entire communication system. Loudness thus is a feature of the “ideal-point model” type. The three perceptual dimensions described by Wältermann et al. [41] are considered as orthogonal. However, the perceptual dimension “loudness” can be correlated with the other dimensions. The loudness summation effect shows that the bandwidth of a sound has an impact on its perceived loudness [4]. In Côté et al. [3], the authors showed the converse effect; the coloration due to a speech communication system has an influence on the optimal listening level.

These above described four features should reflect the whole perceptual quality space of transmitted speech. However, Sen [35] proposed a 5-dimensional space including noisiness and two sub-dimensions for each feature, coloration and discontinuity: slowly-varying and rapidly-varying discontinuities and low-frequency and high-frequency distortions. ITU–T Study Group 12 compares these two speech quality spaces within the work item P.AMD.

Fig. 12.1
figure 1

Elements composing a speech communication system. A/D refers to analog to digital conversion, AGC to automatic gain control, EC to echo compensations, CNG to comfort noise generation, PLC to packet-loss concealment, VAD to voice activity detection and to NR noise reduction

2 Speech Communication Systems

Nowadays, speech communication systems are composed of a multitude of components. The section below give an overview of the most important elements composing a speech communication system and their possible impact on its QoE. A typical example of such a system is depicted in Fig. 12.1. From the four types of speech processing systems described by Rabiner [30], only speech coding is introduced in the present chapter. Speech synthesis is covered by Chap. 13, speech recognition and speaker verification technologies are not covered by the present book.

Firstly, a telephone user talks and produces an acoustic signal, \(x(t)\). This signal is received by the microphone of the talker’s handset. However, this handset also receives sound from the environment, \(n(t)\), produced by the sound sources surrounding the telephone user. The microphone converts the acoustic signal into an electrical signal, which is digitalized (i.e. sampled and quantized in \(x[k]\), where k is the sample index) and pre-processed in order to remove the undesired signals (i.e. background noise, reverberation and echo). Then, this processed signal is encoded with a (low bit-rate) speech codec and sent to the transmission network. During the transmission to the handset of the conversation partner, the signal passes through several gateways and nodes. At the listener’s side, a continuous electrical signal is decoded with the help of several digital “post-”processing algorithms. Then, the loudspeaker of the listener’s handset converts the processed electrical signal into an acoustic signal, \(y(t)\). The listener’s acceptability of the whole communication service based (mainly, but among other aspects) on the perception of the transmitted acoustic signal, \(y(t)\), is the subject of the present chapter.

2.1 Telephony Networks

The traditional analog telephone network, referred to as Public Switched Telephone Network (PSTN), has been optimized for an almost perfect intelligibility of the speech message. For instance, the bandwidth of the transmitted speech corresponds to the transmission of the frequencies between 300 and 3,400 Hz that enables a comprehensibility of almost all phonemes. This bandwidth is, nowadays, referred to as Narrow-Band (NB). The PSTN is based on a “circuit-switched” network: the two interlocutors are connected by a physical circuit. In such a network, all physical parameters are well controlled to ensure a stable Quality of Service (QoS): the network accessibility is guaranteed and preserved over the whole call. During the last two decades, the deregulation of the telecommunications market led to heterogeneous transmission systems and speech processing algorithms. The first main transition was the introduction of digital transmission, the ISDN (Integrated Services Digital Network), which resulted in a decrease of circuit noise. Then, mobile phone networks, GSM (Global System for Mobile) and UMTS (Universal Mobile Telecommunication System) networks, have been broadly set up all over the world. The users of mobile telephony services are able to move from a quiet (house, office) to a noisy environment (street, train station) during a phone call. However, these networks are highly dependent on the characteristics of the radio channel between the mobile phone and the antenna. This air path leads to interferences, producing bit errors and frame losses, and handovers between two cells, two codecs and/or even two bandwidths, producing discontinuities in the transmitted signal. These quality variations in mobile networks result in a perceived instability of the communication system (see Chap. 27).

In addition to mobile telephone networks, speech communications over computer networks have been introduced. The Voice over IP (VoIP) protocol is based on a discontinuous transmission of packets of data, and the network is consequently referred to as a “packet-switched” network. Nowadays, the packet-switched network is the most widely used transmission path, because of its enhanced flexibility compared to the circuit-switched network. For instance, large audio bandwidths can be transmitted such as Wideband (WB, i.e. 50–7,000 Hz), Super-Wideband (S-WB, i.e. 50–14,000 Hz) and Full-Band (FB, i.e. 20–20,000 Hz) bandwidths. These wider bandwidths introduce less coloration of the speech compared to the the narrow telephone bandwidth and thus increase the QoE. For instance, a comparison of clean WB and NB transmissions shows an increased quality of 29 % in the WB case [27, 29]. However, VoIP transmissions may increase several quality impairments. For instance, these wider bandwidths may increase the influence of the environmental noise at the talker’s side, and the packetization process lengthens the overall transmission delay. A long transmission delay may introduce an audibility of the talker’s own voice (echo) and reduces talking quality and double talk capability.

A packet-switched network may introduces discontinuities in the transmitted speech message, too. This annoying degradation appears more frequently than in a circuit-switched network. These discontinuities have two origins: (1) the bit-rate allocation is not guaranteed over the whole call and (2) the packets can take different transmission paths that lead to a time-varying transmission delay. This variation in transmission delay is referred to as “jitter”. To generate a continuous signal, a buffer is placed in front of the decoder. The size of this de-jitter buffer (e.g. 120 ms) defines the tolerated lengthening of transmission delay between two consecutive packets. However, the size of the jitter buffer increases the overall transmission delay and, thus, may affect the conversation effectiveness. In case the speech segment may be lost during the transmission or arrives too late to synthesize a continuous signal, an algorithm “reconstructs” the missing packets. This algorithm called Packet-Loss Concealment (PLC) reduces the discontinuities in the speech signal. Nowadays PLC algorithms uses time-scale modifications of the speech signals (also known as “time-warping”) which enable a smooth reconstruction of the waveform and avoid any discontinuity in the speech signal.

2.2 User Interfaces

The physical interface between the customers and the transmission system can be a handset, a headset or a Hands-Free Terminal (HFT). Such acoustic terminals have an influence on the speech coloration. The timbre modification of the talker’s voice is introduced by the electro-acoustic properties of the two transducers (microphone and loudspeaker). Therefore, QoE of user interfaces is determined by their sending and receiving frequency response characteristics. In addition, loudness is a main parameter for all acoustic interfaces. According to the “orthotelephonic reference position” [12], the output signal loudness of such acoustic terminals should be equivalent to the perceived loudness of two interlocutors having a face-to-face conversation at one-meter distance.

Nowadays, the handset manufacturers introduce new services to user terminals in order, for instance, to increase the mobility of the user. For instance, screens with haptic feedback are included in modern mobile phones. Place for transducers is consequently reduced and causes challenges for their acoustic design. Although they enable a greater mobility, these terminals include several digital processing systems such as noise reduction algorithm that may degrade the transmitted speech signal [26].

Table 12.1 Characteristics of NB speech coding algorithms
Table 12.2 Characteristics of WB speech coding algorithms

2.3 Speech Coding

A speech coding algorithm is a system that reduces the network rate used to transmit the speech signal. The speech coder produces a compressed signal from the input speech signal, referred to as the bitstream. After transmission over the network, the aim is to get a synthesized speech signal as similar as possible to the original speech. The impact of the speech codec on QoE depends on three physical characteristics: (1) the bit-rate expressed in kbits, (2) the frame length expressed in milliseconds (typical ranges of frame length are 5–30 ms), and (3) the paradigm employed by the coding algorithm. Tables 12.1 and 12.2 present the characteristics of several NB and WB speech coding algorithms. Almost all speech codecs have a flat band-pass within the allowed transmitted bandwidth (NB, WB or S-WB) and a low quantization noise resulting in a perfect intelligibility of the coded speech. However, they introduce audible non-linear degradations that decrease their perceived quality and affect automatic speech and/or speaker recognition algorithms. The parameter called “equipment impairment factor” \((I_e)\), used in the E-model [13], quantifies the degradation introduced by the coding–decoding process. In addition, the coding–decoding process introduces a delay which impacts the conversation effectiveness. Nowadays, speech codecs use a simple model of human auditory perception [17], are scalable from NB to WB [18], and some modern codecs also allow for coding of both speech and audio signals [16].

2.4 Voice Quality Enhancement

Voice Quality Enhancement (VQE) algorithms are integrated into the network or even directly into the terminal to reduce the new impairments introduced by mobile or VoIP networks. These algorithms are, for examples, echo cancellation, noise reduction, de-reverberation and automatic gain control, see Fig. 12.1. Echoes of the talker’s own voice is introduced either by an acoustic feedback at the listener’s side or by an impedance mismatch at the interconnection between two networks. As already mentioned, the latter effect is exacerbated in packet-based networks due to longer transmission delay. Therefore, echo cancellation techniques are needed if the delay exceeds 15 ms. Noise reduction is another VQE algorithm that has been widely introduced in mobile terminals. It reduces the environmental noise at the talker’s side transmitted by the network. This algorithm complemented by a de-reverberation algorithm and an echo canceller separates the desired signal components from the undesired ones. However, noise reduction algorithms based on spectral subtraction reduce the noise level but simultaneously introduce musical noises on the speech signal [33]. Therefore, Möller et al. [26] proposed to describe the speech degradations resulting from imperfect noise reduction and echo cancellation by two additional equipment impairment factors Inr and Iec.

3 Speech Communication QoE Measurement Methods

The following sections introduce the measurement methods employed to quantify and further estimate the QoE of speech communication services, i.e. speech transmitted through a network and/or processed by speech processing systems. However, voice and speech quality measurement methods are employed in very diverse scientific fields: medicine (e.g. the evaluation of voice-related problems), linguistics (e.g. cultural comparisons) or speech communication. Each field has its own assessment paradigm.

3.1 Auditory Methods

The most accurate auditory measurement method would be an assessment by customers in natural environments. In practice, such “in-field” tests are hardly implemented, and speech services QoE is assessed with artificial auditory quality tests carried out in laboratories where the perception process is “directed” by an experimenter. Many different auditory test methods are employed by the academic laboratories and the speech service industries. For instance, listening-only experiments are carried out to gather the most important QoE features. Their realism is lower than that of conversational tests, since only the transmission system influence factors are assessed. The P-Series of Recommendations published by the ITU–T describe a general framework of speech communication measurement methods. In a listening quality test (referred to as listening-only test by the ITU–T), the listeners rate on a measurement scale a set of short speech samples (4–8 s) transmitted by different speech communication systems. The most widely used measurement scale is the 5-point integral quality scale presented in Table 12.3 (left column [22]). Such methodologies are not suited to compare speech stimuli with small impairments. Consequently, high-quality speech processing systems are assessed by methodologies used in the audio world and published by the ITU–R organization [10, 11] (see Table 12.3, right column).

Table 12.3 5-point scales

Most of the ITU–T and ITU–R auditory methods quantify the quality of a speech service with a single value. This value is often used as an estimation of the overall speech service QoE. In addition to these methods, more complex auditory test methods give diagnostic information about the assessed processing conditions. Such quality tests rely on either a multi-scale rating process or a multidimensional analysis of the auditory results. For instance, Voiers [39] developed a specific multidimensional scaling method called Diagnostic Acceptability Measure (DAM) which assesses quality features of speech samples. More recently, Wältermann [40] developed a similar method to assess the three speech quality dimensions discontinuity, noisiness and coloration. However, such multidimensional tests are expensive and time-consuming since the listeners are trained beforehand (experienced), and they employ several rating scales for each speech stimulus (see also Chap. 5).

3.2 Instrumental Methods

Auditory methodologies rely on judgments by test subjects who are asked to give their opinion about the quality of a speech stimulus. Since auditory tests are costly and time-consuming, instrumental methods have been developed. Instrumental methods have different applications such as the daily monitoring of transmission networks (e.g. VoIP) or the optimization of processing systems (e.g. speech codecs). They provide either a single estimated value that possibly represents the quality of the speech communication system (integral models), or a decomposition of the quality into several quality features (diagnostic models). In the following sections, we review the reliable models employed to predict the different aspects of speech communication QoE. Many building blocks have been developed such as the ITU–T Rec. P.863 [23] model which estimates the listening quality of transmitted speech. However, a tool that covers all aspects of the QoE and predicts the overall QoE of speech communication services in a single value is not available yet.

Richters and Dvorak [31] proposed a performance model based on seven quality criteria (speed, accuracy, availability, security, simplicity and flexibility) for each function of the service (sales, connection, billing, technical support, etc). This model is employed to assess the QoS of speech communication services and covers many aspects of the service usage. More recently, Möller et al. [25] organized all QoS parameters of speech communication services in a theoretical model which covers the four levels of QoE-features (perception, interaction, situation and service). For an example of an exhaustive evaluation of a speech communication service with such quality criteria, see Chen et al. [1].

Many models have been developed and standardized to estimate the quality of transmitted speech in a listening-only situation. Takahashi et al. [37] classified them in three different groups: parameter-based models that use parameters describing the elements of the system (e.g. ITU–T Rec. G.107 [13]), signal-based models that use the transmitted or processed speech signal (e.g. ITU–T Rec. P.863 [23]), and the packet-layer models that use information about the service operation (e.g. ITU–T Rec. P.564 [21]). For instance, the well-known Perceptual Evaluation of Speech Quality (PESQ) model includes a robust time-alignment algorithm useful for VoIP variable delay [32]. The PESQ is now superseded by a new listening-only signal-based model, called POLQA [23], that represents an intrusive speech quality model suitable for NB to S-WB connections, electro-acoustic interfaces and VQE algorithms. Most of these models provide an integral estimation of the quality. Recently, diagnostic models have been developed in order to either indicate (1) the technical causes of a single impairment or (2) describe the communication system QoE on few speech quality features. In the former case, diagnostic models provide useful information to system designers and operators that help them for maintenance purposes. For instance, the ITU–T Rec. P.502 [20] describes standard methods to assess each element of user terminals and network components. The corresponding test signals are described in a separate standard [19]. These methods assess characteristics such as (1) the frequency response, the sidetone, the harmonic distortion and the loudness ratings of the user terminals, and (2) the echo loss, the double talk capabilities and the background noise of the transmission networks. Even tough this first type of diagnostic models provides an exhaustive evaluation of the physical equipments, they do not help telecommunication providers to design a voice service optimized for their specific needs. The second type of diagnostic models describe a voice service in a simple quality space. They help the end-user to choose a voice service based on its cost and its QoE. The benefit of such a diagnostic model has initially been investigated by Quackenbush et al. [28]. More recently, two sets of quality-feature estimators have been developed from the perceptual quality space derived by Wältermann et al. [41]. Côté [2] improved estimators initially developed by Scholz et al. [34] and Huo et al. [8] into a signal-based model called Diagnostic Instrumental Assessment of Listening-quality (DIAL). This model provides values of the four dimensions “coloration”, “discontinuity”, “noisiness” and “loudness”.Wältermann [40] proposed a diagnostic parametric model based on the E-model [13]. In parallel, Sen and Lu [36] derived four estimators for temporally localized (slow-jitter and fast-jitter) distortions and frequency localized (low-pass and high-pass) distortions.

So far, no instrumental model has been standardized for the estimation of the speech quality in a conversational situation. However, Guéguin et al. [6] proposed such a tool that combines estimations from three other models: PESQ and PESQM for listening- and talking-only situations, respectively, and the E-model for a delay impairment factor introduced by the transmission delay. Long-term quality estimation has been studied by more researchers. For instance, Weiss et al. [42] was able to estimate a long-term listening-only speech quality score (up to 2 min.) based on PESQ estimations for 4–8 s stimuli.

4 Conclusions and Future Trends in Speech Communication QoE

In this chapter we presented both the technical elements and quality features which are relevant for the Quality of Experience of speech communication systems. We reviewed the auditory and instrumental methods suitable for speech quality assessment with a focus on diagnostic instrumental models that provide one output per QoE dimension.

Over the last decades, instrumental models have been developed on either speech or music databases. The former ones estimate the QoE of speech services such as telephony, whereas the latter ones are dedicated to audio devices such as loudspeakers or headphones [38]. The new standard model POLQA [23] has been developed for the QoE estimations of speech communication systems only. However, both streaming and telephony services now employ similar packet-based networks. This new usages encourage the researchers to develop a common model that works with both types of input signals.

Current speech quality models do not cover the influence of the listener’s acoustic environment. Indeed, listening through a handset in a noisy environment involves binaural hearing which is not covered by current models. Even though many studies have been published over the last two decades, the effects of binaural hearing are still unclear and difficult to include in such quality models.

Further work is thus expected in the development of reliable instrumental methods. However, such instrumental methods require, at first, auditory test results. Therefore, the community of researchers who works in the field QoE would appreciate collaborations with voice service providers and developers of speech processing systems to get access to databases including specific impairments and/or listening contexts.