1 Introduction

Vocal tract (VT) recording and modeling have been the subject of investigation by many research groups. However, the inner structure of the VT is so complex that capturing how sound is generated within it remains a challenge. A variety of instruments have been implemented to record dynamic articulations during speech, each with its own advantages and disadvantages; however, none of them can record data containing all the information about the articulators. Alongside X-ray, computed tomography (CT), and magnetic resonance imaging (MRI), ultrasound imaging is one of the four major medical imaging techniques [17, 18]. Each of these four techniques has particular advantages and disadvantages with respect to recording articulation. Ultrasound imaging, which is widely used in clinics, is convenient, safe, and fast, and provides real-time scan results. However, owing to its imaging mechanism, ultrasound images suffer from extensive speckle noise and provide only limited information about the subject’s articulators [2]. On the other hand, although electromagnetic articulography (EMA) [6, 13] data contain precise sensor locations, they lack complete information about the surface of the tongue.

Therefore, in our experiment, the EMA and ultrasound systems are used simultaneously as a complementary pair to record tongue movement: the ultrasound images provide the EMA data with a complete tongue contour, while the EMA data add key-point information to the images, such as the positions of the upper and lower teeth, the lips, and the tongue tip. These points play an important role in complementing the tongue information and in establishing the relationship between different ultrasound image frames. The ultrasound images and the synchronized audio were obtained with a portable ultrasound system using data collection software developed by our team. The EMA system was used to collect flesh-point information of speech articulation in synchrony with the audio. The ultrasound images and the EMA data were then registered and matched using the audio stream. High-speed cameras were also used to collect facial information. In total, four modalities of data were synchronized and recorded together. The ultrasound images and the EMA data were also integrated spatially at each time point.

The multi-modal articulatory data can be applied to vocal tract visualization, speech training, and silent speech recognition. After recording the multi-modal data, an active shape model-based approach was proposed to model the articulatory data.

The paper is organized as follows. Section 2 introduces the acquisition system, including its hardware and software. The analytical process and the procedure for integrating the ultrasound and EMA data are described in Section 3. The active shape model-based tongue shape reconstruction approach is presented in Section 4. The conclusion is given in Section 5.

2 Data acquisition system

2.1 Brief description of the system

The acquisition system mainly consists of four parts: the portable ultrasound system for collecting ultrasound images, the EMA system for collecting EMA data, the audio system for collecting the audio signal that synchronizes both systems, and the probe stabilization devices (helmet and stand). The main components of the system are shown in Fig. 1.

Fig. 1

a EMA system b ultrasound system c Roland Octa-Capture UA-1010

Details of the equipment are as follows:

  • Portable ultrasound system: a Terason T3000 ultrasound system with an 8MC3 micro-convex transducer and a stand for the transducer

  • WAVE system: the 8-channel WAVE system from Northern Digital Inc. is used to collect the EMA data. It comprises a field generator, a mounting arm, a system control unit, a system interface unit, micro-sensors, and an audio synchronization cable.

  • Audio system: the audio interface Roland Octa-Capture UA-1010, a Studio Project CS5 condenser microphone, two 6.5 mm to 3.5 mm audio adapters for connecting the audio interface and the laptops

  • Helmet and stand: in order to stabilize the ultrasound probe, we developed a helmet-based stand and a magic arm-based stand for stabilization during data recording [15]. The helmet is constructed of a special material usually used in head surgery, to reduce its weight. The magic stand and helmet stand can be used for different situations and purposes. The helmet-based probe stand is able to stabilize the ultrasound probe to a greater extent than the magic arm-based stand. However, it could affect the quality of the recorded facial information. In this work, we selected the magic arm-based stand for experiments.

2.2 Hardware setup of the acquisition system

During the experiments, it is unavoidable that the subject’s head will move, so the ultrasound probe must be stabilized against the subject’s chin. Ultrasound-based systems developed to date include ESPCI [5], HATS [16], and Palatron [12]. Our acquisition system originally employed a helmet to stabilize the ultrasound probe, but this caused the subject to tire during prolonged recording sessions. We therefore replaced the helmet with a stand holding a plier-like clamp to stabilize the probe, as shown in Fig. 2.

Fig. 2

Photograph showing data being recorded with the acquisition system

2.3 Software part of the acquisition system

Three sets of data are obtained from the experiments: ultrasound images, EMA data, and the audio signals that synchronize them. The ultrasound image acquisition program was developed on top of the SDK of the Terason ultrasound system, to which new functions were added, such as file storage, information display, and synchronization between imaging and audio. The audio and image streams are processed in parallel using multithreaded programming [15]. A timestamp is attached to each image to facilitate synchronization with the audio signal and with the EMA data.
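To make the synchronization scheme concrete, the following sketch illustrates one way to attach a system-clock timestamp to each acquired frame in a dedicated acquisition thread, with the audio handled in a parallel thread. It is only an illustration of the multithreading and timestamping idea: grab_frame() is a hypothetical placeholder for the Terason SDK call, whose actual interface is not described here.

```python
import queue
import threading
import time

frame_queue = queue.Queue()

def grab_frame():
    """Hypothetical placeholder for the Terason SDK frame-grabbing call."""
    time.sleep(1 / 60)                 # emulate a 60 fps image stream
    return bytes(640 * 480)            # one 8-bit, 640 x 480 image

def image_acquisition(stop_event):
    """Acquire frames and attach a system-clock timestamp to each one."""
    while not stop_event.is_set():
        frame = grab_frame()
        stamp = time.time()            # same clock used to name the image files
        frame_queue.put((stamp, frame))

stop = threading.Event()
worker = threading.Thread(target=image_acquisition, args=(stop,), daemon=True)
worker.start()                         # audio capture would run in a parallel thread
time.sleep(0.5)
stop.set()
worker.join()
print(f"captured {frame_queue.qsize()} timestamped frames")
```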

The EMA data include the trajectories of the flesh points and the audio files provided by the EMA system.

3 Vocal tract movement data acquisition

3.1 Description of experiment

Firstly, we set up the EMA system and place the magnetic field generator in position. Then, we attach the sensors to the subject. All sensors are cleaned with alcohol and glued to 11 points on the subject (left ear, right ear, nose, tongue tip on the mid-sagittal plane, tongue blade on the mid-sagittal plane, a point between the tongue blade and tongue dorsum, tongue dorsum on the mid-sagittal plane, upper teeth, lower teeth, upper lip, and lower lip) to collect articulatory data. Two more sensors are attached to the ultrasound probe as references. The sensor positions on the tongue and on the probe are shown in Fig. 3. Once the sensors are attached, the software reports the status, location, and connection of each sensor, and adjustments are made to ensure that all sensors are functioning. Secondly, after the status of all sensors has been checked, the ultrasound system is started to verify that the ultrasound probe is working properly. The subject then rests his or her head on the ultrasound probe, which is coated with ultrasound gel to improve image quality, adjusting the head position slightly. Finally, guided by the live images on the monitor, the image parameters (including image size, image depth, gain, compression ratio, and noise suppression) are adjusted so that clear tongue contours are obtained at a stable and high acquisition frequency.

Fig. 3

a Sensor positions on the tongue and b sensors on the probe

After the preparation is finished, recording starts and the data are recorded sentence by sentence. A beep sound prompts the subject to start reading.

The corpus is a Chinese speech database designed for speech synthesis applications (named the Corpus of Speech Synthesis of the National “863” Project) [11]. For this experiment, 350 sentences were selected from the corpus; each sentence takes about 8–15 s to read at normal speed. The whole recording process lasts approximately one hour.

3.2 Structure of collected data

The ultrasound system collects ultrasound images together with synchronized audio files. The images are 8-bit grayscale bitmaps with a resolution of 640 × 480 pixels, and each file name includes a timestamp taken from the system clock. The frame rate of the image stream is 60 fps. The audio is recorded at 16 bits with a sampling rate of 44.1 kHz.

The EMA system generates ‘.raw’ and ‘.wav’ files after data collection, which store the articulatory data and the speech sound, respectively. In the ‘.raw’ files, the sensor trajectories are already aligned with the corresponding audio by the EMA equipment. The sampling frequency for the sensor movement is set to 100 Hz by default. As with the ultrasound system, an audio file is recorded, here with a sampling rate of 22,050 Hz.

3.3 Data analysis and fusion

3.3.1 Synchronization

Two groups of audio files are recorded during a single acquisition procedure; they are synchronized with the ultrasound images and the EMA data, respectively. The two audio signals originate from the same microphone and are transferred from the audio interface to the two computers; hence, the two audio files should theoretically be identical and have the same length. Although the impedance difference between the sound cards of the two computers causes an amplitude difference, the basic features of the two audio files are similar. After locating the beep sound and using it as a reference, the two audio files can be completely co-registered. With the audio files aligned, the ultrasound images and the EMA data can be aligned on the time axis. Figure 4 shows the waveforms of the audio files of one Chinese sentence for both data sets.
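As a concrete illustration of this beep-based co-registration, the sketch below locates the beep in each audio track by cross-correlating it with a synthetic tone template and uses the difference between the two onsets as the time offset between the recordings. The beep frequency, duration, and the test signals are illustrative assumptions, not values from the experiment.

```python
import numpy as np

def beep_onset(audio, fs, beep_freq=1000.0, beep_dur=0.1):
    """Estimate the beep onset (in seconds) by cross-correlating the track
    with a synthetic beep template (frequency and duration are assumptions)."""
    t = np.arange(int(beep_dur * fs)) / fs
    template = np.sin(2 * np.pi * beep_freq * t)
    corr = np.correlate(audio, template, mode="valid")
    return np.argmax(np.abs(corr)) / fs

# Hypothetical test signals: a 1 kHz beep at 0.30 s in the ultrasound track
# (44.1 kHz) and at 0.25 s in the EMA track (22.05 kHz).
fs_us, fs_ema = 44100, 22050
t_us, t_ema = np.arange(fs_us) / fs_us, np.arange(fs_ema) / fs_ema
audio_us = np.where((t_us > 0.30) & (t_us < 0.40), np.sin(2 * np.pi * 1000 * t_us), 0.0)
audio_ema = np.where((t_ema > 0.25) & (t_ema < 0.35), np.sin(2 * np.pi * 1000 * t_ema), 0.0)

offset = beep_onset(audio_us, fs_us) - beep_onset(audio_ema, fs_ema)
print(f"time offset between the two recordings: {offset:.3f} s")
```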

Fig. 4

Signals of two different audio files corresponding to the EMA data (top) and ultrasound data (bottom)

3.3.2 Fusion of multi-modal data in the space

As described above, the EMA system collects articulatory data and stores it in ‘.raw’ files which contain 5D information including the position and the rotation for each sensor at each timestamp. The ultrasound system collects ultrasound images of the mid-sagittal plane of the tongue in 640 × 480 bitmap image format which can be considered as a 2D matrix. In order to fuse the two different kinds of data together, we need to find the relationship between the two coordinate systems and perform a coordinate transformation to incorporate them into the same coordinate system.

  1) Finding the mid-sagittal plane:

To fuse the EMA data and the ultrasound images, we first have to locate the mid-sagittal plane. For the EMA data, the mid-sagittal plane is determined from the three reference points (left ear, right ear, and nose) after compensating for head movement. We denote the points on the left ear, right ear, and nose as A, B, and C, respectively, and obtain the mid-sagittal plane of the head from these three points: A and B form the vector \( \overrightarrow{AB} \), which is perpendicular to the mid-sagittal plane, whereas point C lies on the plane.

We obtain the three-dimensional (3D) coordinates of the points A, B, and C from the EMA data with which we can obtain the equation of the mid-sagittal plane:

$$ ax+by+cz+d=0 $$
(1)

where a, b, c, and d are constants, determined by the coordinates of the points A, B, and C.

The intersection point of the plane and the line connecting the two ears (vector \( \overrightarrow{AB} \)) is taken as the origin of the plane coordinate system. The direction from the origin to the nose point (point C) defines the X-axis. The Y-axis is perpendicular to the X-axis, pointing downward from the nose toward the tongue.

To convert the 3D EMA data to 2D data, we project the remaining EMA points onto the mid-sagittal plane and obtain their plane coordinates. Given a point \( \overrightarrow{P}\left(X,Y,Z\right) \), its projection \( {\overrightarrow{P}}_0\left({X}_0,{Y}_0,{Z}_0\right) \) onto the mid-sagittal plane is calculated by the following formula.

$$ {\overrightarrow{P}}_0=\overrightarrow{P}+\left[ at,bt,ct\right] $$
(2)

Here, a, b, and c are the constants in Eq. (1), and t is defined by:

$$ t=-\left( aX+bY+cZ+d\right)/\left({a}^2+{b}^2+{c}^2\right) $$
(3)

where X, Y, and Z are the coordinates of point \( \overrightarrow{P} \), and a, b, c, and d are the constants in Eq. (1).
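The construction of the plane in Eq. (1), the projection of Eqs. (2)–(3), and the 2D plane coordinate system described above can be sketched as follows. The sensor coordinates are illustrative only, and the sign of the Y-axis may need to be flipped to match the downward convention of a given recording.

```python
import numpy as np

def plane_from_sensors(A, B, C):
    """Mid-sagittal plane of Eq. (1): normal along AB, passing through the nose point C."""
    n = B - A                                # (a, b, c)
    d = -np.dot(n, C)
    return n, d

def project_to_plane(P, n, d):
    """Project a 3D point onto the plane, Eqs. (2)-(3)."""
    t = -(np.dot(n, P) + d) / np.dot(n, n)
    return P + n * t

def to_plane_coords(P, A, C, n, d):
    """Express a projected point in the 2D frame of the text: origin at the
    intersection of line AB with the plane, X-axis toward the nose, Y-axis
    perpendicular to it within the plane."""
    origin = project_to_plane(A, n, d)       # line AB is parallel to n, so this is the intersection
    x_axis = (C - origin) / np.linalg.norm(C - origin)
    y_axis = np.cross(n / np.linalg.norm(n), x_axis)
    P0 = project_to_plane(P, n, d)
    return np.array([np.dot(P0 - origin, x_axis), np.dot(P0 - origin, y_axis)])

# Illustrative sensor coordinates in millimetres, not measured values.
A = np.array([-60.0, 0.0, 0.0])              # left ear
B = np.array([60.0, 0.0, 0.0])               # right ear
C = np.array([0.0, 90.0, 10.0])              # nose
n, d = plane_from_sensors(A, B, C)
tongue_tip = np.array([5.0, 40.0, -20.0])
print(to_plane_coords(tongue_tip, A, C, n, d))
```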

  2) Transforming the coordinates:

After determining the mid-sagittal plane of the EMA data and projecting the points onto it, the EMA and ultrasound coordinates lie in the same plane. A simple coordinate transformation consisting of three steps, scaling, rotation, and translation, is then applied.

Once the point \( {\overrightarrow{P}}_0 \) in the EMA coordinate system is known, we apply the following coordinate transformation to obtain the new coordinate \( {\overrightarrow{P}}_u \) in the ultrasound coordinate system.

$$ {\overrightarrow{P}}_u=s\times \left({\overrightarrow{P}}_0\times r\right)+t $$
(4)

where s, r, and t are the scaling, rotation, and translation parameters, respectively: s is a constant representing enlargement or reduction during the coordinate transformation, r is the rotation angle in degrees, and t is a vector representing the distance and direction of the translation.

Among the three parameters above, the scaling value s is fixed and can be determined as follows. The unit length in the EMA system is 1 mm. To calculate the scaling parameter, we need the unit length of the ultrasound image coordinate system, i.e., the physical length corresponding to 1 pixel. When configuring the ultrasound imaging parameters, we set the depth (the height of the sector) to 8 cm, as shown in Fig. 5. This gives a unit length of 0.223 mm per pixel and hence a scaling parameter s of 4.4875 in the coordinate transformation.
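A minimal sketch of the transformation in Eq. (4) is given below, interpreting the product with r as a 2D rotation. The scaling, rotation, and translation values are those reported in the text for the reference frame; the sample point is taken from the previous sketch and is illustrative only.

```python
import numpy as np

def ema_to_pixels(P0, s, r_deg, t):
    """Similarity transform of Eq. (4): scale s (pixel/mm), rotation r (degrees),
    and translation t (pixels); the product with r is applied as a 2D rotation."""
    r = np.deg2rad(r_deg)
    R = np.array([[np.cos(r), -np.sin(r)],
                  [np.sin(r), np.cos(r)]])
    return s * (R @ np.asarray(P0, dtype=float)) + np.asarray(t, dtype=float)

# 1 pixel corresponds to 0.223 mm at 8 cm depth, giving s = 4.4875 pixel/mm;
# r and t are the reference-frame values reported for Fig. 6.
s, r_deg, t = 4.4875, -20.0, np.array([-10.0, 90.0])

tongue_point_mm = np.array([37.5, -24.3])    # a point in mid-sagittal plane coordinates (mm)
print(ema_to_pixels(tongue_point_mm, s, r_deg, t))
```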

  3) Locating the reference image and adjusting the other images automatically:

Fig. 5

Scan depth of ultrasound signal for the ultrasound imaging system as indicated by the distance between the two red lines

Firstly, a reference image corresponding to the silent posture is selected, and the translation and rotation parameters are manually tuned to match the EMA data to the ultrasound image. For the sentence shown in Fig. 4, the chosen reference image and the mapping result are shown in Fig. 6. The rotation is −20°, and the translations along the X- and Y-axes are −10 and 90 pixels, respectively.

Fig. 6

Reference image and mapping result. The red, white, green, and yellow points represent the four sensors pasted on the tongue. The blue line is the result of interpolation between the four points

After selecting the reference image, we record the probe position in that frame as a reference. In the other images, a translation of the probe corresponds to a translation of the image; thus, the position of each sensor is compensated by subtracting the probe’s translation. This reduces the effect of probe movement. Figure 7 shows the result before and after this compensation, and Fig. 8 shows the mapping result for the sentence as a function of time.
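The compensation step amounts to subtracting the probe displacement, measured relative to the reference frame, from every mapped sensor position. A minimal sketch, with illustrative pixel values, is given below.

```python
import numpy as np

def compensate_probe_motion(sensor_xy, probe_xy, probe_xy_ref):
    """The image moves with the probe, so the probe displacement relative to the
    reference frame is subtracted from each sensor position before mapping."""
    return sensor_xy - (probe_xy - probe_xy_ref)

# Illustrative values in pixels, not measured data.
probe_ref = np.array([320.0, 470.0])      # probe position in the reference frame
probe_now = np.array([323.0, 468.0])      # probe position in the current frame
tongue_tip = np.array([410.0, 250.0])     # mapped sensor position in the current frame
print(compensate_probe_motion(tongue_tip, probe_now, probe_ref))
```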

Fig. 7

Mapping result a without and b with the complementary process

Fig. 8

Mapping results of the sentence at different times

3.4 Validation

Verifying the precision of the method requires quantifying the distance between the mapped EMA points and the real tongue contour. A tool named EdgeTrak is used to extract the tongue contour semi-automatically [10]. Although EdgeTrak needs several manually selected contours as references, which inevitably introduces errors, it provides a preliminary yet convincing way to validate the mapping results.

In total, 365 ultrasound images are used for calculating the average error using the following equation:

$$ Er{r}_{ave}=\frac{{\displaystyle \sum_{i=1}^k\sqrt{{\left({x}_i-{x}_c\right)}^2+{\left({y}_i-{y}_c\right)}^2}}}{k} $$
(5)

where \( (x_c, y_c) \) and \( (x_i, y_i) \) are the coordinates of the reference point on the labeled contour and of the corresponding mapped EMA point, respectively, and k is the total number of points. The resulting average error is 1.8 mm.
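The average error of Eq. (5) can be computed as sketched below. How the reference contour points are paired with the EMA points is our assumption (nearest neighbour on the EdgeTrak contour); the contour and sensor positions are synthetic stand-ins.

```python
import numpy as np

def average_error(ema_points, contour_points):
    """Eq. (5): mean Euclidean distance from each mapped EMA point to its
    reference point on the labeled contour (nearest-neighbour pairing assumed)."""
    dists = np.linalg.norm(ema_points[:, None, :] - contour_points[None, :, :], axis=2)
    return dists.min(axis=1).mean()

# Synthetic example: a circular-arc "contour" and four noisy EMA points near it.
rng = np.random.default_rng(0)
theta = np.linspace(0.3, 2.8, 100)
contour = np.stack([100 * np.cos(theta) + 320, 100 * np.sin(theta) + 240], axis=1)
ema = contour[[10, 40, 70, 95]] + rng.normal(0.0, 2.0, size=(4, 2))
print(f"average error: {average_error(ema, contour):.2f} pixels")
```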

3.5 Data set description

In this study, we recorded data from three subjects. One subject recorded 350 sentences, constituting a one-hour dataset; the other two subjects each recorded 100 sentences, corresponding to 20–30 min of data. The corpus is the Chinese speech database designed for speech synthesis applications described in the previous section. The data were preprocessed by de-noising and data cleaning, and the annotation was conducted manually.

4 Modeling the tongue movements

4.1 Active shape model

The Active Shape Model (ASM) has been successfully used to track objects in images automatically. It was proposed by Cootes and Taylor in 1995 as a statistical point distribution model (PDM) [4]. The shape of the object is represented by a set of points controlled by the shape model, and the ASM algorithm aims to match the model to an unseen image. This approach has been widely used to analyze images of faces [9], mechanical assemblies, and medical images (in both 2D and 3D).

An ASM describes the shape of the object of interest in an image through a statistical shape model obtained from examples in a training set. When applied to image interpretation or segmentation, the ASM minimizes the difference between the image synthesized from the model and an unseen image by tuning the model parameters [1, 3, 8, 14, 19, 20, 21]. The vocal tract shapes obtained from articulatory images can also be applied to acoustic simulations [7].

In our study, the ASM was built in the following steps [3]. Before building the ASM, the tongue contours on the mid-sagittal ultrasound images were semi-automatically annotated with EdgeTrak for both static vowels and vowel-vowel (VV) sequences [10]. To establish the correspondence between different frames, the EMA points are used as identical landmarks across ultrasound images; they also delimit the tongue contour segment running from the tongue dorsum to the tongue tip, as shown in Fig. 9. All the labeled images form the training set, in which each tongue contour is described by n evenly distributed points, with n = 41. We define \( x_i = (x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n) \) as the i-th shape vector, where \( (x_k, y_k) \) are the coordinates of the k-th point.
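The paper does not specify how the n evenly distributed points are obtained; the sketch below shows one common choice, resampling each annotated contour at equal arc-length intervals and flattening it into the shape vector \( (x_1, \ldots, x_n, y_1, \ldots, y_n) \).

```python
import numpy as np

def resample_contour(points, n=41):
    """Resample a polyline to n points evenly spaced along its arc length and
    return the flattened shape vector (x_1..x_n, y_1..y_n)."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    target = np.linspace(0.0, cum[-1], n)
    x = np.interp(target, cum, points[:, 0])
    y = np.interp(target, cum, points[:, 1])
    return np.concatenate([x, y])

# Illustrative contour (an arc) standing in for an EdgeTrak annotation.
theta = np.linspace(0.3, 2.8, 200)
contour = np.stack([100 * np.cos(theta) + 320, 100 * np.sin(theta) + 240], axis=1)
shape_vec = resample_contour(contour)
print(shape_vec.shape)                     # (82,) for n = 41
```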

Fig. 9

Tongue contour a before and b after segmentation by the EMA points (tongue tip and tongue dorsum)

Firstly, we calculated the covariance matrix of the adjusted shape vectors. The covariance matrix is defined as follows:

$$ S=\frac{1}{N}{\displaystyle \sum_{i=1}^N\left({x}_i-\overline{x}\right){\left({x}_i-\overline{x}\right)}^T} $$
(6)

where \( \overline{x} \) is the mean shape of all the vectors in the training set [4].

Secondly, we calculate the eigenvalues of S, \( (\lambda_1, \lambda_2, \ldots, \lambda_n) \), sorted so that \( \lambda_i \ge \lambda_{i+1} \) for i = 1, 2, …, n−1. We choose the first t eigenvalues under the following condition:

$$ \sum_{i=1}^{t}{\lambda}_i \bigg/ \sum_{i=1}^{m}{\lambda}_i \ge 90\% $$
(7)

Then, we take the eigenvectors corresponding to the first t eigenvalues \( \lambda_i \) to form the matrix \( \mathbf{P} = (p_1, p_2, \ldots, p_t) \).

After obtaining the mean shape vector \( \overline{x} \) and the eigenvector matrix \( \mathbf{P} \) of the training set, the tongue shape model can be expressed as \( x=\overline{x}+\mathbf{P}\mathbf{b} \). Thus, for a given shape we can obtain its parameter vector \( \mathbf{b} = (b_1, b_2, \ldots, b_t) \) through the model; conversely, given \( \mathbf{b} \), we can generate the corresponding tongue shape.
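The model-building procedure of Eqs. (6)–(7), together with the projection \( \mathbf{b} = \mathbf{P}^T (x - \overline{x}) \) and the reconstruction \( x = \overline{x} + \mathbf{P}\mathbf{b} \), can be summarized in the following sketch. The random training shapes are placeholders for the real, aligned 41-point tongue contours.

```python
import numpy as np

def build_asm(shapes, variance_kept=0.90):
    """Build the point-distribution model: mean shape, covariance (Eq. 6),
    eigen-decomposition, and the first t modes covering >= 90 % of the
    total variance (Eq. 7). `shapes` is an (N, 2n) array of shape vectors."""
    x_bar = shapes.mean(axis=0)
    diffs = shapes - x_bar
    S = diffs.T @ diffs / shapes.shape[0]              # Eq. (6)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    t = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), variance_kept) + 1
    return x_bar, eigvecs[:, :t]                       # mean shape and modes P

def shape_params(x, x_bar, P):
    """Project a shape onto the model: b = P^T (x - x_bar)."""
    return P.T @ (x - x_bar)

def synthesize(b, x_bar, P):
    """Reconstruct a tongue shape from the parameters: x = x_bar + P b."""
    return x_bar + P @ b

# Placeholder training set: 145 smooth random shape vectors of length 82 (41 points).
rng = np.random.default_rng(0)
shapes = rng.normal(size=(145, 82)).cumsum(axis=1)
x_bar, P = build_asm(shapes)
b = shape_params(shapes[0], x_bar, P)
print(P.shape, np.linalg.norm(synthesize(b, x_bar, P) - shapes[0]))
```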

4.2 Experiments and results

4.2.1 Initialization of the ultrasound images and tongue labeling

In our experiment, the training set contains a total of 145 mid-sagittal ultrasound images of /a/, /i/, and /u/, which include dynamic tongue shapes during articulation. A total of 41 points were used to represent each contour in the training set.

4.2.2 PCA analysis of the training data

After processing the training set, we performed a PCA to extract the model parameters and to calculate the mean shape of the training set. The results are presented in Table 1.

Table 1 Contribution of the first two factors

We chose the first two eigenvalues as the main factors; their accumulated contribution rate reaches 95.61 %, as shown in Table 1. We then calculated the corresponding eigenvectors \( p_i \) (i = 1, 2) and obtained the parameter matrix \( \mathbf{P} = (p_1, p_2) \) of the ASM for our dataset. The tongue ASM was built from these coefficients.

4.3 Synthesizing tongue shape by using ASM

The tongue shapes were synthesized using the ASM described in the previous section. The results are shown in Fig. 10, in which the reconstructed tongue shapes are compared with the originally annotated tongue shapes of the three isolated Chinese vowels /a/, /i/, and /u/. These results indicate that our approach is feasible for synthesizing articulation. The main cause of the differences between the synthesized and real shapes is that only the first two components were used in this ASM modeling procedure. The average error calculated with Eq. (5) over 40 points along each tongue contour is 1.26 mm, where k = 40 × 145 is the total number of sample points along the contours.

Fig. 10

Original and reconstructed tongue contours of vowels a /a/ b /i/ c /u/

5 Conclusions

This paper introduced our acquisition system for recording ultrasound images and EMA data, which we analyzed and combined into a single dataset. The combination procedure synchronizes the two datasets using the audio file recorded with each of them as a reference, so that every ultrasound image can be aligned with the corresponding EMA information. A dataset containing ultrasound and EMA data from three subjects was built using this recording system and data fusion protocol.

We also proposed a method to synthesize the shapes of the vocal tract using the recorded dataset. We trained a set of parameters of the ASM-based model to control the deformation of the tongue shape, thereby facilitating the determination of the relationship between different frames. Furthermore, we realized the synthesis of tongue shapes by interpolating the control parameters of the ASM-based model. Finally, we evaluated our method by comparing the synthesized and real tongue shapes. The results show that our method can reconstruct tongue shapes with an average error of 1.26 mm, indicating that the system could be applied to vocal tract visualization in the future.