1 Introduction

Singing Voice Synthesis systems, as well as Speech Synthesis systems, have been developed over several decades. Though the general approaches to Singing Voice Synthesis and Speech Synthesis are similar, the singing voice is very different from the spoken voice in terms of both its production and its perception by humans. Intelligibility of the phonemic message is very important in speech, while in singing it is often secondary to the intonation and musical qualities of the voice. In singing voice synthesis it is therefore important to convey singing voice phenomena such as vibrato, jitter, drift, the presence of the singer’s formant, and others.

Singing Voice Synthesis systems utilize different approaches, as presented in Sect. 2, using different amounts and different representations of voice units. But until recently, there was no publicly available singing voice database that would allow research on singing voice phenomena and help formulate the necessary and sufficient content of a singer’s database for high-quality singing voice synthesis. The present work describes the first (to the best of the authors’ knowledge) annotated singing voice database, which was initially released in 2012. This database allows studying how various voice phenomena and effects are represented by spectral, temporal, and amplitude characteristics, as well as creating a simplified singing voice synthesis system.

The rest of the paper describes the database (Sects. 3 to 5), ending with our conclusions and possible uses of this database in Sect. 6.

2 Previous Work

Approaches to Singing Voice Synthesis include articulatory synthesis [1], formant synthesis [2, 3], and concatenative and corpus-based synthesis [4,5,6]. In contrast to TTS systems, where the input is text and the output is a speech signal, for Singing Voice Synthesis systems the input is usually a musical score with lyrics, and the output is a synthesized singing voice signal.

The articulatory singing voice synthesis system SPASM [1] maps physical characteristics of the vocal tract to singing voice characteristics and produces a voice signal. The input to this system is not musical notes but vocal tract characteristics, so the system requires the user to have knowledge of music and musical acoustics. For each note the user must specify seven parameters, including the vocal tract shape (radius of each tract section), tract turbulence (noise spectrum and localization), performance features (random and periodic pitch), and others. The system takes singing voice phenomena into account, but the synthesized voice does not sound realistic.

The formant singing voice synthesizer CHANT [2] works with the English language. It is based on rules derived from signal and psychoacoustic analyses, such as the automatic determination of the formant relative amplitudes or bandwidths, or their evolution depending on the variation of other external or internal parameters. CHANT uses an excitation-resonance model to compose a singing voice signal: for each resonance, a basic response is generated using Formant Wave Functions, and these signals are then summed to produce the resulting signal. The system’s synthesis results are impressive in some cases, although this reportedly requires tedious manual tuning of parameters.

Another formant singing voice synthesizer, Virtual Singer [3], supports several languages, including French, English, Spanish, Italian, German, and Japanese. Virtual Singer is an opera-like singing synthesizer. Its main attributes are the wide range of supported languages, the sound-shaping control (timbre and intonation), and the RealSinger function, which allows building a Virtual Singer voice from recordings of the user’s own voice. The singer’s database of Virtual Singer includes the set of phonemes plus the first parts of the diphthongs, represented as spectral envelopes. It assumes that only three to six formants are sufficient to generate a phoneme with acceptable quality. The advantage of this method is that only a small amount of data is required to generate a phoneme, and it is far easier to modify these data slightly to produce another voice timbre. However, the result is generally less realistic than with recorded speech elements.

MaxMBROLA [4] is a concatenative synthesis system supporting 30 languages with 70 voice databases. It is a real-time singing synthesizer based on the MBROLA speech synthesizer. It uses the standard MBROLA acoustic database, which consists of diphones, and conveys singing voice phenomena by modifying the voice signal.

Another concatenative singing voice synthesis system, Flinger [5], supports the English language. The singer’s database of Flinger includes 500 segments of consonant-vowel-consonant (CVC) structure: 250 on a low pitch and 250 on a high pitch, amounting to about 10 min of singing voice signal. The units are represented using the Harmonic Plus Noise model. The system supports the following singing voice effects: vibrato, vocal effort, and variation of spectral tilt with loudness (a crescendo of the voice is accompanied by a leveling of the usual downward tilt of the source spectrum).

Vocaloid [6] is currently considered the best singing voice synthesizer for popular music. It supports English and Japanese. It is a corpus-based system that allows pitch and duration modification, with signal generation based on a sinusoidal model. The singer’s database consists of natural speech segments. It should contain all the diphones (CV, VC, and VV pairs for English, where C is a consonant and V is a vowel) and can contain polyphones as well. The size of the singer’s database is 2000 units per pitch.

3 Singing Voice Database Content

The main goal of creating the database was to represent different singing voice phenomena rather than a full set of phonemes for a particular language. Therefore, the database a priori cannot be used for full-fledged singing voice synthesis (where the input is a musical score with lyrics), but rather for simplified singing voice synthesis, where the input is a musical score only.

The Singing Voice Database (SVDB) includes two parts: (1) Singing musical scale recordings and (2) Singing song recordings. The first part includes:

  • 1.1. The scale (musical notes) performed using “ah” vowel (“ah-ah-ah” recordings)

  • 1.2. The transitions between notes performed using “ah” vowel

  • 1.3. The scale (musical notes) performed using “la” syllable (“la-la-la” recordings)

  • 1.4. The transitions between notes performed using “la” syllable.

The second part includes just the song “Twinkle, twinkle, little star” [7].

Both parts contain plain recordings and recordings with special singing expressions described in Table 1. The database contains vocal recordings and also so-called “glottal” recordings, which are made by placing a second microphone on the singer’s neck near the glottis.

Table 1. Singing expressions.

4 Singing Voice Database Recording

All the recordings were performed in a studio by professional singers. For both the musical scale and the song one female and one male voice were recorded. For the musical scale the voices of Bonnie Lander [8] and Philip Larson [9] were recorded. For the song the voices of Grammy Award winner Susan Narucki [10] and Philip Larson were recorded. The singers’ voice characteristics are given in Table 2.

Table 2. Singers’ voice characteristics.

The vocal and glottal recordings were made simultaneously: an air microphone was used for the vocal recordings and a contact microphone for the glottal recordings.

The recordings are in WAVE PCM format with the following characteristics: 44100 Hz; 16 bit; 1 channel (mono).
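
For illustration, a recording with these characteristics can be read into a normalized floating-point array using only the Python standard library and NumPy. This is a minimal sketch; the file name is hypothetical.

```python
import wave
import numpy as np

# Load one of the database recordings (the file name is hypothetical).
with wave.open("scale_ah_male.wav", "rb") as wf:
    assert wf.getframerate() == 44100   # Hz, per the database specification
    assert wf.getsampwidth() == 2       # 16-bit PCM samples
    assert wf.getnchannels() == 1       # mono
    raw = wf.readframes(wf.getnframes())

# Convert 16-bit PCM bytes to floats in [-1.0, 1.0).
signal = np.frombuffer(raw, dtype=np.int16).astype(np.float64) / 32768.0
```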

5 Singing Voice Database Processing

The musical scale recordings were processed in the following way:

  • pauses between recordings (unvoiced fragments) were automatically identified and marked,

  • pitch annotation of voiced fragments was made.

For voiced/unvoiced fragment identification, the following algorithm was used:

  1. For each 5 ms audio frame with a step of 1 ms:

     1.1. The zero-crossing rate was calculated using the formula:

$$ Z_{n} = \frac{1}{2N}\sum\nolimits_{m = 0}^{N - 1} \left| \operatorname{sgn}\left[ x\left( n - m + 1 \right) \right] - \operatorname{sgn}\left[ x\left( n - m \right) \right] \right| $$
(1)

     where N is the frame size, x(n) is the signal at the n-th sample, and

$$ \operatorname{sgn}\left[ x\left( n \right) \right] = \begin{cases} 1, & x\left( n \right) \ge 0 \\ -1, & x\left( n \right) < 0 \end{cases} $$

     1.2. The energy was calculated as a root-mean-square level:

$$ E_{n} = \sqrt{ \frac{1}{N}\sum\nolimits_{m = 0}^{N - 1} \left| x\left( n - m \right) \right|^{2} } $$
(2)

     A Hamming window was applied to each frame before calculating the energy.

  2. To smooth the results, a running median with window size 7 was applied to both the energy and the zero-crossing rate.

  3. A frame is considered to be voiced if

$$ Z_{n} < Z_{th} \ \text{and} \ E_{n} > E_{th} $$
(3)

where Zth is the zero-crossing threshold and Eth is the energy threshold. The threshold values were chosen experimentally: Zth = 40 and Eth = 0.06.
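
A minimal Python sketch of this voiced/unvoiced detector is given below. It assumes a mono signal normalized to [-1, 1); since the normalized quantity of Eq. (1) never exceeds 1, the sketch interprets Zth = 40 as a raw count of zero crossings per frame, which is an assumption rather than something stated in the text.

```python
import numpy as np
from scipy.signal import medfilt

def voiced_unvoiced(signal, fs=44100, z_th=40, e_th=0.06):
    """Label each 1 ms step of `signal` as voiced/unvoiced (Eqs. 1-3).

    `signal` is assumed to be a mono float array in [-1, 1). The
    thresholds are the experimentally chosen values from the text.
    """
    n = fs * 5 // 1000          # 5 ms frame -> N = 220 samples at 44.1 kHz
    step = fs // 1000           # 1 ms step
    window = np.hamming(n)      # Hamming window for the energy (Eq. 2)

    zcr, energy = [], []
    for start in range(0, len(signal) - n + 1, step):
        frame = signal[start:start + n]
        signs = np.where(frame >= 0, 1, -1)
        # Zero-crossing count per frame; Eq. (1) equals this divided by N.
        # We keep the raw count so that the threshold Zth = 40 applies.
        zcr.append(np.count_nonzero(np.diff(signs)))
        # Eq. (2): RMS energy of the Hamming-windowed frame.
        energy.append(np.sqrt(np.mean((frame * window) ** 2)))

    # Step 2: smooth both contours with a 7-point running median.
    zcr = medfilt(np.asarray(zcr, dtype=float), 7)
    energy = medfilt(np.asarray(energy), 7)

    # Eq. (3): a frame is voiced if ZCR is low and energy is high.
    return (zcr < z_th) & (energy > e_th)
```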

The pitch annotation algorithm was based on the fact that the recordings contain consecutively sung notes. For example, the sequence of notes C4, D4, E4, F4 corresponds to the fundamental frequencies 261.63 Hz, 293.66 Hz, 329.63 Hz, and 349.23 Hz. This means that the length of the pitch period changes gradually. For the automatic pitch annotation software, the initial fundamental frequency (F0) was specified manually. The software then finds the nearest zero-crossing point in the singing voice signal and marks it as a pitch period border, taking into account the length of the previous pitch period and the voiced/unvoiced parts of the signal. The results were manually verified and corrected when needed.
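
This boundary-tracking idea can be sketched as follows. The starting sample, the positive-going zero-crossing convention, and the 20% search window around the predicted border are assumptions of the sketch, and the actual software additionally consults the voiced/unvoiced segmentation described above.

```python
import numpy as np

def mark_pitch_periods(signal, fs, f0_init, search=0.2):
    """Mark pitch period borders at zero crossings, tracking the
    gradually changing period length. A sketch only: `search` (the
    fraction of a period searched around the predicted border) and the
    positive-going crossing convention are not specified in the text."""
    period = fs / f0_init            # expected period length in samples
    marks = [0]                      # assume a period border at sample 0
    while True:
        center = marks[-1] + period  # predicted next border
        radius = max(1, int(period * search))
        lo, hi = int(center) - radius, int(center) + radius + 1
        if hi >= len(signal):
            break
        seg = signal[lo:hi]
        # Positive-going zero crossings inside the search window.
        idx = np.where((seg[:-1] < 0) & (seg[1:] >= 0))[0] + lo
        if idx.size == 0:
            break                    # no crossing: likely an unvoiced part
        nearest = int(idx[np.argmin(np.abs(idx - center))])
        period = nearest - marks[-1] # track the changing period length
        marks.append(nearest)
    return marks
```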

For the second part of the recordings – the song recordings – phoneme boundaries were found semi-automatically and all the vowel phonemes were annotated.

All the annotations were made in the TIMIT database file format [11]. For the phonetic transcription, the ARPABET code [12] was used. The phonetic transcription of the whole song and all the vowel phonemes marked in the recordings are presented in Table 3.

Table 3. The lyrics and phonetic transcription of the song “Twinkle, twinkle, little star”.
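
A TIMIT-style annotation file stores one segment per line as start and end sample numbers followed by a label, so reading the annotations is straightforward. A minimal sketch follows; the file name and the vowel set in the usage example are hypothetical.

```python
def read_timit_labels(path):
    """Read a TIMIT-style annotation file: one segment per line,
    written as 'start_sample end_sample label'."""
    segments = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines
            start, end, label = line.split(maxsplit=2)
            segments.append((int(start), int(end), label.strip()))
    return segments

# Hypothetical usage: collect the annotated vowel segments of a song file.
# segments = read_timit_labels("twinkle_female.phn")  # file name assumed
# vowels = [s for s in segments
#           if s[2] in {"ih", "ah", "aa", "er"}]      # example ARPABET vowels
```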

The resulting singing voice database has the following characteristics:

  • Part 1—“Ah-ah” and “La-la” recordings of a male and a female voice. The overall length of the male voice recordings is 23 min; the female voice recordings total 33 min.

  • Part 2—song recordings of the male and female voices. The overall length of the male voice recordings is 13 min; the female voice recordings total 15 min.

6 Conclusions

The singing voice database described here is publicly available from [13] and [14]. The database was first released in 2012 and has become quite popular among research groups in Europe and America.

The advantage of the created database is that it includes not only so-called “plain” singing, but also singing with different expressions. It contains both vocal and glottal recordings made simultaneously, and it is partly annotated at the pitch and phoneme levels. All these characteristics make it possible to use the database for different types of research, as well as for simplified singing voice synthesis.

Indeed, this database can be used to research different singing voice effects, including, but not limited to:

  • interconnection of vocal and glottal singing voice signals,

  • acoustic phenomena which take place in singing voice,

  • different acoustic phenomena and effects of different expressions in singing voice, and

  • comparison of singing voice phenomena and acoustic effects for different singers.

The first part of the database can be used for singing voice synthesis as well. However, because it includes just “Ah-ah” and “La-la” sounds, it cannot be used for full-fledged singing voice synthesis, as mentioned before. But it can be successfully used for singing voice synthesis where the input is just musical notes (without lyrics).