The ability to process salient emotional and social cues is critical for adaptive behavior. A failure to process expressions of emotion adequately can have important negative and long-term effects on social behavior and can be a risk factor for adaptation problems, including aggressive and antisocial behavior (Herba & Phillips, 2004). The majority of studies on emotion processing have focused on facial expressions of emotion (e.g., Pollak & Sinha, 2002; Tottenham et al., 2009). There is less research on vocal expressions of emotion, notably because of the difficulty in obtaining naturalistic recordings of vocal expressions of specific emotions (Scherer, Banse, Wallbott, & Goldbeck, 1991). Still, vocal cues play an important role in the expression of emotions. By “vocal,” we refer to “everything that remains present in a spoken message after lexical and syntactic information has been removed” (van Bezooijen, 1984, p. 1). A growing number of studies conducted in the past decade have indicated that humans, across languages and cultures, can infer emotion from vocal expression alone because of differential acoustic patterns (e.g., Banse & Scherer, 1996; Bänziger, Mortillaro, & Scherer, 2012; Castro & Lima, 2010; Juslin & Laukka, 2003; Liu & Pell, 2012; Livingstone & Russo, 2018; Pell, Paulmann, Dara, Alasseri, & Kotz, 2009; Sauter, Eisner, Ekman, & Scott, 2010; Scherer, Banse, & Wallbott, 2001).

A number of emotion corpora have been produced (see Scherer, Clarke-Polner, & Mortillaro, 2011; Ververidis & Kotropoulos, 2006, for reviews). They all have their particular features and are composed of diverse vocal stimuli. Table 1 presents a sample of data collections of vocal expressions of emotion.

Table 1 Sample of data collections of vocal expressions of emotion

We developed and validated a set of pseudo-words based on the phonology and pronunciation rules of North American English, which we aim to make available to the research and clinical communities. The corpus, named the Hoosier Vocal Emotions Corpus (HVEC), has several unique characteristics. First, it focuses on disyllabic pseudo-words, rather than meaningful words or sentences, to remove semantic content and allow speech prosody to become the central cue for emotion processing (Wendt et al., 2003; Wendt & Scheich, 2002). To our knowledge, only one other corpus (the Magdeburger Prosodie Korpus, a set of stimuli respecting the phonotactic and phonetic rules of the German language) includes isolated pseudo-words (Wendt et al., 2003; Wendt & Scheich, 2002). Our corpus’s main features are based on this German corpus. Other corpora of vocal emotions contain pseudo-sentences (e.g., Castro & Lima, 2010; Liu & Pell, 2012). However, experimental paradigms can require shorter stimuli, which would be difficult to extract manually from sentences and validate separately. In addition, Rigoulot et al. (2013) demonstrated in a gating-paradigm study that stimulus length matters for the time course of emotion recognition, and that full sentences are recognized much more easily than truncated ones. Other corpora use affect bursts (e.g., “ah”) or emotional sounds such as screams or laughter (e.g., Belin et al., 2008; Parsons et al., 2014). Although such stimuli are highly effective at conveying specific emotions, they are not necessarily suitable for experimental paradigms requiring controlled stimuli of medium or normal emotional intensity.

The Hoosier Vocal Emotions Corpus includes 73 controlled audio pseudo-words, uttered twice apiece by two actresses in five different positive or negative emotions (i.e., happiness, sadness, fear, anger, and disgust) and in a neutral tone, yielding 1,763 stimuli (some of the stimuli were pronounced more than twice). We selected the emotions on the basis of the basic emotions identified by Ekman (1992), except for surprise, because this emotion can have any valence (it can be neutral, positive, or negative). In addition, surprise utterances can be difficult to simulate experimentally (Pell et al., 2009). Although concerns have been raised about the use of acted rather than natural stimuli (Bachorowski & Owren, 2008), there are also arguments suggesting that actors can produce realistic portrayals and valid instances of vocal expressions of emotion (Ververidis & Kotropoulos, 2006). One important argument is that much of our verbal communication is subject to sociocultural censure and involves making impressions on others (Bachorowski & Owren, 2008; Banse & Scherer, 1996). Therefore, having people utter an emotion as if they were experiencing it may not be significantly different from a real-life communicative situation. Two female voices were preferred over one male and one female voice, mainly to ensure comparability and homogeneity in acoustic dimensions such as pitch range, and to facilitate their use in experimental paradigms requiring tight control of the acoustic parameters of stimuli, such as event-related potential (ERP) studies. In this article, we describe the structure of the Hoosier Vocal Emotions Corpus, as well as the validation of the pseudo-words in terms of the emotion they portray. We also discuss potential applications of this set of stimuli.

Method

Creation of the stimuli

The stimulus set is composed of pseudo-words based on real English words. These pseudo-words were created by selecting common English disyllabic words using the COBUILD frequency information (per million) from the CELEX English Wordforms database (Baayen, Piepenbrock, & Gulikers, 1995), and manipulating the order of segments within the word (see Wendt & Scheich, 2002, or Castro & Lima, 2010, for a similar procedure). For example, the pseudo-word “elby” was constructed from the noun belly. As a result, there is no clear phonetic relationship between the pseudo-words and their originals, but they are matched in terms of number of syllables and phonemes. Care was taken to ensure that the pseudo-words were phonotactically legal—that is, that the sequences of phonemes were permitted and easily pronounceable in English. Similarly, slight phonetic adjustments were made to comply with English pronunciation rules. For example, the pseudo-word “domner,” based on modern, did not retain the flapped /d/ found in the North American English pronunciation of modern, since the flap is not found in word-initial position in English. Pseudo-words that were too clearly reminiscent of their original or of other real words were excluded. A final list of 73 pseudo-words was generated (see Table 2). Stress always fell on the first syllable, but the vowel in the second syllable was not always fully reduced (indicated by the International Phonetic Alphabet [IPA] symbols in Table 2, where only “schwa” [ə] represents a reduced vowel). The transcriptions provided in Table 2 closely reflect the actual pronunciation of most of the stimuli by both actresses. Since each actress pronounced a given pseudo-word 12 times (2 × 6 emotions), there are essentially 24 pronunciations of the same pseudo-word, displaying some variation from one token to the next. The transcription reflects the most common pronunciation of the stimuli, and there may be some variation across specific stimuli, especially in the vowels. Table 2 is provided to give researchers further guidance about possible variations in pronunciation for the same pseudo-word, but we encourage researchers and clinicians who need exact control of sound properties to check each stimulus they plan to use.
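
For readers who wish to replicate the word-selection step, the following minimal Python sketch illustrates how frequent disyllabic candidates could be pulled from a CELEX-style frequency export. The file name, column names (“Word,” “SylCnt,” “CobMln”), and frequency cutoff are illustrative assumptions, not the authors’ actual scripts or the exact CELEX field names; the subsequent segment reordering and phonotactic checks were done by hand.

import pandas as pd

# Hypothetical CELEX export; the column names are placeholders, not real CELEX fields.
celex = pd.read_csv("celex_english_wordforms.csv")

# Keep common disyllabic word forms (the per-million frequency threshold is illustrative).
candidates = celex[(celex["SylCnt"] == 2) & (celex["CobMln"] >= 50)]
candidates = candidates.sort_values("CobMln", ascending=False)

print(candidates["Word"].head(20).tolist())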

Table 2 List of the 73 pseudo-words included in the corpus, in the Roman alphabet and in IPA transcription

Elicitation and recording procedures

Two actresses were recruited to record the 73 pseudo-words in a neutral tone as well as in five different modal emotions: happiness, sadness, fear, anger, and disgust. Female voices were recorded to serve as the basis of another experiment (i.e., an electroencephalography [EEG] paradigm involving young children; Hoyniak et al., 2018). Both actresses were native speakers of Midwestern United States English (North Midland dialect region; Clopper & Pisoni, 2004) and had lived exclusively in that region prior to the recording. They reported no fluency in any language other than English and had not lived abroad. They were students in the Department of Theatre and Drama at a large Midwestern higher education institution (Indiana University, Bloomington, IN) and were 18 and 20 years old, respectively, at the time of the recordings. Both actresses were paid and gave consent to share the recordings in a publicly accessible database.

Each actress (henceforth, A.G. and K.M.) was recorded individually in a single session of approximately 1.5 to 2 h. The experimenter first briefly explained the general procedures to each actress, who was also given time to familiarize herself with the list of stimuli. Pronunciation of the pseudo-words was clarified as needed. The different emotions were discussed and explained. The stimuli were elicited using a short carrier sentence containing the pseudo-word: ‘it starts like /word/, I say /pseudo-word/, I say /pseudo-word/ again’ (see Table 3). This was done to help maintain consistent pronunciation of the pseudo-words and to enhance fluent delivery and more natural-sounding speech. In addition, this form of elicitation was chosen to enable a similar delivery context for each pseudo-word across emotions and ensure high comparability. Each pseudo-word was thus pronounced at least twice (twice per carrier sentence). For each actress, at least 146 stimuli were pronounced for each emotion, yielding a total of at least 876 stimuli per actress. However, some stimuli were pronounced more than twice, when an actress chose to reattempt the emotion portrayal for a given carrier sentence, resulting in a total of 876 pseudo-words for A.G. and 887 pseudo-words for K.M., for a grand total of 1,763 audio files. The stimuli are overall similar in terms of duration (M = 613 ms, Median = 608 ms, SD = 132 ms) and intensity (M = 62.29 dB, Median = 62.23 dB, SD = 3.849 dB).

Table 3 Example of the materials used to elicit the pseudo-words for each emotion

The actresses were allowed to choose the order in which they recorded the emotions. They were then seated in a recording booth, wearing Sennheiser HD515 Dynamic Stereo headphones, and before recording each set they were shown a short presentation of pictures and auditory examples of (non-English) pseudo-words spoken in the corresponding emotion (Wendt & Scheich, 2002). The pictures depicted situations exemplifying the specific emotion to be uttered. For example, various clip art pictures of angry individuals, arguing friends, and knitted eyebrows were shown to illustrate anger and to establish a general mood for that emotion. The experimenter demonstrated a few items in their carrier sentences (without modeling a particular emotion) to help with pronunciation of the stimuli (fluency) and overall rhythm. The actresses were also encouraged to imagine situations or scenarios fitting the emotion to be expressed. They were given as much time as they needed to “get into the character” of the emotion before proceeding with the recordings. The experimenter also instructed the actresses not to exaggerate their expressions of the emotions, but to achieve a “normal” rather than a “strong” level of emotional intensity (see Livingstone & Russo, 2018).

The stimuli were recorded in a noise-isolated recording booth, at a sampling rate of 44,100 Hz with 16-bit resolution on a mono channel, using a Sennheiser e835 dynamic cardioid microphone and an Edirol UA25 USB stereo audio interface. The distance and orientation of the actresses with regard to the microphone were held as constant as possible. Each stimulus (pseudo-word) was then manually cut from its sentence context and saved separately in .wav format for presentation in the subsequent evaluation procedures.
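
As an illustration of this excision step, the sketch below uses the Python soundfile package to cut a pseudo-word out of a carrier-sentence recording, assuming the word boundaries have already been annotated by hand (e.g., in a Praat TextGrid). The file names and time stamps are hypothetical, and this is not the exact procedure used by the authors, who performed the cutting manually.

import soundfile as sf

# Hypothetical carrier-sentence recording (44.1 kHz, 16-bit, mono).
audio, sr = sf.read("AG_anger_carrier_014.wav")

# Hand-annotated boundaries of the pseudo-word, in seconds (illustrative values).
start, end = 1.82, 2.43
pseudo_word = audio[int(start * sr):int(end * sr)]

# Save the excised pseudo-word as a separate 16-bit .wav file.
sf.write("AG_anger_domner_1.wav", pseudo_word, sr, subtype="PCM_16")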

We conducted a validation study in which approximately 25 participants rated each sound file of the Hoosier Vocal Emotions Corpus, to estimate to what extent each recorded stimulus represents an acceptable rendition of the intended emotion. We included stimuli from both actresses in the corpus validation, that is, a total of 1,763 audio files. Given the large number of audio files, the time required for a single listener to evaluate all of them would have been prohibitively long. We therefore divided the files into four stimulus lists, which were presented to listeners for evaluation. All emotions were equally balanced in each list. However, we decided against mixing the two voices in each list (see Castro & Lima, 2010, for a similar design). Each list contained stimuli from only one speaker (Lists 1 and 2: A.G.; Lists 3 and 4: K.M.). This was done to reduce comparisons between the voices and to encourage reliance on the actual acoustic properties of the stimuli. An additional consideration was the cognitive load of the task, which is demanding for participants. Each participant rated only one list. The dataset accompanying the corpus contains ratings for each audio file from about 25 persons (see below for the method details). All procedures were approved by the Indiana University Institutional Review Board.

Validation of the stimuli

Procedure

To validate the stimuli of the corpus, we opted for a forced choice identification task similar to the one used by van Bezooijen (1984) or Castro and Lima (2010). The stimuli were presented to listeners via headphones, using the Praat software (version 5.4.04; Boersma & Weenink, 2014) on computers running under Windows 7. Participants were tested individually and were seated at a computer station in a partitioned computer lab, wearing high-quality Sanako over-the-ear headphones at a self-chosen comfortable listening level. Their task was to listen to each sound file and identify what emotion they thought the speaker intended to convey. They were asked to choose one out of six possible emotions and indicate their choice by clicking on the correspondingly labeled button on the screen. The labels were “neutral,” “happy,” “sad,” “fear,” “angry,” and “disgust.” There was no “other/none of the above” option (Livingstone & Russo, 2018). Participants were also asked to choose how confident they were in their choice by clicking on a number, on a scale ranging from 1 (not sure) to 5 (very sure). The instructions were displayed on the screen as follows:

This is a judgment experiment about how actors convey emotions. You will hear an actress say non-words and your task is to choose what emotion you think it conveys. (Some non-words might be repeated a few times). Please don’t spend too much time on each non-word. Try to do it using your intuition.

In addition, we ask that you indicate how confident you are with your choice on a scale of 1 (not sure) to 5 (very sure). There are several breaks.

If you have questions, please ask now.

The buttons appeared as rectangles on a single line in the middle of the screen, and their order was randomly varied across lists (but kept constant for any given participant) to avoid preference effects. The task was not timed, and listeners could replay the sound up to eight times by clicking on a repeat button (Fig. 1).

Fig. 1

Screenshots of the Praat script interface for the recognition task. The top panel shows the first screen in a trial, where the emotion labels are highlighted (clickable). The bottom panel shows the second screen in a trial, with the confidence scale now also highlighted. The respondent’s choices appear highlighted in red, and a next button is displayed for participants to move to the next trial. The task was self-paced. Up to eight replays were allowed

The presentation order of the sound files was randomized for each participant, and the script implemented a break after every 50 stimuli. No stimulus file was repeated. The average duration of the identification task was about 45 min. As explained above, the sound files were divided into four lists to keep the duration manageable for a single participant. Each of the four lists contained roughly the same number of stimuli: Lists 1 and 2 (A.G.) each contained 438 sound files, List 3 contained 443 sound files, and List 4 contained 444 sound files (K.M.). Participants were randomly assigned to one list upon arrival in the testing room. All participants also filled out a sociodemographic questionnaire (notably to assess their age, sex, and languages spoken) administered through the Qualtrics survey software.
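
The presentation logic can be summarized by the following conceptual Python sketch (the task itself was run as a Praat script, whose code is not reproduced here). The function names are placeholders for the trial routine and the break screen; only the per-participant randomization and the break after every 50 stimuli follow the description above.

import random

def play_and_collect_response(wav_file):
    # Placeholder: in the actual Praat script, the sound was played and the
    # emotion choice plus the confidence rating were recorded.
    print(f"presenting {wav_file}")

def show_break_screen():
    # Placeholder for the self-paced break screen.
    print("--- break ---")

def present_list(sound_files, block_size=50):
    trials = list(sound_files)
    random.shuffle(trials)                      # fresh random order for each participant
    for i, wav_file in enumerate(trials, start=1):
        play_and_collect_response(wav_file)
        if i % block_size == 0 and i < len(trials):
            show_break_screen()                 # break after every 50 stimuli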

Participants

In all, 102 participants were tested between February 2016 and December 2016. Six participants were excluded for various reasons (not being native speakers of English or not having grown up in the United States, reporting multiple neurocognitive issues, providing an incomplete dataset, technical failure, or taking more than twice the average time to complete the task). In total, data from 96 participants (67% female), who were between 18 and 38 years old (M = 21.09, SD = 3.21), were included in the analysis (List 1, N = 24; List 2, N = 25; List 3, N = 24; List 4, N = 23). Most of the participants were college students, and they were predominantly Caucasian. Only one participant reported not knowing any language other than English. Twelve of the participants reported growing up bilingually, using English and another language. About half of the participants (53.1%) reported knowledge of Spanish, 21.9% of French, and 6.3% of German, with 13 other languages mentioned by fewer than 4% of the participants (e.g., Japanese, 3.1%). A total of 34.4% of the participants reported knowing two languages besides English, and 12.5% reported knowing three languages besides English. Two of the participants reported knowing four or more languages besides English. Aside from the early bilinguals, three participants reported high proficiency in other languages learned after the first. None reported having any kind of uncorrected speech or hearing disorder. We recruited the participants using flyers posted in public areas (e.g., various departments at Indiana University) and by word of mouth. Participants were compensated for their time.

Results

To ascertain the validity of the corpus, we used two dependent variables: emotion identification accuracy rates and confidence scores (how confident the participants were in their choices). Response times (RTs) were collected on each trial but were not analyzed as a dependent variable, given that the task was not speeded. Because there were six choice options on each trial, random selection would yield an overall accuracy of 16.7%. The data were submitted to a chi-square analysis to estimate whether or not the participants were equally likely to choose among the six possibilities for a given stimulus. Table 4 provides the overall confusion matrix across both speakers and reveals that emotion portrayals were generally recognized accurately. Figure 2 shows the overall median accuracy in emotion identification by the 96 participants, separated by speaker. Random performance level (~ 16%) is indicated by the dotted line.
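
To make the chance baseline and the per-stimulus test concrete, the following Python sketch computes the 16.7% baseline and runs a goodness-of-fit chi-square against a uniform distribution over the six response options. The response counts are made up for 25 hypothetical raters and are not actual corpus data.

from scipy.stats import chisquare

labels = ["neutral", "happy", "sad", "fear", "angry", "disgust"]
chance = 1 / len(labels)                   # six options, so chance is about 16.7%

# Hypothetical response counts for one stimulus rated by 25 listeners.
observed = [3, 2, 14, 3, 2, 1]
stat, p = chisquare(observed)              # expected counts default to a uniform split
print(f"chance = {chance:.3f}, chi-square = {stat:.2f}, p = {p:.4f}")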

Table 4 Classification counts of vocal emotion portrayals by the participants’ responses and the overall proportions of accurate responses (%) within each emotion, across both speakers
Fig. 2

Overall median accuracy of emotion identifications by the 96 participants, separated by speaker. The random performance level (~ 16%) is indicated by the dotted line

Figure 2 suggests that participants were able to identify each stimulus’s intended emotion above chance. The mean recognition accuracy was 45%. Sadness was recognized most accurately (M = 59%), followed by neutral (M = 51%), fear (M = 50%), disgust (M = 43%), and anger (M = 38%). The emotion that was recognized least accurately was happiness (M = 31%). All emotions were recognized better than chance for both stimulus sets, except for happiness for the K.M. stimuli, which was misidentified as neutral more often than it was identified as happiness (see Table 6 below).

A global chi-square analysis on the chosen response categories over all data points (across emotions and speakers) was significant [χ²(25) = 29,429.29, p < .001, Cramér’s V = .37]. This indicates that respondents did not choose randomly among the six options. Before evaluating whether this pattern holds for each emotion separately, we first examined whether there was a difference in accuracy between speakers, as is suggested in Fig. 2.
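
For readers who want to reproduce this kind of global test outside SPSS, the sketch below runs a chi-square on a 6 × 6 emotion-by-response contingency table and derives Cramér’s V from it. The table is filled with random placeholder counts rather than the study’s data.

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
table = rng.integers(50, 500, size=(6, 6))         # placeholder counts: 6 emotions x 6 response options

chi2, p, dof, expected = chi2_contingency(table)   # dof = (6 - 1) * (6 - 1) = 25
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(chi2, dof, p, cramers_v)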

A one-way analysis of variance (ANOVA) comparing accuracy for each speaker (K.M., A.G.) revealed that mean recognition accuracy was significantly higher for A.G. (M = 48%, 95% CI = 45–50) than for K.M. (M = 43%, 95% CI = 40–45), F(1, 574) = 8.26, p = .004. This significant effect of speaker indicates that raters were overall slightly more accurate at recognizing emotions portrayed by one speaker (A.G.) than by the other (K.M.). However, such differences are to be expected among voice actors, and this result is unlikely to reflect an inherent difference between our listener groups. If one group of listeners had been systematically less attentive or less accurate during the task, we would expect this difference to hold across the emotions for a given speaker. To verify this, a mixed-effects model with speaker and emotion as fixed factors (and participant as a random factor) was conducted in SPSS 25. Multiple comparisons were adjusted with the Sidak correction. The Type III tests of fixed effects showed a main effect of speaker [F(1, 94) = 8.6, p = .004], a main effect of emotion [F(5, 470) = 37.7, p < .001], and, crucially, a significant interaction between the two factors [F(5, 470) = 33.4, p < .001]. The interaction and pairwise comparisons revealed that for all emotions except disgust and neutral, A.G.’s portrayals were recognized significantly more accurately than K.M.’s; conversely, K.M.’s portrayals of disgust and neutral were recognized significantly more accurately than A.G.’s. The presence of this interaction makes it unlikely that the K.M. listeners were simply less accurate overall than the A.G. listeners (in that case, one would have expected no interaction).
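
The model was fitted in SPSS 25; a comparable (though not identical) specification can be written in Python with statsmodels, as sketched below. The input is assumed to be a long-format table with one row per participant and emotion, and the file and column names are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data with columns: participant, speaker, emotion, accuracy.
df = pd.read_csv("accuracy_by_participant.csv")

# Speaker and emotion as fixed effects, random intercept for participant.
model = smf.mixedlm("accuracy ~ speaker * emotion", data=df, groups=df["participant"])
result = model.fit()
print(result.summary())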

Tables 5 and 6 provide the confusion matrices obtained for our stimulus set (emotion portrayal by participants’ choices; n = 42,306 data points), separated by speaker.

Table 5 Confusion matrix for the A.G. stimuli, with chi-square goodness-of-fit test per emotion
Table 6 Confusion matrix for the K.M. stimuli, with chi-square goodness-of-fit test per emotion

Given the significant effect of speaker and the speaker-by-emotion interaction, we further conducted a series of chi-square analyses (nonparametric goodness-of-fit tests) in SPSS 25 for each speaker and emotion separately, which confirmed the global analysis. The results of the tests for each emotion and each speaker are provided in Tables 5 and 6. They show that the tests were significant for both speakers and all emotions, indicating that listeners were not responding randomly.

Examination of the patterns of misidentifications in Tables 5 and 6 revealed the following tendencies. For A.G., all emotions except fear were most often misidentified as neutral, which represented the second-highest proportion of choices in these cases. Fear items were most often misidentified as happiness. However, even though, for instance, happiness was misinterpreted as neutral in 25% of the cases for A.G., the reverse was not true: Neutral items were misinterpreted as happiness in only 7% of cases, and were more commonly misinterpreted as anger or sadness, each in roughly 15% of cases (see Table 5). The error patterns for the K.M. stimuli stand out in that happiness stimuli were most often categorized as neutral, which was the dominant, modal response: happiness was chosen in 25% of cases and neutral in 32%. For the other emotions, unlike for the A.G. stimuli, neutral was the second choice after the correct identification only for anger and sadness. Disgust was misidentified as anger in 21% of cases, more often than as neutral, and fear was confused with sadness in 27% of cases (see Table 6).

This overall high proportion of neutral choices is possibly due to the fact that the stimuli were created at a medium/normal intensity level, without emotional exaggeration, rendering the identification task potentially more difficult. To help researchers evaluate how ambiguous a given recording is, we also provide the full confusion matrix for each stimulus in the database (see Bänziger et al., 2012, supplemental materials, for a similar approach).

Some items were identified at very high accuracy rates by all participants who rated them, whereas others were almost never identified correctly. Figure 3 shows the accuracy obtained for each stimulus (each sound file in the corpus is represented by one dot). The boxplots in the top and bottom panels of the figure show the distribution and median accuracy for each emotion (top, A.G.; bottom, K.M.). The figure reveals that a proportion of items (particularly for happiness) fell below the random performance level (i.e., 16.7%), suggesting that these particular stimuli are ambiguous and not ideal representations of the intended emotion, at least for the participants who rated them.

Fig. 3

Box plot with overlaid dot plots for each emotion’s identification accuracy. Each dot represents one stimulus (i.e., one sound file in the corpus). Horizontal lines represent the medians, boxes show the interquartile range (IQR) representing 50% of the cases, and whisker bars extend to 1.5 times the IQR. (Top) A.G. stimuli. (Bottom) K.M. stimuli

We also obtained confidence ratings for each rated stimulus (i.e., how confident the participants were in their choices). Figure 4 shows the correlations (Pearson’s r) between identification accuracy and the participants’ confidence ratings, separately for each emotion and each speaker (see the top of each panel). Only one relationship (neutral for A.G.) was not significant.
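
The correlational analysis can be sketched as follows in Python, assuming a stimulus-level table with one row per sound file and hypothetical column names for speaker, emotion, mean accuracy, and mean confidence.

import pandas as pd
from scipy.stats import pearsonr

stim = pd.read_csv("stimulus_level_ratings.csv")   # hypothetical per-stimulus summary table

for (speaker, emotion), grp in stim.groupby(["speaker", "emotion"]):
    r, p = pearsonr(grp["mean_accuracy"], grp["mean_confidence"])
    print(f"{speaker} / {emotion}: r = {r:.2f}, p = {p:.4f}")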

Fig. 4

Correlations between the mean identification accuracy for each stimulus and the mean confidence ratings by speaker (top panels, A.G.; bottom panels, K.M.)

Discussion

The goal of this project was to create a corpus of auditory pseudo-words uttered in different emotions. The corpus includes 73 controlled audio pseudo-words uttered by two actresses in five different emotions (i.e., happiness, sadness, fear, anger, and disgust) and in a neutral tone, yielding at least 876 stimuli per actress. In addition, the pseudo-words are based on the pronunciation rules of North American English, and they are not caricatures or exaggerations of the emotions portrayed. Each recording has been validated by native English listeners in terms of recognition accuracy of the intended emotion portrayal. Overall, the emotions were recognized at accuracy levels that were clearly higher than chance (M = 45% across emotions and speakers, against a chance level of about 16%). The recognition proportions obtained for our data were most accurate for fear, neutral, and sadness, and least accurate for happiness and disgust, consistent with previous data from other languages (Banse & Scherer, 1996; Castro & Lima, 2010; Liu & Pell, 2012; Pell et al., 2009; Scherer et al., 1991; van Bezooijen, 1984). The one exception is anger. In our stimuli, anger was recognized with surprisingly low accuracy (38%), even though it is often among the best-recognized emotions (e.g., Bänziger et al., 2012; Scherer et al., 2011; Wendt & Scheich, 2002). This was possibly due to the fact that our pseudo-words were produced at a medium/normal emotional intensity level, making them more confusable with neutral stimuli. Indeed, for both speakers (and particularly for K.M.), anger was most often confused with neutrality. The resulting accuracy in our dataset was overall similar to the levels reported in previous studies on vocal emotion (hovering in the 40%–60% range; see Scherer et al., 2011), in particular among studies that used similar stimuli (words or short sentences, such as Rigoulot et al., 2013) and a similar number of response options.

The audio stimuli were created as high-quality recordings in .wav format, which allows experimenters to run more detailed acoustic analyses in order to match stimuli for specific experimental purposes. Intensity (loudness, in decibels) and duration (in milliseconds) measurements are provided in the corpus database, but other acoustic parameters can be extracted, so that matched stimuli can be selected, for example, for the needs of an EEG study.
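
As an example of such an analysis, the sketch below extracts duration, mean intensity, and mean F0 from a single corpus file using the praat-parselmouth package. The file name is illustrative, and the parameters extracted here are only one of many possibilities.

import numpy as np
import parselmouth

snd = parselmouth.Sound("AG_anger_domner_1.wav")   # illustrative file name

duration_ms = snd.duration * 1000                  # duration in milliseconds
intensity_db = snd.to_intensity().values.mean()    # mean intensity in dB

pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
mean_f0 = np.mean(f0[f0 > 0])                      # average F0, ignoring unvoiced frames
print(duration_ms, intensity_db, mean_f0)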

The corpus is available at https://psycholinguistics.indiana.edu/hoosiervocalemotions.htm. The website provides basic information about the corpus and explains how to request access to the sound files and the database. For each item, recognition accuracy and confusion patterns, as well as speaker, filename, and a number of acoustic details, are provided in an accompanying database, in order to allow researchers to select items specifically for their needs. The list of attributes provided for each sound file in the corpus is detailed in the Appendix.
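
As a simple usage example, the following Python sketch filters such a database export to retain well-recognized anger portrayals within a narrow duration band. The file name, column names, and thresholds are placeholders; the actual attribute names are those listed in the Appendix.

import pandas as pd

db = pd.read_csv("hvec_database.csv")                  # hypothetical export of the accompanying database

selected = db[(db["emotion"] == "anger")
              & (db["accuracy"] >= 0.60)               # keep clearly recognized portrayals
              & (db["duration_ms"].between(500, 700))] # roughly matched durations

print(selected[["filename", "speaker", "accuracy"]])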

A number of methodological issues need to be considered. First, the validation of the stimuli was based on data collected in a laboratory setting using a forced choice methodology with six response alternatives. Even though this methodology is commonly used across studies, its ecological validity for real-time interactions in social situations remains limited. It is unclear to what extent these results would generalize to real-life situations outside the laboratory, or to experimental paradigms in which a given stimulus is presented only once without any available “categorization labels,” because forced choice procedures produce better performance than free-choice tests (see Bachorowski & Owren, 2008).

Second, similar considerations apply to the specific linguistic context in which an emotion is heard and the type of linguistic materials used. Identifying an emotion from a short (two-syllable) pseudo-word is likely much more difficult than identifying it from a longer, meaningful sentence (see Rigoulot et al., 2013), and is likely to lead to lower overall recognition accuracy. Similarly, medium/normal emotional intensity (as opposed to high intensity, as in affect bursts) is likely to make emotion recognition less straightforward. In sum, the identification accuracy we obtained reflects the combined influence of the forced choice methodology, the medium/normal intensity of the stimuli, the fact that they are pseudo-words presented in isolation, and the context-free format of their presentation in the recognition task.

Third, the corpus includes a limited set of emotions (i.e., happiness, sadness, fear, anger, and disgust) and a neutral tone. Other emotions, such as surprise and contempt, could have been included. We selected the emotions to be included in the corpus on the basis of the basic emotions identified by Ekman (1992) and of whether they have a clearly positive or negative valence. Surprise was therefore not included, because it can have any valence (it can be neutral, positive, or negative), and also because this emotion can be difficult to simulate in the laboratory (Pell et al., 2009). Researchers and clinicians should therefore consider this limitation when selecting this corpus for their work, as well as the fact that only one positive emotion (i.e., happiness) is included, which limits systematic analyses of valence effects and the examination of different positive emotions.

Fourth, researchers should also consider that the corpus contains pseudo-words uttered by two female speakers (i.e., it does not include male voices). Finally, because the validation of the stimuli was based on a between-subjects design (i.e., each participant rated the pseudo-words from one actress only), it is difficult to directly compare the validation results for the two speakers. Although this design, adopted for logistical reasons, could be seen as a limitation, the information we provide in the corpus should enable researchers and clinicians to make informed decisions about which stimuli to select for their work.

Potential applications of the Hoosier Vocal Emotions Corpus

The Hoosier Vocal Emotions Corpus was specifically developed for the requirements of EEG research on emotion processing. The stimuli from this corpus were first used in a study on the neural responses (using EEG techniques) to vocal emotion processing and their associations with temperamental traits and behavioral problems in young children (Hoyniak et al., 2018). The corpus has unique characteristics that are useful for experimental paradigms requiring controlled stimuli (e.g., EEG or fMRI studies)—namely, they are disyllabic pseudo-words (i.e., short stimuli without a semantic meaning) that are overall similar in terms of duration and loudness, and that represent medium/normal emotional intensity.

To the best of our knowledge, the Magdeburger Prosodie Korpus (Wendt et al., 2003; Wendt & Scheich, 2002) is the only other corpus that includes isolated disyllabic pseudo-words. However, this corpus is composed of stimuli that respect the phonotactic and phonetic rules of the German language. Although there are data suggesting that emotions can be recognized across languages and cultures, there is still an in-group advantage in the processing of emotional vocalizations (Sauter et al., 2010). We therefore developed new emotional vocalizations based on the phonology and pronunciation rules of North American English, for research and clinical work requiring English-based stimuli. The use of the corpus does not need to be limited to English speakers, however. For instance, studies of emotion or prosodic processing in monolingual or in multilingual individuals, or in nonnative English speakers, could be easily conducted using stimuli from this corpus (e.g., Dewaele, 2004; Min & Schirmer, 2011; Paulmann & Uskul, 2014).

Stimuli from the corpus could also be used to investigate emotion processing in individuals with certain temperamental or behavioral characteristics associated with difficulties in emotion recognition (e.g., individuals with psychopathic traits or alexithymia). In addition, the stimuli could be used to study the extent to which patients with aphasia, schizophrenia, or other mental disorders (e.g., depression) are able to process prosodic/vocal emotion information.

The Hoosier Vocal Emotions Corpus’s short, disyllabic pseudo-words, which are acoustically more homogeneous than longer sentences, can also be useful to researchers performing acoustic analyses. Investigations that seek to characterize the prosodic and acoustic features of different emotions would benefit from this kind of tightly controlled, non-exaggerated material, since it can help isolate the specific acoustic parameters underlying emotion recognition more precisely. In addition, the fact that our stimuli were produced with normal emotional intensity (as opposed to high intensity, as in affect bursts) introduces more ambiguity into the corpus and makes emotion recognition not only less straightforward but possibly also more ecologically valid. Ambiguous or subtle acoustic characteristics can be studied with a corpus like ours, which preserves this variability, and because we provide the full confusion matrix for each stimulus, researchers seeking to determine the acoustic parameters of various emotions will have a large range of clear, ambiguous, and misclassified stimuli to choose from. This variability and the range of stimulus uncertainty could also be very useful for the field of automatic emotion recognition. Training paradigms could first use the nonambiguous stimuli (see Brendel, Zaccarelli, Schuller, & Devillers, 2010) and progressively incorporate more subtle stimuli, ultimately leading to robust recognition scores; a sketch of this staged approach is given below.
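
The staged-training idea can be sketched as follows, using hypothetical acoustic feature columns and an off-the-shelf classifier. This illustrates the curriculum logic only and is not a worked emotion recognition system; none of the column names or thresholds come from the corpus itself.

import pandas as pd
from sklearn.svm import SVC

db = pd.read_csv("hvec_database.csv")                    # hypothetical database export
features = ["mean_f0", "intensity_db", "duration_ms"]    # placeholder acoustic features

clear = db[db["accuracy"] >= 0.70]                       # clearly recognized stimuli
ambiguous = db[db["accuracy"] < 0.70]                    # more subtle or misclassified stimuli

clf = SVC().fit(clear[features], clear["emotion"])       # stage 1: train on unambiguous items

full = pd.concat([clear, ambiguous])
clf = SVC().fit(full[features], full["emotion"])         # stage 2: refit on the full range of stimuli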

Finally, the neutral-tone stimuli can be used on their own for research applications other than emotional processing. For instance, they could be used for pseudo-word or voice recognition tasks in investigations of individual differences in auditory, phonetic, or phonological processing or learning.

Conclusion

In this article, we have presented the Hoosier Vocal Emotions Corpus, a set of controlled disyllabic pseudo-words spoken in five basic emotions and in a neutral tone. This corpus is one of the few databases of pseudo-word vocal expressions for North American English. The corpus consists of 1,763 high-quality audio recordings by two female speakers at a medium/normal emotional intensity level. The validation of the corpus with a forced choice recognition paradigm showed that the intended emotions were recognized at rates well above chance. The recognition accuracy for each item and the full confusion matrix are provided in an accompanying database, which will allow researchers to explore the full range of stimulus uncertainty. Despite the limitations discussed above, this corpus represents a valuable resource for a wide variety of researchers and clinicians.