Introduction

One defining criterion of human language is its vast expressive power as opposed to the semantically limited vocal repertoires of non-human animal species (Chomsky 1965; Jackendoff 1999; Nowak et al. 2000). The core of the generative power of human languages lies in our ability to flexibly combine acoustic units under specific rules into structurally highly complex and semantically rich utterances. This is achieved through two main syntactic principles. First, via phonological syntax, we construct meaningful words out of phonemes that are not themselves meaningful. Second, via lexical syntax, we generate compositional messages, i.e., phrases and sentences, to which each constituent word contributes its own independent meaning (e.g., Marler 1977; Jackendoff 1999; Hurford 2011). Identifying the evolutionary origins of the human syntactic capacity is one of the challenges in understanding the origins of human language, which to date remain highly disputed (Hauser et al. 2002).

Syntactic variation appears to be an evolutionarily deeply rooted principle of how information is conveyed, with compound vocalizations being notably widespread among animal species ranging from birds to mammals. Marler (1977) recognized two types of syntax, analogous to phonological and lexical syntax in human language. Phonological syntax, defined as the concatenation of sounds without independent information content and which are not used singularly, or meaningful sounds that lose their original information content when combined (Marler 1977), appears to be widespread (e.g., birds: Marler and Mitani 1988; Hylobates agilis: Mitani and Marler 1989; Pteronotus parnellii: Kanwal et al. 1994; Procavia capensis: Kershenbaum et al. 2012). In contrast, evidence for lexical syntax, through which the meaning of compound vocalizations arises from the independent meaning of each of its parts (Marler 1977), seems to be confined to primate species. These compound vocalizations occur in situations that comprise the contexts of each of their units used singularly. Therefore, these vocalizations appear to be compositional messages since their information content seems to reflect the information content of each of their units. On the one hand, such compound calls are given in situations in which the contexts of component calls are co-occurring, in that the information contents of the component calls are combined in a purely additive fashion (e.g., pant hoot and grunt combinations in Pan troglodytes: Crockford and Boesch 2005; close call combinations in Cercopithecus diana: Candiotti et al. 2012). On the other hand, lexical compound calls appear to act as contextual modifiers as they are produced in intermediary, intensified, or alleviated contexts compared to the contexts they share with their constituent calls. For instance, call combinations occur only in the contexts in which the range of contexts of the constituent calls overlaps (Cebus olivaceus: Robinson 1984); alarm call combinations are given in alarm situations that require vigilance but with reduced fear (Saguinus oedipus: Cleveland and Snowdon 1982; Cercopithecus diana: Zuberbühler 2002; Ouattara et al. 2009); and food call combinations appear to be given in food contexts with increased arousal state (Pan troglodytes: Crockford and Boesch 2005, but see Schassburger 1993 for threat call combinations in wolves functioning as intensified threat signals).

To date, research on syntactic variation in primate vocalizations has mainly focused on monkeys and specifically their predator alarm call systems (e.g., Cercopithecus diana: Zuberbühler 2002; Cercopithecus campbelli: Ouattara et al. 2009). Remarkably little research has considered syntactic structures in vocalizations of great apes (but see Crockford and Boesch 2005; Clay and Zuberbühler 2011; Lameira et al. 2013), which are, because of their close phylogenetic proximity to humans, particularly relevant to our understanding of the evolution of the syntactic capacity of humans. Furthermore, given the limited acoustic and contextual variability of alarm calls, the importance of predation pressure as a driving force in the selection toward syntactic communication in the human lineage is currently debated (e.g., Stephan and Zuberbühler 2008; Lameira et al. 2013). Studying the syntactic properties of primate close-distance vocalizations, which regulate many aspects of primate social behavior (e.g., Boinski 1993; Cheney et al. 1995; Boinski and Campbell 1996; Silk et al. 2000; Whitham et al. 2007), seems appropriate given the inherently social nature of human language. Close-distance vocalizations are acoustically highly variable (Elowson and Snowdon 1994; Sugiura 2007; Lemasson and Hausberger 2011; Bouchet et al. 2012), which has been suggested to be beneficial in the face of variable influences of numerous social factors (e.g., Snowdon 1997; Bouchet et al. 2012). A recent study suggests that such flexibility in close-range social calls can arise through combinatorial systems (Candiotti et al. 2012).

Gorillas are an excellent great ape model to study syntactic variation in close-distance vocalizations, as so-called close calls are characteristic of gorilla vocal behavior (e.g., Harcourt et al. 1993; Harcourt and Stewart 1996; Salmi et al. 2013). They form a group of intergraded vocalizations of relatively low pitch, ranging from short grunts to longer grumbles and hums (Schaller 1963; Fossey 1972; Harcourt et al. 1986, 1993; Seyfarth et al. 1994; Salmi et al. 2013). Gorillas usually give close calls as part of vocal exchanges and most frequently when individuals are in close proximity, including in potentially agonistic situations when feeding close to each other (Harcourt et al. 1986, 1993). However, close calls are also emitted during affiliative interactions when individuals are resting, during grooming, and during interactions with infants, as well as in situations of potential separation when individuals are far from others and before changes in group activity (Harcourt et al. 1986, 1993; Seyfarth et al. 1994; Salmi et al. 2013). Based on this, various functions for the different gorilla close calls have been suggested, ranging from facilitating friendly interactions or appeasement over maintenance of group cohesion to mildly agonistic signals mediating feeding competition (Schaller 1963; Fossey 1972; Harcourt et al. 1993; Seyfarth et al. 1994; Stewart and Harcourt 1994; Harcourt and Stewart 1996, 2001). Formal testing of these functional hypotheses on a detailed acoustic level will be necessary for a better understanding of the gorilla close call system. Moreover, gorillas live in a wide variety of ecological conditions that should lead to variation in social and hence also vocal behavior (Robbins 2010). However, most of what we know about gorilla vocal behavior derives from the well-studied mountain gorillas, whereas our knowledge about western gorilla vocal behavior is still limited. A recent study by Salmi et al. (2013) suggested contextual differences in the use of close calls in the two gorilla species. However, directly comparable data that allow linking acoustic to contextual variation in the two gorilla species have so far not been available.

The purpose of this study was to investigate contextual correlates of syntactic variation in close calls of mountain and western gorillas. In a previous study, Hedwig et al. (2014) demonstrated that gorilla close calls consist of five acoustic unit types: low-pitched short atonal grunts (hereafter referred to as A1) and tonal grunts (T2), longer grumbles (T4), as well as high-pitched short hums (T1) and long hums (T3). The gorillas used these unit types singularly or combined into a total of more than 150 combinations, which were produced in a flexible but non-random manner (Hedwig et al. 2014), similar to other primate species’ syntactic systems (e.g., Saguinus oedipus: Cleveland and Snowdon 1982; Cebus olivaceus: Robinson 1984; Pan troglodytes: Crockford and Boesch 2005). The most frequent calls consisted of the low-pitched unit types A1, T2, and T4, either used singularly or as pairwise combinations, which were used by all age–sex classes in both the western and mountain gorilla group investigated. These calls are equivalent to what has previously been termed single grunts, double grunts, and grumbles, which are the vocalizations most frequently emitted by gorillas (Harcourt et al. 1993; Salmi et al. 2013). Given the frequent use of these unit types and their combinations, they seem to play a particularly important role in gorilla social behavior and hence were the focus of our study.

First, we investigated whether gorilla close calls provide indications for lexical or phonological syntax. Overall, since in syntactic systems the information content of a combination should be different from the information content of unit types used alone, we expected contextual differences between unit types used alone and in combination with other unit types (e.g., Cleveland and Snowdon 1982; Robinson 1984; Crockford and Boesch 2005; Candiotti et al. 2012). If close calls represent a form of lexical syntax, we predicted them to be compositional, in the sense that the information content of a combination will reflect the information content of its constituent unit types used singularly. This would be indicated by an overlap between the contexts of a combination and each of its constituent unit types used singularly. Moreover, lexical syntax will be indicated by a correlation between the context and the variation in the sequential order of unit types within a call (e.g., “hunt chimpanzee” is different from “chimpanzee hunt,” Crockford and Boesch 2005). In turn, if close calls constitute a form of phonological syntax, the information content of a combination should not reflect the information content of its constituent unit types used singularly. This should be indicated by combinations occurring in contexts different from the context of their unit types used singularly, including combinations comprising their repetition. Compositionality (the meaning of a sentence can be derived from the meanings of its parts) constitutes the hallmark feature underlying lexical syntax in human speech (e.g., Portner 2005). This principle of compositionality is violated by human language in rare cases as, for instance, the figurative meaning of an idiomatic phrase cannot be deduced from the meaning of its parts (e.g., “to beat a dead horse”). However, assuming that such complexity evolved gradually, we would not expect the same level of sophistication in gorilla close call syntax.

Second, we investigated possible functions of gorilla close calls as suggested in previous studies (Schaller 1963; Fossey 1972; Harcourt et al. 1993; Seyfarth et al. 1994; Stewart and Harcourt 1994; Harcourt and Stewart 1996, 2001). We predicted that calls functioning to signal a caller’s location will be more likely during foraging, when individuals are on the move, and when individuals are far from other group members. Calls involved in facilitating friendly social interactions will be more likely when individuals are resting in close proximity to others, which is when affiliative social interactions are most likely, and particularly in close proximity to an infant, since close-distance calls triggered by the presence of infants have been suggested to serve as signals of benign intent in other primate species (e.g., Papio cynocephalus ursinus: Cheney et al. 1995; Macaca mulatta: Whitham et al. 2007). Moreover, calls functioning to mediate feeding competition should be given when animals are feeding in close proximity to each other. Lastly, we tested for contextual differences between the close calls of the two gorilla species (Salmi et al. 2013).

Methods

Study sites and subjects

Data collection on mountain gorillas focused on 10 adult individuals (one silverback, five females, and four blackbacks) of the habituated “Kyagurilo” group (Robbins 2008) in Bwindi Impenetrable National Park, Uganda. Due to the regulations of the Uganda Wildlife Authority, observations were restricted to approximately 4 h per day. Data collection was conducted usually in the mornings, on 312 days during a total of 12 months from October 2007 through October 2008. For western gorillas, data were collected on 5 adult individuals (one silverback, three females, and one blackback) of the “Makumba” group (Masi et al. 2009) observed at the Bai Hokou study site located in the Dzanga-Ndoki National Park (Dzanga Sangha Protected Areas), Central African Republic. Data were collected in the mornings (7:00–12:00) and/or afternoons (12:00–17:00) on 124 days over a period of 8 months from April to November 2009.

Data collection

We used focal animal sampling (Altmann 1974) during which we conducted continuous audio recordings of the focal animal’s vocal behavior. An average of 33 h of audio recordings was collected for each adult mountain gorilla individual (range 32.6–34.6) and an average of 58 h for each adult western gorilla individual (range 57–58.5). For each vocalization, we recorded the following data describing the immediate context of call emission: (1) the activity of the focal animal, which we categorized as resting (sleeping, sitting, grooming, nursing) or foraging (preparation and ingestion of food, locomotion between feeding spots); (2) whether there was a weaned or infant (non-weaned) individual within 5 m of the caller (y/n); and (3) whether the vocalization was given spontaneously or in reply to another individual’s call. A reply was defined as any vocalization given within 3 s after another individual’s call (based on Seyfarth et al. 1994).
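The reply criterion can be illustrated with a minimal Python sketch. It assumes a chronologically sorted list of (time in seconds, caller) pairs; the function name, the data layout, and the choice to treat the 3 s boundary as inclusive are illustrative assumptions, not part of the original protocol.

```python
REPLY_WINDOW_S = 3.0  # reply window in seconds, as defined in the text


def classify_calls(calls, window=REPLY_WINDOW_S):
    """Label each call as 'reply' or 'spontaneous'.

    calls: list of (time_s, caller) tuples sorted by time (hypothetical layout).
    A call counts as a reply if a different individual called within
    `window` seconds before it (boundary treated as inclusive: an assumption).
    """
    labels = []
    for i, (t, caller) in enumerate(calls):
        is_reply = any(
            other != caller and 0 < t - t_prev <= window
            for t_prev, other in calls[:i]
        )
        labels.append("reply" if is_reply else "spontaneous")
    return labels


# Hypothetical example: B's first call follows A by 2 s, so it is a reply.
example = [(0.0, "A"), (2.0, "B"), (4.0, "B"), (10.0, "A")]
print(classify_calls(example))
```

Calls by the same individual within the window are not counted as replies to itself, which is why the second call by B remains spontaneous in this example.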

Statistical analysis

The analysis was based on a subset of calls for which we demonstrated the syntactic properties of gorilla close calls (Hedwig et al. 2014). Our study focused on the most commonly observed unit types, the atonal grunt A1, the tonal grunt T2 and the grumble T4, and their most common combinations since these seem to play a particularly important role in gorilla social behavior given their frequent use (Fig. 1). These combinations were observed to be given by a minimum of 9 of the 15 sampled individuals with individual call rates ranging from 0 to 11.9 calls per hour (A1: average call rate = 0.4 (0–1.7), observed in 12 individuals; T2: 0.6 (0–5.2), 9 individuals; T4: 2.7 (0–11.9), 13 individuals; A1_A1: 1 (0–6.2), 12 individuals; T2_T2: 0.7 (0–3.5), 9 individuals; A1_T2: 0.2 (0–1.2), 11 individuals; T2_A1: 0.5 (0–2.2), 13 individuals; A1_T4: 0.9 (0–2.5), 14 individuals; T2_T4: 1.4 (0–3.6), 14 individuals). Among these, we selected the calls for which we had all necessary contextual information and a sufficient number of observations to allow for a meaningful statistical test of our hypotheses. A total of 1321 calls were considered for analysis (see Table 1 for additional information on the number of individuals and calls used for each analysis). We ran four sets of analyses. We tested for contextual differences between (1) the unit types A1, T2, and T4 used singularly; (2) A1, T2, and T4 used singularly and their respective combinations; (3) combinations comprising unit types A1 and T2 in reversed order; and (4) unit types A1 and T2 used singularly and in combinations comprising repetitions of either A1 or T2 (Fig. 2; Table 1).

Fig. 1

Spectrograms of representative examples of the unit types A1, T2, and T4 (all western gorilla adult female), as well as their combinations T2_A1 (mountain gorilla adult female) and T2_T4 (mountain gorilla blackback male). Atonal grunt A1: average duration 0.32 s (range: 0.03–2.17), average maximum peak frequency 184 Hz (74–384); tonal grunt T2: average duration 0.22 s (range: 0.03–0.63), average mean fundamental frequency 40 Hz (13–82); grumble T4: average duration 1.34 s (0.34–5.01), average mean fundamental frequency 31 Hz (11–65) (from Hedwig et al. 2014)

Fig. 2

Overview of comparisons made in this study. Different comparisons are indicated by the respective connecting lines. Gray circles indicate unit types used singularly and boxes depict their respective combinations. Sample sizes did not allow us to conduct all possible comparisons between the illustrated combinations. Thin lines indicate nonsignificant comparisons, bold lines significant comparisons (Table 1). Comparisons were carried out using generalized linear mixed models (see text for more details)

Table 1 Overview of models run to compare the different unit types and their combinations, the respective sample sizes, and the results of likelihood ratio tests comparing full models including all predictor variables with reduced models comprising only the autocorrelation term and the random effect of “caller”

We used generalized linear mixed models with binomial error distribution and a logit link function (GLMM; Baayen 2008) for the binary response variable coding one of the combinations to be compared as “1” and the other one as “0.” Models were fitted using the function “glmer” provided by the lme4 package (Bates et al. 2013) in the statistical software environment R, version 3.0.2 (R Core Team 2013). We included context variables as fixed effects and “caller” as a random effect in the models. To differentiate between competitive situations (foraging in close proximity to another individual) and socially relaxed, potentially affiliative situations (resting in close proximity to another individual), we included an interaction between “individual within 5 m” and “activity.” We expanded this by specifically including a three-way interaction between “individual within 5 m,” “activity,” and “group” to test for contextual differences between the two study groups, as suggested by Salmi et al. (2013). Lastly, we included an autocorrelation term to account for temporal interdependence between data points (Table 2). A significant autocorrelation term indicates that observations made temporally close to one another are similar (see Online Resource 1 for detailed information on how the autocorrelation term was calculated).
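In generic notation, a logistic mixed model with a random intercept for caller can be sketched as follows. This shows only the general model form; the symbols are illustrative labels introduced here, and the exact set of fixed effects is the one listed in Table 2.

```latex
\begin{aligned}
\operatorname{logit}\,\Pr(y_{ij} = 1) &= \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + \gamma\, a_{ij} + u_j, \\
u_j &\sim \mathcal{N}(0, \sigma_u^2)
\end{aligned}
```

Here $y_{ij}$ is the binary response for call $i$ of caller $j$ (which of the two compared call types was given), $\mathbf{x}_{ij}$ collects the context variables and their interactions, $a_{ij}$ is the autocorrelation term, and $u_j$ is the random intercept of caller $j$.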

Table 2 Fixed and random effects included into models used to compare the contexts of the different unit types A1, T2 and T4, and their combinations

For all models, we tested for collinearity between the predictor variables by calculating variance inflation factors (VIF, Field 2005), using the function vif of the R package “car” (Fox and Weisberg 2011) based on a standard linear model excluding the random effect (R function “lm”). We found a maximum VIF = 2.03 across models and predictors, which is considered to be unproblematic (Field 2005).
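For illustration, the variance inflation factor of a predictor is 1 / (1 − R²), where R² comes from an ordinary linear regression of that predictor on all other predictors. The sketch below is a plain-Python stand-in for this calculation, not the “vif” function of the R package “car” used in the analysis; the predictor names and 0/1 codings are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(y, xs):
    """R^2 of a least-squares regression of y on the columns in xs plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in xs] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(design[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * dj for bj, dj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vifs(predictors):
    """Return {name: VIF} for a dict of equally long numeric predictor columns."""
    out = {}
    for name, column in predictors.items():
        others = [col for other, col in predictors.items() if other != name]
        out[name] = 1.0 / (1.0 - r_squared(column, others))
    return out


# Hypothetical 0/1-coded context variables for eight calls.
example = {
    "activity": [0, 0, 0, 0, 1, 1, 1, 1],
    "within_5m": [0, 0, 0, 1, 0, 1, 1, 1],
}
print(vifs(example))
```

Values close to 1 indicate that a predictor is nearly uncorrelated with the others; the sketch fails with a division by zero if one predictor is a perfect linear combination of the rest.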

As an overall test of the influence of our context variables on the probabilities of the combinations, we compared the full model with all predictor variables, to a null model only including the autocorrelation term and the random effects variable “caller” (Forstmeier and Schielzeth 2011) using a likelihood ratio test (Dobson 2002) and the R function “anova”. In case the full-null model comparison revealed significance and the interactions did not, we excluded them (beginning with the three-way interaction) to obtain interpretable p values for the respective main effects or two-way interactions.
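The logic of such a full-null comparison can be sketched in a few lines of Python. The test statistic is twice the difference in log-likelihood between the full and the null model, compared against a chi-square distribution whose degrees of freedom equal the number of additional parameters in the full model. The log-likelihood values and degrees of freedom below are hypothetical, and the helper functions are illustrative stand-ins for what R's “anova” reports.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from the closed forms for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # ... and step up via the recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_full, loglik_null, df):
    """Return (chi-square statistic, p value) for nested models."""
    stat = 2.0 * (loglik_full - loglik_null)
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihoods; df = number of fixed-effect terms dropped in the null model.
stat, p = likelihood_ratio_test(loglik_full=-150.0, loglik_null=-160.0, df=8)
print(f"chi2 = {stat:.2f}, df = 8, p = {p:.4f}")
```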

To rule out that certain individuals were particularly influential for the results, we removed individuals one at a time from the data, ran the model again, and compared the estimated coefficients with those derived from the data set including all individuals. Across models, estimates derived from data sets with individuals removed in part differed considerably from estimates derived from models based on all data. This was particularly the case for predictors being nonsignificant in the model based on all data. From this, we concluded our nonsignificant findings need to be treated somewhat cautiously (i.e., effects could actually be stronger than our models suggest). However, we did not detect any indication that our data sets contained individuals that could destabilize the estimates of significant predictor variables.
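The procedure can be sketched generically as a leave-one-caller-out loop. In the Python sketch below, the GLMM refit is replaced by a simple stand-in estimator (a log odds ratio with a 0.5 continuity correction), which is an assumption made only to keep the example self-contained; the row layout (caller, context, response) and all data are hypothetical.

```python
import math


def log_odds_ratio(rows):
    """Stand-in for a model fit: log odds ratio of the response between contexts 1 and 0.

    A continuity correction of 0.5 per cell keeps the estimate finite.
    """
    counts = {(c, r): 0.5 for c in (0, 1) for r in (0, 1)}
    for _caller, context, response in rows:
        counts[(context, response)] += 1
    return math.log(
        counts[(1, 1)] * counts[(0, 0)] / (counts[(1, 0)] * counts[(0, 1)])
    )


def leave_one_caller_out(rows, estimator):
    """Return the estimate from all data and the estimates with each caller removed."""
    full = estimator(rows)
    callers = sorted({row[0] for row in rows})
    dropped = {
        caller: estimator([row for row in rows if row[0] != caller])
        for caller in callers
    }
    return full, dropped


# Hypothetical data: (caller, context 0/1, response 0/1).
rows = [
    ("a", 1, 1), ("a", 1, 1), ("a", 0, 0), ("a", 0, 1),
    ("b", 1, 1), ("b", 0, 0), ("b", 0, 0), ("b", 1, 0),
    ("c", 1, 1), ("c", 0, 0), ("c", 1, 0), ("c", 0, 1),
]
full, dropped = leave_one_caller_out(rows, log_odds_ratio)
print(f"all callers: {full:.2f}")
print(f"range with one caller removed: {min(dropped.values()):.2f} to {max(dropped.values()):.2f}")
```

A wide range of estimates relative to the estimate from all data would point to an influential individual; in the actual analysis the estimator would be the full mixed model.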

Due to complete separation, i.e., the response being completely explained by a specific predictor (Field 2005), model estimates could sometimes not reliably be calculated (indicated by heavily inflated estimates with extremely large standard errors for some of the predictors). In order to derive interpretable estimates for the respective predictors, the problematic cases were identified and the following randomization method was applied. We reran a given model a number of times corresponding to the number of problematic cases (the number of times the response always being “0” in a specific context), whereby for each model we exchanged one of the “0” cases with a “1.” Model estimates were then derived as the mean of the estimates derived from each of these models. Note that this approach is conservative in that it weakens the effect of the respective predictor on the response.
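A minimal Python sketch of this idea follows. It assumes a single binary context predictor, for which the logistic regression slope has a closed form (the difference in logits of the response proportions), so no mixed model is fitted; the function names and data are hypothetical. In this simplified setting every refit yields the same estimate, whereas in a mixed model the refits can differ because the exchanged case belongs to different callers.

```python
import math


def logit(p):
    """Log odds of a proportion; raises ValueError for p = 0 (complete separation)."""
    return math.log(p / (1.0 - p))


def context_effect(rows):
    """Closed-form slope of a logistic regression with one binary predictor.

    rows: list of (context, response) pairs, both coded 0/1.
    """
    def proportion(ctx):
        responses = [r for c, r in rows if c == ctx]
        return sum(responses) / len(responses)

    return logit(proportion(1)) - logit(proportion(0))


def averaged_estimate(rows, fit, context=1):
    """Refit once per problematic case, exchanging one '0' for a '1' each time, and average."""
    problematic = [i for i, (c, r) in enumerate(rows) if c == context and r == 0]
    estimates = []
    for i in problematic:
        modified = list(rows)
        modified[i] = (context, 1)  # exchange a single "0" response for a "1"
        estimates.append(fit(modified))
    return sum(estimates) / len(estimates)


# Hypothetical data: the response is always 0 in context 1 (complete separation).
rows = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 0), (1, 0)]
print(round(averaged_estimate(rows, context_effect), 3))
```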

Following Barr et al. (2013), we aimed to include random intercepts and random slopes for each predictor variable to model random variation in the mean response and the strength of the effect of the predictors among the different individuals. However, complete separation of the response variable for the levels of the predictor variable in the majority of individuals and models made it impossible to include random slopes for all the predictor variables simultaneously. Therefore, we first ran all models with only the random intercept of “caller.” To avoid model instability, we ran models only including individuals for which at least one observation for both of the combinations to be compared was available. Subsequently, we reran the models with random slopes for the significant predictor variables. In those cases, we present the results of the model without random slopes for the nonsignificant predictors and for the significant predictors the results of the model including its random slope. This approach seems justified since neglected random slopes lead to p values being too small (Schielzeth and Forstmeier 2009; Barr et al. 2013), so they should not be an issue for nonsignificant predictors.

Results

Unit types used singularly

Overall, we found contextual differences in the probability of occurrence of T4 compared to both A1 and T2, but not between A1 compared to T2 (likelihood ratio tests comparing the full models including all context variables with a model comprising only the autocorrelation term and the random effect; Tables 1, 3). It appeared that T4, compared to T2, was significantly less likely to occur during resting as compared to feeding and in the presence of another weaned individual (Table 4). Furthermore, T4, compared to A1, was less likely to be emitted in the presence of an infant. The influence of the presence of other weaned individuals was different between the two study groups as indicated by a significant interaction between individuals within 5 m and group (Table 5). While in the mountain gorilla group, the presence of other weaned individuals had no obvious effects on the probabilities of T4 and A1, in western gorillas T4 was more likely when no one was present within 5 m (Fig. 3).

Table 3 Summary of results concerning contextual differences between unit types used singularly and their combinations
Table 4 Results of final model comparing the contexts of unit types T4 and T2 used singularly after the removal of the nonsignificant three-way interaction between activity, individual within 5 m and group (estimate = 0.49, SE = 2.17, p = 0.819), and two-way interactions (activity and individual within 5 m: estimate = −0.86, SE = 0.77, p = 0.266; activity and group: estimate = 2.10, SE = 1.30, p = 0.105; individual within 5 m and group: estimate = −1.66, SE = 1.00, p = 0.097)
Table 5 Results of final model comparing the contexts of unit types T4 and A1 used singularly after the removal of the nonsignificant three-way interaction between activity, individual within 5 m and group (estimate = −1.74, SE = 1.66, p = 0.294), and the nonsignificant two-way interactions (activity and individual within 5 m: estimate = 0.03, SE = 0.83, p = 0.974; activity and group: estimate = −1.18, SE = 0.88, p = 0.181)
Fig. 3

Effect of the presence of weaned individuals within 5 m of the caller (yes/no) on the proportion of T4 as compared to A1 given by the individuals of the two study groups. Black dots represent the individuals of the two groups. While the presence of weaned individuals had no influence on the probabilities of T4 and A1 in mountain gorillas, in individuals of the western gorilla group T4, as opposed to A1, was more likely when no other weaned individual was in close proximity

Comparisons of unit types combined with others and used singularly

Overall, we found contextual differences in the probabilities of unit types used singularly compared to their combinations, but not between the combinations T2_T4 and A1_T4 (likelihood ratio tests, Tables 1, 3). We found that T2_T4, compared to T2, was less likely to be given spontaneously (Table 6). T2_T4, compared to T4, was also less likely to be given spontaneously, more likely during resting and in the presence of an infant within 5 m, and furthermore, less likely to be given by the individuals of the western gorilla group (Table 7). Similarly, A1_T4, compared to T4, was less likely to be given spontaneously, more likely during resting, and less likely to be given by the western gorilla individuals (Table 8). Moreover, A1_T4, compared to A1, was less likely to be given spontaneously (Table 9). Finally, T2_A1 was less likely to be given by the western gorilla individuals compared to T2 (Table 10).

Table 6 Results of final model comparing the contexts of the combination T2_T4 and unit type T2 used singularly after the removal of the nonsignificant three-way interaction between activity, individual within 5 m and group (estimate = −0.17, SE = 2.24, p = 0.915), and nonsignificant two-way interactions (activity and individual within 5 m: estimate = −1.91, SE = 0.85, p = 0.16; individual within 5 m and group: estimate = −1.58, SE = 1.27, p = 0.214; activity and group: estimate = 1.06, SE = 1.33, p = 0.426)
Table 7 Results of final model comparing the contexts of the combination T2_T4 and unit type T4 used singularly after the removal of the nonsignificant three-way interaction between activity, individual within 5 m and group (estimate = 0.37, SE = 0.89, p = 0.683), and the nonsignificant two-way interactions (activity and individual within 5 m: estimate = 0.39, SE = 0.44, p = 0.382; individual within 5 m and group: estimate = −0.04, SE = 0.45, p = 0.924; activity and group: estimate = 0.27, SE = 0.45, p = 0.549)
Table 8 Results of final model comparing the contexts of the combination A1_T4 and unit type T4 used singularly after the removal of the nonsignificant three-way interaction between activity, individual within 5 m and group (estimate = 1.69, SE = 1.20, p = 0.159), and nonsignificant two-way interactions (activity and individual within 5 m: estimate = −0.23, SE = 0.54, p = 0.667; individual within 5 m and group: estimate = −0.23, SE = 0.58, p = 0.699; activity and group: estimate = 0.43, SE = 0.56, p = 0.436)
Table 9 Results of final model comparing the contexts of the combination A1_T4 and unit type A1 used singularly after the removal of the nonsignificant three-way interaction between activity, individual within 5 m and group (estimate = −0.85, SE = 1.86, p = 0.649), and the nonsignificant two-way interactions (activity and individual within 5 m: estimate = −0.11, SE = 0.90, p = 0.901; individual within 5 m and group: estimate = −1.83, SE = 0.95, p = 0.054; activity and group: estimate = −0.56, SE = 0.92, p = 0.547)
Table 10 Results of final model comparing the contexts of the combination T2_A1 and unit type T2 used singularly after the removal of the nonsignificant three-way interaction between activity, individual within 5 m and group (estimate = −0.69, SE = 2.57, p = 0.786), and nonsignificant two-way interactions (activity and individual within 5 m: estimate = −0.37, SE = 1.05, p = 0.726; individual within 5 m and group: estimate = 0.59, SE = 1.60, p = 0.711; activity and group: estimate = −0.16, SE = 1.46, p = 0.914)

Sequential order and repetitions

Our results indicated no contextual differences in the probability of combinations with reversed order or unit types used singularly compared to combinations containing their repetition (likelihood ratio tests; Tables 1, 3).

Discussion

Previously, Hedwig et al. (2014) demonstrated that gorillas combine acoustic unit types non-randomly. As expected for syntactic systems, here we found that the combinations of unit types occurred in contexts differing from those of unit types used singularly. While the atonal grunt A1 and tonal grunt T2 used singularly appeared to be more likely during resting and in close proximity to other individuals, the grumble T4 used singularly was more likely during foraging and when individuals were alone. The combinations A1_T4 as well as T2_T4 were most likely to occur in situations resembling the contexts of A1 and T2 used singularly, but were also more likely to be given as reply calls, whereas all three unit types used singularly were more likely to be given spontaneously. We found no contextual differences between A1 and T2 used singularly and combinations consisting of repetitions of them or between combinations of them given in the two different orders. This is likely due to the fact that we found no differences between A1 and T2 used singularly with regard to the contexts in which they appeared. It is possible that the distinction between atonal and tonal grunts may not be relevant to the gorillas, or it may require a more detailed analysis of acoustic or contextual variation to detect such effects.

What do these results tell us concerning the three goals we pursued in this study? Our first goal was to investigate whether gorilla close calls provide indications for lexical or phonological syntax. If lexical, combinations should be compositional in the sense that the information content of combinations can be derived from the information content of each of their unit types. Hence, we expected contextual overlap between a combination and each of its unit types used singularly. However, if phonological, the combination would occur in a different context from its unit types. The combinations A1_T4 and T2_T4 occurred in similar contexts to A1 or T2, but not T4. In fact, the combinations occurred in a context different from that of any of their unit types used singularly. T2 or A1 in combination with T4 was also used during resting and when in close proximity to other individuals, but additionally as reply calls. As such, when in combination with A1 or T2, T4 appeared to have lost its independent information content and acted as a modifier of T2 or A1. Therefore, combining unit types seems to enable gorillas to increase the number of messages that can be conveyed by only a small repertoire of five unit types, which is the key benefit of syntax (Nowak et al. 2000; Hauser et al. 2002; Fitch 2005). However, our results provide little evidence for lexical syntax because combinations did not appear to be compositional, as their information content cannot be derived from the information content of each of their components. Hence, it is most parsimonious to conclude that the combinations investigated in this study constitute a form of phonological syntax.

The lack of complete contextual overlap between component calls used singularly and their combinations clearly differentiates our findings from previous examples for rudimentary forms of lexical syntax in animal vocalizations. (1) Our finding of contextual modification does not comply with the contextual modifications put forth as indicators of lexical syntax (e.g., Robinson 1984), in which, for example, alarm call combinations are produced in alleviated alarm contexts (Saguinus oedipus: Cleveland and Snowdon 1982; Cercopithecus diana: Zuberbühler 2002; Ouattara et al. 2009). In contrast, the combinations investigated here were associated with a contextual modification for only one unit type (A1 or T2), whereas the context of the other component (T4) was not reflected in the use of the combination. (2) Our results neither provide an indication that the information content of calls was combined in a purely additive fashion in that compound calls are used in situations when the contexts of component calls are co-occurring (e.g., Pan troglodytes: Crockford and Boesch 2005). Instead, in our study, combining was associated with a context (used as a reply call), which was unspecific to any of the component calls when used singularly. As such, our findings on gorilla close calls differ from the findings of a previous study, which suggests that the close call combinations in Diana monkeys function additively because combinations comprised a call acting as individual identifier and a call providing information about the particular context (Candiotti et al. 2012). Interestingly, our findings also show some deviation from previous examples for phonological syntax. Marler (1977) defined phonological syntax twofold. (1) It encompasses combinations of sounds without independent informational content, as found in the often stereotyped sequences of gibbons (Hylobates agilis: Mitani and Marler 1989). 
In gorillas, however, all unit types seem to have an independent information content as all of them are used singularly. Moreover, gorilla close call combinations are not produced in a stereotyped fashion but occur in great diversity, as they are produced in a flexible yet non-random way (Hedwig et al. 2014). (2) Marler’s (1977) definition also included the combination of independently used sounds which lose their original information content. However, in the investigated gorilla close call combinations, only one unit type (T4) appeared to have lost its original context, while the context of the other unit types (A1, T2) was reflected by the contextual use of the compound call. Nevertheless, the phonological syntax that we suggest underlies the combining of independently used unit types in gorillas may provide an intriguing analogy to linguistic phenomena: in human language, too, independently meaningful words that are used to construct sentences (lexical syntax) are frequently used as syllables to construct new words with new meaning (phonological syntax), with their original information content either being lost (e.g., “to prod” in “prod-igy”) or maintained (e.g., “egg” in “egg-nog”).

Our study investigated the contextual use of only a few combinations within the large syntactic variation in gorilla close calls (see Hedwig et al. 2014). In order to fully comprehend the functioning of gorilla syntax, two main suggestions arise for further studies. (1) More combinations need to be investigated in order to gain a contextual description of syntactic variation as comprehensive as that available for chimpanzees (Crockford and Boesch 2005). In chimpanzees, evidence for lexical syntax was found for some but not all call combinations (Crockford and Boesch 2005), which may not be unexpected assuming an evolutionary scenario in which lexical syntax gradually became a part of communication within the human lineage (e.g., Jackendoff 1999). (2) Based on such thorough contextual analyses, playback experiments need to be designed to explore the meaning of calls and their combinations by investigating how listeners react to specific single calls and how this changes when calls are combined. Putty-nosed monkeys, for example, seem to perceive their compound calls holistically, not compositionally (e.g., Arnold and Zuberbühler 2012).

Our second goal was to investigate possible functions of the gorilla close calls. We found that the investigated combinations occurred in two very different overall contexts, suggesting different functions. T4 appeared to be emitted particularly when foraging and when alone, both situations in which the coordination of group movements is likely to be crucial, either to bring group members together at a specific feeding site or to maintain spacing while feeding, as well as to maintain contact when group members are spread out. As such, following our predictions, these results suggest that the grumble T4 may act as a signal of location. The acoustic structure of T4 appears particularly appropriate for a signal of location since its longer duration compared to T2 and A1 increases its detectability (Klump and Maier 1990; Nemeth et al. 2006). However, future studies are needed to explore the possibility that the acoustic variation within and between gorilla call combinations is adaptive in order to enhance signal transmission (Hedwig et al. in prep). In contrast, T2 and A1, as well as A1_T4 and T2_T4, seemed more likely to occur when in close proximity to other individuals as well as during resting, when affiliative social interactions usually take place. That the presence of an infant was an apparent trigger for the emission of these calls indicates a benign state of the callers, suggesting a role in facilitating affiliative social interactions, similar to the close-distance calls of various other primate species (Saimiri sciureus: Biben et al. 1986; Macaca fuscata: Masataka 1989; Macaca arctoides: Bauers and de Waal 1991; Bauers 1993; Papio cynocephalus ursinus: Cheney et al. 1995; Macaca mulatta: Whitham et al. 2007). Overall, our findings regarding the function of these calls are based on a rather crude contextual analysis, and a more detailed analysis is necessary.
Moreover, considering that silverbacks play an important role in the gorilla social system, it is possible that the same vocalization might serve slightly different functions depending on the sex of the caller. As stated above, playback experiments investigating how listeners react to calls are ultimately needed in order to fully understand the function of these calls.

One finding of our study was that the combinations A1_T4 and T2_T4 were especially likely to be given as reply calls. This is particularly interesting since vocal exchanges constitute a significant aspect of gorilla vocal behavior (e.g., Harcourt et al. 1993; Harcourt and Stewart 1996). Such temporally associated calling behavior is common among primate species and has mainly been described for long- and short-distance contact calls, which may signal identity and location when out of visual contact (e.g., Cebuella pygmaea: Snowdon and Cleveland 1984; Saimiri sciureus: Biben et al. 1986; Macaca fuscata: Mitani 1986; Saguinus oedipus: Ghazanfar et al. 2002). Acoustic modification of reply contact calls appears to be infrequent and has been suggested to confirm the reception of the signal to a specific caller and allow for a directional exchange of information, to enhance locatability of the response call (Macaca fuscata: Sugiura 1993, 1998), or to terminate the exchange of contact calls (Saimiri sciureus: Biben et al. 1986). Corroborating previous studies, our results suggest that the employment of acoustically distinct reply calls may be a relevant component of gorilla vocal behavior, which may go beyond exchanging information about the group members’ locations. In this study, the reply call was most likely to be given during resting and when close to other group members, which supports our hypothesis that these calls serve to coordinate affiliative interactions. A previous study reported acoustically distinct variants of double grunts in mountain gorillas used either as a reply or spontaneously (Seyfarth et al. 1994), which may function to coordinate competitive interactions or the initiation of group movements (Seyfarth et al. 1994; Stewart and Harcourt 1994). Moreover, western gorillas reply with numerous close calls to a long-distance contact call (i.e., hoot series; Salmi and Doran-Sheehy 2014; Hedwig unpublished data).
Given the increasing number of studies revealing high degrees of social cognition in ape vocal behavior (e.g., Pan troglodytes: Crockford et al. 2004; Slocombe and Zuberbühler 2005; Crockford et al. 2012; Gruber and Zuberbühler 2013; Schel et al. 2013; Pan paniscus: Clay and Zuberbühler 2011; Pongo spec.: Wich et al. 2012, but see Owren et al. 2011), the employment of such acoustically different reply calls with perhaps different informational content in gorillas may reflect a particularly flexible and expandable comprehension of vocal signals in great apes (e.g., Seyfarth and Cheney 2010). We believe that the analysis of the outcome of vocal exchanges (see Harcourt and Stewart 1996) on a detailed acoustic and contextual level will be key to testing functional hypotheses generated by more general contextual studies like the one presented here.

Our third goal was to test for contextual differences in the same vocalizations used by individuals of the mountain and western gorilla groups. In both groups, the presence of an infant had a significant positive effect on the emission of A1, as opposed to T4, whereas the presence of a weaned individual had a positive effect on the use of A1, as opposed to T4, only in the western gorilla group. This is particularly interesting since western gorillas seem to show larger group spread compared to mountain gorillas (e.g., Doran and McNeilage 2001; Goldsmith 2003; Bermejo 2004; Hedwig unpublished data). Therefore, the selective pressure on the evolution of signals of location, such as our results suggest for T4, may have been particularly strong for western gorillas. As such, our findings support a previous study by Salmi et al. (2013) suggesting contextual differences in the use of close calls in mountain and western gorillas. However, Salmi et al.’s (2013) findings indicate a more context-specific use of vocalizations in western gorillas compared to mountain gorillas due to the more restricted visibility in their lowland rainforest habitat. In contrast, our results suggest that due to their larger group spread, western gorillas may use T4 as a general signal of location, as it was elicited by the absence of both infants and weaned individuals, whereas in the cohesive mountain gorillas, T4 may function as a signal of location specifically triggered by the absence of an infant. However, species differences should be inferred with caution given that our comparison was based on only two groups and a small number of individuals.
Further comparative studies (based on more social groups) investigating the contextual correlates of the syntactic variation within and between species need to be carried out in order to reliably assign differences in the vocal behavior to differences in the two species’ social behavior (e.g., Doran and McNeilage 2001) or habitat characteristics (e.g., Nkurunungi et al. 2005).

In conclusion, we investigated contextual correlates of the syntactic variation within gorilla close calls, which previously appeared rather enigmatic due to apparently unspecific contextual use (e.g., Harcourt et al. 1993; Salmi et al. 2013). Even with a rather crude level of contextual analysis, we were able to elucidate potential functions of different gorilla close calls, suggesting that the syntactic approach followed here may be a fruitful framework for future studies. Based on our results, we hypothesize that the combination of acoustic units into compound vocalizations by gorillas constitutes a form of phonological syntax, which may enable them to increase the number of messages that can be transmitted by an otherwise small repertoire of acoustic units. However, future studies are needed to test this hypothesis by including more combinations in the analysis and, in particular, by using playback experiments to investigate how gorillas perceive the unit types when used singularly and when combined. Together with the few previous studies investigating syntactic structures in ape vocal communication (Crockford and Boesch 2005; Clay and Zuberbühler 2011; Lameira et al. 2013), our results provide a baseline for future studies aiming to achieve a more comprehensive understanding of great ape vocal communication and syntactic capabilities, which is crucial to our understanding of the evolutionary origins of human language.