Introduction

Language evolution is a fascinating puzzle—or more accurately, a series of puzzles—requiring a piecemeal and integrative approach. The puzzle piece that I consider in this article has become known as the (hypothesised) gesture-to-vocal transition. The road ahead is as follows. In the next section, “Protolanguage”, I consider the leading hypotheses about the nature/origins of protolanguage.Footnote 1 I argue for a largely gestural conception of protolanguage, though that is not to say that vocalizations were absent from the picture.Footnote 2 Even so, theorists who posit that the origins of language are (largely) gestural, as I do, owe an explanation of the shift from a (largely) manual form of protolanguage to full speech (“Finding Voice?”). I consider gestural theorist Michael Corballis’s most recent explanation of such a shift.Footnote 3 I critique aspects of Corballis’s account (“Corballis tackles the gesture-speech transition”), and present my alternative, bringing musicality into the story (“Musicking across the gap”). Note that even if one does not accept the gesture-dominant conception of protolanguage, considerations of vocal musicality are still relevant in explaining how hominins shifted from fairly inflexible, chimpanzee-like vocal communication to a flexible vocal protolanguage (perhaps heavily assisted by gestures). Finally, I identify some priorities for future research, including echo phonology and the prospects for its import to evolutionary theorizing (“Towards arbitrariness: the echo phonology hypothesis”), and then sum up.

Protolanguage

In this section I consider three leading hypotheses about the origins and nature of hominin protolanguage. The first amounts to a musical conception of protolanguage, sometimes called a musilanguage (Darwin 1871; Brown 2000; Mithen 2005; Fitch 2010; Lawson 2014). The idea is that music and language share a single common ancestor—a vocal musical protolanguage—although the specifics differ from advocate to advocate. On this view, protolanguage in the ancestral hominin lineage is discontinuous with nonhominin great ape intentional communication (great apes do not sing), and is instead an example of convergence (e.g., with gibbon song, whale song, bird song). The protosyntax of musical protolanguage is generally considered by advocates of the hypothesis to be holistic—consisting of complete and unanalysable phrases with a “whole message”,Footnote 4 which fractionate over time as syntax evolves.

According to the second hypothesis, although protolanguage is vocal, it is not “musical” but word-like or “lexical”. On this view, protolanguage resembles not animal song, but a referentially displaced version of typical primate vocalization, for example the much-discussed vervet monkey alarm calls (three distinct vocal calls for three different predators/evasion strategies). Our hominin ancestors, one version of the story goes, gained more and more complex mental representation systems until they could divide the world up into “word-size” pieces (Bickerton 1990). Once that occurred, protolanguage “popped out” as the pieces received lexical labels. This enabled basic protolinguistic communication: ‘in protolanguage, the speaker thought of a word and then transmitted it directly to the organs of speech, then the next, and the next, without linking them in the brain prior to utterance’ (Bickerton 2009, p. 232).

However, voice-dominant hypotheses about protolanguage are not well supported. Primate vocalization and human language are handled by different brain areas. Homologues of Wernicke’s and Broca’s areas (crucial to language processing in humans) exist in primates but handle gesture rather than vocalization (see Rizzolatti and Arbib 1998; Kohler et al. 2002). Neocortical neural structures in general are not used in primate vocalization, which appears to be controlled by the limbic system. Primate vocalization is largely involuntary.Footnote 5 The vervet alarm calls, to take an example, are symptomatic: they are reactive, automatic responses to perceived stimuli. Typically, vervets can neither withhold such calls nor produce them on demand.

Interestingly, some apes seem to be “aware” of the largely involuntary nature of their vocalizations. Jane Goodall (1986) tells of a young chimp, excited upon discovering a banana, that ingeniously suppressed its pant hoots by muffling its mouth with its hand, so as not to inform the bigger chimpanzees of its discovery. Notice that the vocalization was affective and automatic, not intentional; the chimp’s voluntary control was in the manual, not vocal, domain. Even gibbon song, as impressive as it is to our musical sensibilities, is not flexible or under voluntary control, but reactive/automatic and affective (see Geissmann 2002). Top-down (voluntary/intentional) aspects of great ape communication are largely gestural, not vocal (Hobaiter and Byrne 2014; Pika and Mitani 2006). That said, recent research indicates that chimpanzees have the capacity for voluntary vocal control (Slocombe and Zuberbühler 2007; Watson et al. 2015; Schel et al. 2013; Crockford et al. 2012; Fitch and Zuberbühler 2013). So, presumably, some basic and restricted form of intentional vocal control, and intentional response to conspecific vocalization, was available to the Pan-Homo last common ancestor, and ancient hominins (and chimpanzees and bonobos) have been incrementally building on it.

The third hypothesis, then, amounts to a gestural conception of protolanguage (more accurately: a gesture-dominant conception—as just noted, vocalizations were present tooFootnote 6). Advocates of this view include Corballis (2002, 2009, 2017), Sterelny (2012, 2018), Hurford (2014), Tomasello (2008) and Hewes (1973).Footnote 7 This view posits straightforward continuity with great ape intentional communication. Intentional human gestural and vocal communication today employ similar neural systems (Newman et al. 2002; Kimura 1993). These systems are largely distinct from areas of the brain associated with emotion—the areas typically employed in primate vocal communication, and in automatic human utterances (e.g., reflex swearing and grunting, which are regulated by different neural structures than ordinary language).Footnote 8 And to the extent that ontogeny is thought to recapitulate phylogeny, note that voluntary gestural communication precedes voluntary vocal communication in human infant development (see e.g. Goldin-Meadow and Alibali 2013; Esteve-Gibert and Prieto 2014). From about 10 months, infants point in order to single out an object of interest, for example. And pointing, both by infants and carers, is a means by which the infant reduces referential ambiguity (Kalagher and Yu 2006; O’Neill et al. 2005). From 10 to 24 months, infants mimic the pantomime and playful gestures of carers. And young toddlers spontaneously produce iconic gestures from around 2 years of age—that is, around the same time as they begin to make two-word utterances (see Behne et al. 2014). Young deaf children exposed to sign language learn it in much the same way as they would learn speech, including going through a manual babbling phase (Petitto and Marentette 1991). And deaf carers using sign language tend to produce slower, larger, more exaggerated gestures, and more repetition, when interacting with deaf infants than with signing adults (Masataka 1992)—a gestural version of “motherese” (i.e., infant-directed speech).

In the gestural domain there is more scope for iconicityFootnote 9 (with due caution: see Irvine 2016 for developmental limitations). The hands and arms can mime the physical shapes of objects, people, animals, and actions. And the intended meaning can often be easily comprehended (perhaps after a little practice/experience). Kim Shaw-Williams has pointed out to me just how difficult it is to describe the rather basic concept spiral in words; yet all it takes to communicate the idea is to draw a spiral shape in the air with a pointed finger—plausibly, the same goes for communicating directions and pathways (e.g., which route we shall take on today’s hunt—and indeed for communicating during the hunt). The meaning of an iconic pot-stirring gesture, in the right context (say, a child helping a parent cook), is easily expressed and grasped. As Begby (2017) points out, such iconic gestures are spontaneously utilised and developed into idiosyncratic homesign systems by deaf children born to non-sign-language-using parents. These children are without access to a public conventional language, yet they, and their families, have no problem establishing novel gestural means of communication, and over time these homesign systems develop to exhibit morphological and syntactic structure (Goldin-Meadow 2003). Moreover, even today, modern sign languages comprise many iconic gestures: see Fig. 1 for the signs for hammer (resembling the action of using a hammer) and deer (resembling the shape of a deer’s antlers) in British Sign Language (BSL).Footnote 10

Fig. 1 Iconic signs in BSL: a hammer; b deer. Extracted from Perniss and Vigliocco (2014), reproducible under the terms of the Creative Commons Attribution License 3.0.

Advocates of the gestural hypothesis stress gestural communication’s plausibility as a means for getting protolanguage established in our smaller-brained hominin ancestors.Footnote 11 And they stress the intentional gestural communication of chimpanzees and bonobos, our closest great ape cousins (e.g. Pollick and de Waal 2007). Hostetter et al. (2001) point out that chimps will use pointing gestures with human experimenters to signify a desire (e.g., they will point to a banana, wanting the human to pass it to them), if the human is looking (indicating that the gesture is intentional; Tomasello 2008). And they often gesture to one another. Hobaiter and Byrne’s important study, over the course of 266 days in Budongo National Park in Uganda, noted 4397 gestures of 66 kinds, such as stop that, follow me, go away, mate with me (see Hobaiter and Byrne 2011, 2014). Anecdotal reports of field researchers describe (admittedly rare) instances of chimpanzee gestural “showing-how” communication. For example, a chimpanzee mother has been observed miming, in slow motion, how to hold a stone for nut-cracking, to her young daughter who hadn’t quite got the hang of it yet (Boesch 1993; see also Boesch and Boesch-Achermann 2000; Russon and Andrews 2011). Although attempts to teach apes to speak have been abject failures, human-trained great apes such as Kanzi have been taught sign systems rather effectively (albeit with limits on the number of elements that can be combined in a sentence—limits that even young human children surpass), and with an impressive mastery of the distinction between proper names and general categories. Individual great apes can be trained to communicate via sign because their genetic infrastructure enables it—thus great ape species are merely selection-limited with respect to protolanguage, not variation-limited (see Számadó and Szathmáry 2006 for the distinction).Footnote 12 In other words, as concerns protolanguage, there is no need to posit a discontinuity: chimpanzees and bonobos have “proto-protolanguage capacities” and are thus protolanguage-ready; they just need a context in which it pays them to develop protolanguage. I suspect we can safely presume that the Pan-Homo last common ancestor was likewise protolanguage-ready.Footnote 13

The brain science available also supports the gestural view. Recall that Broca’s area is a major neural site of language processing in humans (e.g. Novick et al. 2010); a homologous site is present in other primates but does not respond to conspecific vocalizations (Kohler et al. 2002). In macaques, this is where the famous mirror neurons were discovered—neurons that correlate the observation and production of gestures (e.g. Gallese et al. 1996; Rizzolatti et al. 1988; Rizzolatti and Arbib 1998). In humans today there is much overlap in the brain’s processing and control of intentional gesture/sign language and of speech (Rizzolatti and Craighero 2004; Capek et al. 2008; Newman et al. 2010), suggesting that in hominin evolution brain regions for primate gestural communication were recruited for modern speech (Hurford 2014).

So, henceforth, I shall assume a gesture-dominant view of protolanguage.Footnote 14 Yet the “sticking point” of gestural accounts is the need to explain how language shifted to a largely vocal medium (Burling 2005). In the next section, I spell out this sticking point.

Finding voice?

One explanatory challenge for gestural theorists is to specify how a predominantly gestural intentional communicative system transitioned to a predominantly vocal one (at least, among hearing individuals). Yet notice that we all (or at least the vast majority of us) continue to gesture while communicating vocally, sometimes without realizing it. Many of us cannot give a lecture, conference presentation, or public speech, provide directions to a lost tourist, or explain how to execute a physical task to a novice, for example, without moving our arms about or gesturing in an expressive way. We point to things as an index; we accompany everyday speech with iconic pantomimic gestures (e.g., push gestures, rotate gestures).Footnote 15 We cup our ears to signify that we didn’t hear our interlocutor; we wave, we give thumbs-up, thumbs-down, and we flip the bird. Nonetheless, the fact remains that for real-time intentional linguistic communication, vocalizations are now primary.

It is perhaps unsurprising that if language’s origins were gestural, such a transition would have taken place. There are a number of plausible adaptive advantages to such a shift (properly qualified, of course; modern sign languages are used by deaf people just as effectively as spoken languages are by hearing people). For our hominin ancestors these include:

  1. Lower time-costs; speaking is often quicker than gesturing.

  2. Use in the dark, in dimly lit spaces (especially as fire control takes off, extending the usable hours of the day for communication and other social pursuits), and in environments with limited conspecific visibility (like wooded or tall-sedge areas).

  3. Use when one’s hands are occupied, or body occluded in general (e.g. in wading or bathing), as well as for pedagogical purposes (e.g. in “showing how” demonstrative teaching and learning) and so on.

  4. To better address individuals not in one’s front line of sight (Elizabeth Irvine has pointed out that there are serious constraints on gestural communication when there are more than 3 or 4 interlocutors—see also Sterelny 2016).

  5. Freeing the eyes in general. Interlocutors can visually attend to one another’s activities (e.g., in toolmaking or teaching/learning) or look outside of the immediate conversational context.

  6. For signalling signalhood (making articulate speech sounds is only good for speech; a wave might be to say hello or to shoo flies, an arm-up gesture could be confused with a stretch).

  7. And once the transition is made, speech is metabolically less costly than gesture, requiring little more energy expenditure than that of breathing.

Yet the task remains to explain how the transition could have occurred—to ask what enabled it, its adaptive advantages notwithstanding. I take it that there are some salient constraints on giving a satisfactory explanation. The explanation has to be phylogenetically plausible. It has to be empirically constrained (i.e., tied where possible to the palaeoanthropological record) in order to explain why the transition occurred when it did. It must not rely on low-probability events or “magic bullet” scenarios. It must explain the move from iconic and indexical communication to a system comprising predominantly arbitrary symbols. And finally, it must be compatible with the great complexity of vocal language production and comprehension (which in turn enables the great diversity of language: consider that there are over 1500 possible human speech sounds, though of course the phoneme inventories of all known languages each comprise only a portion of these; Evans 2009). Is there an explanation that is up to the job?

In the following section I describe and critique Corballis’s recent version of events (Corballis 2017). Corballis is a gestural theorist; indeed, one who sees the importance of delivering a hypothesis about how this stretch of the “Rubicon”—by which he means (in this context) the gap between gesture and speech (see p. 155)—was crossed, evolutionarily speaking.Footnote 16 Corballis emphasises three aspects: he argues that speech is more like gesture than other theorists have supposed; he stresses the role of mirror neurons; and he foregrounds the integration of hand and mouth in eating, connecting the upshot of this with littoral theory (i.e., a version of the aquatic ape hypothesis) to explain increases in volitional vocal production.

Corballis tackles the gesture-speech transition

Corballis points out that characterizing the transition from gesture to speech as one of a shift in communicative modality (i.e., from the visual modality for perceiving gesture, to the auditory modality for perceiving speech) is too simplistic. He argues that gesture and speech are more alike than other theorists have supposed. This seems reasonable. He says:

…speech is itself a system of gestures, made up of movements of the lips, the velum, the larynx, and the blade, body, and root of the tongue. One might suppose, then, that the production of language shifted from one set of gestures to another (Corballis 2017, p. 148).

Moreover, it even seems as though modern humans typically comprehend speech as gestures, rather than as pure auditory patterns. The phonemes we perceive in speech do not map one–one onto sonic profiles. Corballis’s examples include the b sounds in “battle”, “bottle”, “beer”, “bug”, “rabbit”, “flibbertigibbet”. These b sounds probably sound much the same to ordinary listeners in ordinary conditions. However, they have distinct acoustic profiles. So why do we group these b sounds together as a unitary phoneme? Because they are produced similarly by our vocal apparatus. Indeed, sometimes we hear the speech we “see”, in spite of its sonic profile (McGurk and MacDonald 1976).Footnote 17 Of course, it’s not that we are completely insensitive to speech sounds as sounds: ‘both the way we produce speech and the auditory input itself contribute to the way we hear speech’ (Corballis 2017, p. 151). Both listening to speech and visually observing the lip movements of speech excite the motor units that underlie speech production (Watkins et al. 2003).

Corballis also emphasises the role of mirror neurons:

…some mirror neurons in the monkey brain respond to the sounds of actions, such as the tearing of paper or the cracking of nuts. But in monkeys, unlike humans, mirror neurons are deaf to vocalization—they don’t respond to the sounds of other monkeys. [But they do to gesture.] Only later in primate evolution did the mirror system encompass vocal sounds, enabling us to perceive speech in terms of the way it is produced rather than in terms of how it actually sounds. (Corballis 2017, p. 150, text in square brackets is mine).

Corballis points out that we are the only great apes to do this. But how? Here Corballis reaches for the dual-stream theory of speech perception (Hickok and Poeppel 2007; Hickok 2012), which distinguishes the roles played by the dorsal and ventral streams (see Fig. 2). The ventral stream is responsible for understanding speech, and is shared with other animals (consider the ability of trained dogs, apes, and so on, to respond appropriately to distinct human vocalizations). The dorsal stream—part of the mirror system—is responsible for the production of human articulate speech; although it too is shared with other animals, what is absent, at least as far as the other great apes are concerned, is its cooption for vocalization production and learning. (Homologous neural circuitry distinguishes, for example, the vocal-learning songbirds from other birds; Pfenning et al. 2014.) In humans the dorsal stream maps auditory/sonic representations onto articulatory motor representations, thereby coding and maintaining ‘instances of speech sounds, and [using] these sensory traces to guide the tuning of speech gestures so that the sounds are accurately reproduced’ (Hickok and Poeppel 2007, p. 399).

Fig. 2 Dorsal and ventral streams: the upwards arrow indicates the dorsal stream; the downwards arrow indicates the ventral stream. Image by Selket, available at https://commons.wikimedia.org/w/index.php?curid=1679336 (accessed 20 July 2017); reproducible under the terms of the Creative Commons Attribution-ShareAlike License 3.0.

Assuming this framework is on the right track—and it seems to be: despite some criticisms,Footnote 18 the dual-stream framework has been positively influential and is widely accepted in general (Eysenck and Keane 2015)—what remains to be explained is the cooption of the dorsal stream for vocalization production and learning, presumably after the split from the Pan-Homo LCA.

As we will see in the following section, in my view, increasingly intentional vocalization for emotion/affect, intentional listening, vocal grooming, infant-directed crooning, animal call mimicry, and so on (i.e., the mosaic comprising the evolution of “musicality”; Killin 2017) plausibly explains, in time, the cooption of the dorsal stream for intentional speech production and learning. The idea: musicality provided both the medium (the voice) into which articulate communication could transition from gesture, and the means (incremental upgrades in volitional control over the vocal production apparatus). Corballis, however, takes a rather different approach.

Corballis notes that the hand and mouth are integrated, not only in communication, but in eating: ‘More to the point, though, throughout primate evolution the hands and mouth are connected through the process of eating. People and monkeys bring food to the mouth in exquisitely coordinated fashion. Such coupling may well carry over to language’ (Corballis 2017, p. 148). Corballis’s idea is that the primary focus of (proto)language passed from gesture, to the face (e.g., ‘The use of facial expressions as social signals’, p. 155), to vocalization (speech being a “swallowed” facial expression, requiring intentional production of sound), following the passage of food from hand to face to inside the throat. Many of our facial expressions are reactive/emotional/automatic, of course, but we humans are also capable of fine voluntary control over facial movements (and good actors excel at this). Great apes are capable of it too, though to a much lesser extent (i.e., with much less flexible and less fine-tuned control). So, the idea is, over the course of evolution, ‘voluntary communication probably shifted increasingly from the hand to the face’ (p. 155); ‘It was but a small step from the external surface of the face to the movable parts inside it’ (p. 156). But, of course, we cannot see the intentional movements of parts “inside it”, at least not clearly enough for effective communication, so sound had to be produced to stand in as a proxy for a visually perceivable gesture.

Corballis spells out his hypothesis as follows. I quote:

In primates, hand and mouth are closely linked both in the brain and in behaviour. In the motor cortex of the brain, responsible for initiating body movements, the so-called hand area is adjacent to the mouth area. Some neurons in the frontal lobe of the monkey are activated when the animal makes a grasping movement either with the hand or with the mouth… [There is] a close connection between movements of hand and mouth in people as well. If human subjects are told to open their mouths while grasping objects with their hands, the size of the mouth opening increases with the size of the grasped object. [Presumably this is tied to preparing the mouth for the food it is about to receive.] If people are asked to say “ba” while grasping an object, or even while watching someone else grasp an object, the syllable itself is affected by the size of the object grasped. The larger the object, the wider the opening of the mouth, with consequent effects on the speech sounds. Even one-year-old babies show these effects [see e.g. Gentilucci 2003]. These links between hand and mouth probably originated in eating rather than communicating… [i.e., in] preparing the mouth to receive an object after the hand has grasped it, but they were adapted for gestural and finally vocal language (Corballis 2017, pp. 157–158, text in square brackets is mine).

Notice that even if we buy much of this (though there are reasons to be sceptical: for instance, the hands and face are under somewhat independent cortical control, and we might expect there to be more overlap if Corballis’s hand-to-face-to-mouth hypothesis were true), the stretch of the Rubicon between great ape gesture and hominin vocal (proto)language production and learning has not yet been successfully crossed. The above facts about the connection between hand and mouth do not suggest increasing voluntary vocal control, but stimulus control: the size of the object reflexively influences mouth shape. So Corballis’s argument does not yet account for the progressive extension of voluntary control in roughly that direction, from hand to face to vocal tract, let alone for the claim that language transitioned progressively, starting from gesture, adding facial expression, adding voice. So Corballis appeals to littoral theory (a modern variant of the aquatic ape hypothesis; Verhaegen 2013) in order to posit a selection pressure for mouth-related voluntary control—that is, diving. Now, the probable bathing, wading, fishing, and shoreline foraging/shellfish gathering of our ancestors aside (see e.g. Shaw-Williams 2017), maybe some diving happened that required nonstandard intake and retention of air, but even so, the extent to which this would have affected voluntary control of vocal production is extremely unclear. At best, introducing this selection pressure only helps to continue foregrounding greater breath control, not vocal production control. (And Corballis indeed thinks of the aquatic/littoral phase as ‘supporting the voluntary control of vocalization’, p. 163, my emphasis.) I agree that breath control is important to the story. But it alone does not explain the full gamut of vocal language production—including, recall, ‘movements of the lips, the velum, the larynx, and the blade, body, and root of the tongue’ (p. 148)—let alone other evolutionary advances required for greater vocal capacities, such as the enlargement of the hypoglossal nerve (for tongue control), co-evolving vocal and auditory structures (neural and morphological—see Morley 2013), and thoracic vertebrae nerve canal expansion, enhancing intentional control of vocal musculature (MacLarnon and Hewitt 1999). As Kendon notes, ‘specializations for speaking, in regard to both the production of speech and its reception, are complex and extensive’ (Kendon 2016, my emphasis), and they are not adequately accounted for by Corballis.

Since Corballis’s account doesn’t get us to where we want to be, I think that we should shed some of his shakier theoretical commitments and consider another plausible inroad. In my view, an evolving vocal musicality helps to explain the transition from gesture towards vocal speech: it prepared hominins cognitively and anatomically for it. I expand on this hypothesis in the following section.

Musicking across the gap

Ancestral hominins communicating with one another presumably vocalized in combination with their intentional, meaningful gestures,Footnote 19 for instance to command the attention of an individual looking another way, to imbue a gesture with affect (perhaps ‘playful, flirtatious, affiliative, competitive, or agonistic’—to borrow Kendon’s (2016) examples), or to add vocal mimicry (of a mimed animal, say), and so on. In my view, much of this amounts to the addition of aspects of an evolving musicality.

Elsewhere I have developed a theory of Plio-Pleistocene hominin musicality, conceptualised as a mosaic or “package” of traits: increasing top-down control over affective vocalization, finer breath control (supported by anatomical changes), intentional listening, vocal imitation, turn taking, entrainment, lithic sound play, motherese, call mimicry, vocal grooming, and so on (Killin 2017; see also Killin 2016). I suggested that protomusical group behaviours/activities—group-based expressions of musicality not too dissimilar from some of those of ethnographically known foragers—are to be found in the socio-cultural/cognitive developments occurring, incrementally, during the “Late Acheulean”, say from 500 or 400 Kya (Killin 2017). This is based on an argument from hominin socio-cognitive coevolution (big-brained, larger-group, social hominins with more developed emotional lives, centred on hearths); the plausibility of these hominins possessing at least a protoaesthetic sensitivity (consider the 500,000-year-old finely crafted handaxes; see Kohn and Mithen 1999); the upgrades in technological production evidenced around this time (e.g., hafting; see Barham 2013); and, as a proxy, the date from which hearths more commonly and continually served as social magnets, following Gamble et al. (2011) and Gowlett et al. (2012). Moreover, the ethnographic record details many ways of being musical that are simple enough that they would have left no archaeological traces had our ancestors realized them, or something like them.

Yet even prior to this hypothesised date for the emergence of social protomusic, the picture of musicality evolution that I favour has consequences for language evolution and the switch to speech. Some of the musicality ingredients are verbal language ingredients too. As call mimicry, motherese, vocal grooming, sound-based play, intentional listening and the like took hold, the anatomical and cognitive preconditions for speech would evolve. Vocalizations, increasing in complexity and diversity, would be incorporated into protolanguage along with gesture, coming under top-down control, in co-evolutionary tandem. This would be an incremental process, occurring throughout the long stretch of the Pleistocene, resulting in the modifications of the hominin vocal tract that distinguish it from that of the other great apes—specializations for speech.

Our Hominini ancestors split from those of Pan around 6–7 Mya. It is highly likely that their non-arboreal forager lifestyle (from 4 Mya) would have selected for an increase in vocalization use, for conspecific communication and in group defence from predators. Stone tool production and use appears in the archaeological record as far back as 3.3 Mya. It is not much of a stretch to think that vocalizations for coordinating group hunting/scavenging, driving predators from kills, and carcass carving would have occurred; these would also have sent signals to nearby bands. Hunting/scavenging and social tracking (Shaw-Williams 2014) would have honed finer attention to sounds, bringing listening under increasing intentional control. The same is true of tool production: attentive, intentional listening is key both in knapping and in diagnosing raw material for use. As has been suggested by Robin Dunbar (e.g. Dunbar 1996), emotional/affective vocalizations are likely to have entered the picture, supplementing the manual grooming that maintains social bonds in other great apes (intensifying selection for intentional signal control, listening, and coordination between voices), and potentially playing a role in parent–infant bonding and in affecting infant arousal (e.g. “lullaby”-like vocalizations to soothe; “arousal” vocalizations to excite), especially as our female ancestors encountered the obstetrical dilemma and birthed earlier, less developed (and helpless) babies, placing greater demands on care. Vocalizations may have also played a role in courtship (indeed, in social worlds such as those of the Pleistocene, hominins were presumably constantly scrutinizing one another, and vocalizations would have played an inescapable, even if implicit, part in mate selection).

Palaeoanthropological evidence is consistent with this picture of incremental coevolution of vocalization and audition. From around 2 Mya onwards, Homo began to undergo neural and anatomical changes, presumably in tight lockstep: enlargement of the hypoglossal nerve (enabling finer tongue control), coevolving vocal and auditory structures (neural and anatomical), and in time (at least by heidelbergensis) expansion of the thoracic vertebrae nerve canal (enabling greater intentional control of breath and vocal musculature). For review, see Morley (2013).

At some point, as hominin life stages evolved throughout the Pleistocene, infant babbling began—a deeply entrenched, universal behaviour which allows the infant to rehearse finer control over vocalization production through play-like behaviour (Merker 2012), leading to improved intentional control over vocal musculature in adults. Other primate infants don’t babble; among the great apes it is a hominin adaptation.

The various features indicated above point to increasing phonological complexity. For example, call mimicry is a crucial upgrade to the skill set of hunters and foragers, and would have been an effective addition to gestural/pantomimic communication of hunting plans (e.g. Bickerton 2009), as well as to status-securing, pedagogical, or “blowing off steam” hunting stories. Mid-Pleistocene hominins did not merely live in the here-and-now: they were highly proficient medium- to large-game hunters engaging in coordinated, cooperative, planned activities (Bunn and Pickering 2010a, b).

Finally, the production methods of the Late Acheulean industry reveal salient cognitive advances taking place: greater imitation, shared intentionality, greater episodic memory, mental templates, attentive focus, impulse control, greater intentional motor control, and, importantly for language evolution, upgrades in communication and social learning and greater intentional listening (Killin 2017). As our ancestors’ cognitive abilities developed, they would have become better at distinguishing similar vocal sounds and at comprehending/understanding the intentional vocalizations of conspecifics. And from 800,000 years ago, an incremental, though momentous, increase in brain size took place. More effective, multi-modal protolinguistic communication—enabling better foraging and control of fire, teaching and learning, affective and social communication, and cooperation and coordination in general—is almost certainly among the factors driving this final surge in hominin encephalization. At least by heidelbergensis—the ancestor of Neanderthals and modern sapiens—the palaeoanthropological evidence is suggestive of near-modern vocal control and the morphological means to produce a near-modern repertoire of vocal sounds (Morley 2013; Dunbar 2014), which would continue to evolve and come under greater intentional control throughout the long passage to anatomical and cognitive modernity. With such anatomical and cognitive changes occurring from ergaster/erectus to heidelbergensis to sapiens—changes in the domain of affective vocalization, and eventually the emergence of group protomusic—the transition from predominantly gestural to verbal protolanguage is a predictable consequence, given the considerations listed earlier.

Towards arbitrariness: the echo phonology hypothesis

It might be objected at this point that I have not yet provided an adequate explanation of the transition from gesture to speech, since it is hard to see how my account provides a mechanism for a switch from a largely iconic and deictic/indexical form of communication to one in which mappings from symbol to meaning are largely arbitrary. In the context of the present dialectic, however, notice that Corballis’s view does not provide one either; in fact, Corballis’s approach is to downplay the extent to which language is truly “arbitrary”. (Indeed, the extent to which spoken words resemble their referents has been debated at least since Plato’s Cratylus.) However, Corballis’s move here is hardly convincing, in my view. Even if language is less arbitrary than some theorists have supposed,Footnote 20 arbitrariness is nonetheless widespread enough to be an explanandum requiring attention.

Admittedly, the objection that I am entertaining here may be overstated: it has been claimed that once language has gone largely vocal, it automatically goes largely arbitrary as a byproduct of this shift (e.g. Sterelny 2018); that is, there are not two problems—the shift to vocal dominance and the shift to arbitrary sign dominance—but just the first, the shift to vocal dominance. And in any case, iconic and indexical gestures can become conventionalized and thus “arbitrarified” through familiar processes, drifting towards the arbitrary as a result (Tomasello 2008). I am sympathetic to this rendering of the matter; however, in this section I will take more seriously the view of those who require more to be said on the uptake of arbitrariness, and on the gesture–vocal transition in particular. Kendon (2016) puts it thus: ‘Even if we can suggest factors that might have contributed to the elaboration of complexity in vocal (and gestural) expression… the issue of how these gestures acquired symbolic significance still eludes us’. Settling this matter persuasively is not possible in the remaining space, and I suspect that there are a range of mechanisms that tend to push symbol systems towards arbitrariness (Gasser 2004).Footnote 21 I nod towards one plausible answer: the echo phonology evolutionary hypothesis (Woll 2014).Footnote 22

Echo phonology (Woll 2001) is the name given to the class of mouth gestures that obligatorily accompany signed gesture in contemporary sign languages. The mouth gesture is a motoric and visual “echo” of the hand/arm gesture, hence the term “echo phonology” (the mouth gesture is typically not voiced, and is not related to, or derived from, vocal movements for speech production). Consider:

In the BSL sign true… the upper hand moves downwards to contact the lower hand, and this action is accompanied by mouth closure, synchronized with the hand contact… the mouth gesture forms part of the citation form of the manual sign [and does not carry additional meaning; e.g., it does not function to distinguish between gestural homonyms, or add an adverb to a signed action, etc.]… Signs with echo phonology appear incomplete or ill-formed in their citation form if the mouth gesture is not present (Woll 2014, p. 4, text in square brackets is mine; also see Fig. 3).

Fig. 3 Echo phonology: BSL true. Extracted from Capek et al. (2008), reproduced here with permission of MIT Press.

Neural studies indicate that the network activation underlying the processing of signs with echo phonology lies somewhere in between that for manual-sign-only processing and that for the processing of signs accompanied by mouthings which function to disambiguate homonymous signs (the latter more closely resembling the processing of lip movements in proficient speechreaders) (Capek et al. 2008; for further discussion see Woll 2014).Footnote 23 What is impressive about Capek and colleagues’ study is the internal consistency found across the various cases: in each case, ‘the more active region was that which was more involved in processing hand movements than mouth movements’ (Capek et al. 2008, pp. 1231–1232). Woll connects these findings with a plausible evolutionary hypothesis.

Although the mouth gestures of echo phonology are usually not voiced by deaf signers, hearing signers sometimes add an audible vocal component. In these hearing bilinguals (e.g. those fluent in English and BSL), in contexts in which codes are blended, the gestural component of a sign may be dropped and only the mouth component produced, with or without audible voicing (Woll 2014). According to Woll’s evolutionary hypothesis, echo phonology points to a possible means of leaping to ‘a situation where voicing accompanies these mouth gestures so that they begin to have independent existence as lexical items… Echo phonology illustrates a mechanism by which abstract concepts, which can be represented by iconic manual gestures, can be attached to abstract mouth gestures’ (Woll 2014, pp. 5–6). The idea: an expression that our ancestors may have iconically gestured or pantomimed may have been (involuntarily or otherwise) accompanied by the mouth simultaneously performing the action of the hands/arms (the link between hand and mouth in the mirror system presumably having some role in this). Here’s a hypothetical pantomimic/iconic example (indulge me): outstretched forward-pointing arms snapping into one another, like a crocodile’s snout, accompanied by a (voluntary or involuntary) snapping (voiced or silent) of the teeth.Footnote 24 And once familiar, the hand/arm gesture could be omitted (perhaps because the arms are busy pointing in some direction or miming out a route or coordination plan), with the mouth gesture/voicing “taking over” for crocodile. In my view, an evolving musicality provides the impetus for such a transition in general, as well as several plausible upshots (e.g., expanding the possibility space of producible phonemes, providing the physical means for intentional voicings in the absence of the associated manual gesture, enabling coevolution with the hominin auditory channel to better align hearing ranges and vocal ranges).

Of course, the hypothesis is presented here somewhat tentatively. Further investigation into echo phonology and its import to the evolution of language debate is a priority for future research. One key question to investigate is why it happens. Some aspects of sign language may be present because signers so often have to communicate with hearers (even if some of those hearers are also signers), so it would be worth knowing how variable and explicit the presence of echo phonology is across signers who mostly communicate with other deaf signers and those who do not.Footnote 25 Another question is how ancient echo phonology is—including whether and to what extent it accompanies great ape intentional gesture—and how universal it is across human peoples, including whether and to what extent it appears in homesign systems. Needless to say, Woll is optimistic. I’ll give her the final words of this section:

One issue for those concerned with suggesting a link between gesture and word has always been how the arbitrary symbol-referent relationship of words in spoken language could have come from visually-motivated gestures. Echo phonology provides evidence for a possible mechanism. Firstly, the phenomenon appears to be fairly common across different sign languages… Secondly, the mouth actions found in echo phonology are themselves non-visually motivated… Thirdly, the actual inventory of elements in echo phonology looks very much like a system of maximal contrasts in a spoken language phonology…Fourthly, functional imaging research on the representation of signs and words in the brain suggests that echo phonology occupies an interesting intermediate position. (Woll 2014, p. 8).

Summary

I hope to have convinced you that the origins of language, as we know it today, are largely in the gestural domain, continuous with the intentionally communicative behaviours of the great apes. I have evaluated one recent account, put forward by an influential gestural theorist, of how gestural protolanguage might have transitioned to speech. I found that account wanting, and I argued that an independently evolving musicality played a key role in preparing ancient hominins for vocal language. Further research is required to develop this idea and generate testable hypotheses. Nonetheless, I hope to have convinced you of a “proof of concept” (at the least, it fares better against the set of constraints identified than Corballis’s account does). I suggested that further research on echo phonology may shed light on a possible mechanism for the transition towards arbitrariness in speech.