Abstract
Speech contains temporal structure that the brain must analyze to enable linguistic processing. To investigate the neural basis of this analysis, we used sound quilts, stimuli constructed by shuffling segments of a natural sound, approximately preserving its properties on short timescales while disrupting them on longer scales. We generated quilts from foreign speech to eliminate language cues and manipulated the extent of natural acoustic structure by varying the segment length. Using functional magnetic resonance imaging, we identified bilateral regions of the superior temporal sulcus (STS) whose responses varied with segment length. This effect was absent in primary auditory cortex and did not occur for quilts made from other natural sounds or acoustically matched synthetic sounds, suggesting tuning to speech-specific spectrotemporal structure. When examined parametrically, the STS response increased with segment length up to ∼500 ms. Our results identify a locus of speech analysis in human auditory cortex that is distinct from lexical, semantic or syntactic processes.
References
Stevens, K.N. Acoustic Phonetics (MIT Press, 2000).
Poeppel, D., Idsardi, W.J. & van Wassenhove, V. Speech perception at the interface of neurobiology and linguistics. Phil. Trans. R. Soc. Lond. B 363, 1071–1086 (2008).
Scott, S.K., Blank, C.C., Rosen, S. & Wise, R.J. Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123, 2400–2406 (2000).
Hickok, G. & Poeppel, D. The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402 (2007).
Rauschecker, J.P. & Scott, S.K. Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat. Neurosci. 12, 718–724 (2009).
Binder, J.R. et al. Human temporal lobe activation by speech and non-speech sounds. Cereb. Cortex 10, 512–528 (2000).
Liebenthal, E., Binder, J.R., Spitzer, S.M., Possing, E.T. & Medler, D.A. Neural substrates of phonemic perception. Cereb. Cortex 15, 1621–1631 (2005).
Obleser, J., Zimmermann, J., Van Meter, J. & Rauschecker, J.P. Multiple stages of auditory speech perception reflected in event-related fMRI. Cereb. Cortex 17, 2251–2257 (2007).
Wild, C.J., Davis, M.H. & Johnsrude, I.S. Human auditory cortex is sensitive to the perceived clarity of speech. Neuroimage 60, 1490–1502 (2012).
Giraud, A.L. et al. Contributions of sensory input, auditory search and verbal comprehension to cortical activity during speech processing. Cereb. Cortex 14, 247–255 (2004).
Obleser, J., Eisner, F. & Kotz, S.A. Bilateral speech comprehension reflects differential sensitivity to spectral and temporal features. J. Neurosci. 28, 8116–8123 (2008).
Zatorre, R.J. & Belin, P. Spectral and temporal processing in human auditory cortex. Cereb. Cortex 11, 946–953 (2001).
Schönwiesner, M., Rübsamen, R. & von Cramon, D.Y. Hemispheric asymmetry for spectral and temporal processing in the human antero-lateral auditory belt cortex. Eur. J. Neurosci. 22, 1521–1528 (2005).
Boemio, A., Fromm, S., Braun, A. & Poeppel, D. Hierarchical and asymmetric temporal sensitivity in human auditory cortices. Nat. Neurosci. 8, 389–395 (2005).
Overath, T., Kumar, S., von Kriegstein, K. & Griffiths, T.D. Encoding of spectral correlation over time in auditory cortex. J. Neurosci. 28, 13268–13273 (2008).
Overath, T., Zhang, Y., Sanes, D.H. & Poeppel, D. Sensitivity to temporal modulation rate and spectral bandwidth in the human auditory system: fMRI evidence. J. Neurophysiol. 107, 2042–2056 (2012).
Greenberg, S. A multi-tier framework for understanding spoken language. in Listening to Speech: An Auditory Perspective (eds. Greenberg, S. & Ainsworth, W.A.) 411–433 (Lawrence Erlbaum, 2006).
Rosen, S. Temporal information in speech: acoustic, auditory and linguistic aspects. Phil. Trans. R. Soc. Lond. B 336, 367–373 (1992).
Efros, A.A. & Leung, T.K. Texture synthesis by non-parametric sampling. in IEEE Int. Conf. Comp. Vis. 1033–1038 (1999).
Grill-Spector, K. et al. A sequence of object-processing stages revealed by fMRI in the human occipital lobe. Hum. Brain Mapp. 6, 316–328 (1998).
Lerner, Y., Honey, C.J., Silbert, L.J. & Hasson, U. Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. J. Neurosci. 31, 2906–2915 (2011).
Pallier, C., Devauchelle, A.-D. & Dehaene, S. Cortical representation of the constituent structure of sentences. Proc. Natl. Acad. Sci. USA 108, 2522–2527 (2011).
Abrams, D.A. et al. Decoding temporal structure in music and speech relies on shared brain resources but elicits different fine-scale spatial patterns. Cereb. Cortex 21, 1507–1518 (2011).
Giraud, A.L. et al. Representation of the temporal envelope of sounds in the human brain. J. Neurophysiol. 84, 1588–1598 (2000).
Harms, M.P., Guinan, J.J., Sigalovsky, I.S. & Melcher, J.R. Short-term sound temporal envelope characteristics determine multisecond time patterns of activity in human auditory cortex as shown by fMRI. J. Neurophysiol. 93, 210–222 (2005).
McDermott, J.H. & Simoncelli, E.P. Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron 71, 926–940 (2011).
Shannon, R.V., Zeng, F.G., Kamath, V., Wygonski, J. & Ekelid, M. Speech recognition with primarily temporal cues. Science 270, 303–304 (1995).
Davis, M.H. & Johnsrude, I. Hierarchical processing in spoken language comprehension. J. Neurosci. 23, 3423–3431 (2003).
Fedorenko, E., Hsieh, P.J., Nieto-Castanon, A., Whitfield-Gabrieli, S. & Kanwisher, N. New method for fMRI investigations of language: defining ROIs functionally in individual subjects. J. Neurophysiol. 104, 1177–1194 (2010).
Lashkari, D., Vul, E., Kanwisher, N.G. & Golland, P. Discovering structure in the space of fMRI selectivity profiles. Neuroimage 50, 1085–1098 (2010).
Formisano, E., De Martino, F., Bonte, M. & Goebel, R. “Who” is saying “what”? Brain-based decoding of human voice and speech. Science 322, 970–973 (2008).
Mesgarani, N., Cheung, C., Johnson, K. & Chang, E.F. Phonetic feature encoding in human superior temporal gyrus. Science 343, 1006–1010 (2014).
Kanwisher, N., McDermott, J. & Chun, M.M. The fusiform face area: a module in human extrastriate cortex specialized for face perception. J. Neurosci. 17, 4302–4311 (1997).
Price, C., Thierry, G. & Griffiths, T. Speech-specific auditory processing: where is it? Trends Cogn. Sci. 9, 271–276 (2005).
Schirmer, A., Fox, M.P. & Grandjean, D. On the spatial organization of sound processing in the human temporal lobe: a meta-analysis. Neuroimage 63, 137–147 (2012).
Ghitza, O. On the role of theta-driven syllabic parsing in decoding speech: intelligibility of speech with a manipulated modulation spectrum. Front. Psychol. 3, 238 (2012).
Rauschecker, J.P. Cortical processing of complex sounds. Curr. Opin. Neurobiol. 8, 516–521 (1998).
Norman-Haignere, S., Kanwisher, N. & McDermott, J.H. Cortical pitch regions in humans respond primarily to resolved harmonics and are located in specific tonotopic regions of anterior auditory cortex. J. Neurosci. 33, 19451–19469 (2013).
Belin, P., Zatorre, R.J., Lafaille, P., Ahad, P. & Pike, B. Voice-selective areas in human auditory cortex. Nature 403, 309–312 (2000).
Liebenthal, E., Desai, R.H., Humphries, C., Sabri, M. & Desai, A. The functional organization of the left STS: a large scale meta-analysis of PET and fMRI studies of healthy adults. Front. Neurosci. 8, 289 (2014).
Peelle, J.E. The hemispheric lateralization of speech processing depends on what “speech” is: a hierarchical perspective. Front. Hum. Neurosci. 6, 309 (2012).
Cogan, G.B. et al. Sensory-motor transformations for speech occur bilaterally. Nature 507, 94–98 (2014).
McGettigan, C. et al. An application of univariate and multivariate approaches in FMRI to quantifying the hemispheric lateralization of acoustic and linguistic processes. J. Cogn. Neurosci. 24, 636–652 (2012).
Voss, R.F. & Clarke, J. 1/f noise in music and speech. Nature 258, 317–318 (1975).
Attias, H. & Schreiner, C.E. Temporal low-order statistics of natural sounds. in Advances in Neural Information Processing Systems, Vol. 9 (eds. Mozer, M.C., Jordan, M.I. & Petsche, T.) 27–33 (MIT Press, 1997).
Meyer, M., Alter, K., Friederici, A.D., Lohmann, G. & von Cramon, D.Y. fMRI reveals brain regions mediating slow prosodic modulations in spoken sentences. Hum. Brain Mapp. 17, 73–88 (2002).
Humphries, C., Sabri, M., Lewis, K. & Liebenthal, E. Hierarchical organization of speech perception in human auditory cortex. Front. Neurosci. 8, 406 (2014).
Turken, A.U. & Dronkers, N.F. The neural architecture of the language comprehension network: converging evidence from lesion and connectivity analyses. Front. Syst. Neurosci. 5, 1 (2011).
Lau, E.F., Phillips, C. & Poeppel, D. A cortical network for semantics: (de)constructing the N400. Nat. Rev. Neurosci. 9, 920–933 (2008).
Petkov, C.I., Logothetis, N. & Obleser, J. Where are the human speech and voice regions, and do other animals have anything like them? Neuroscientist 15, 419–429 (2009).
Desmond, J.E. & Glover, G.H. Estimating sample size in functional MRI (fMRI) neuroimaging studies: Statistical power analyses. J. Neurosci. Methods 118, 115–128 (2002).
Moulines, E. & Charpentier, F. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun. 9, 453–467 (1990).
Brainard, D.H. The psychophysics toolbox. Spat. Vis. 10, 433–436 (1997).
Friston, K.J. et al. Statistical parametric maps in functional imaging: a general linear approach. Hum. Brain Mapp. 2, 189–210 (1995).
Rademacher, J. et al. Probabilistic mapping and volume measurement of human primary auditory cortex. Neuroimage 13, 669–683 (2001).
Westbury, C.F., Zatorre, R.J. & Evans, A.C. Quantifying variability in the planum temporale: a probability map. Cereb. Cortex 9, 392–405 (1999).
Brett, M., Anton, J.-L., Valabregue, R. & Poline, J.-B. Region of interest analysis using an SPM toolbox (abstract). Neuroimage 16 (suppl. 2) (2002).
Acknowledgements
The authors thank K. Doelling for assistance with data collection, G. Lewis for extensive help with visualization of the results using FreeSurfer, E. Fedorenko for assistance with the parcellation algorithm, D. Ellis for implementing the PSOLA algorithm for segment concatenation, the volunteers who kindly allowed us to record their speech, T. Schofield, N. Kanwisher and J. Golomb for helpful discussions, and N. Ding, A.-L. Giraud, E. Fedorenko, S. Norman-Haignere and J. Simon for helpful comments on earlier drafts of the manuscript. This work was supported by US National Institutes of Health grant 2R01DC05660 to D.P., a GRAMMY Foundation Research Grant to J.M.Z., and a McDonnell Scholar Award to J.H.M.
Author information
Authors and Affiliations
Contributions
T.O., J.H.M., J.M.Z. and D.P. designed the experiments, interpreted the data and wrote the manuscript. J.H.M. designed the quilting algorithm and generated the stimuli. T.O. and J.M.Z. acquired the fMRI data. J.H.M. acquired the behavioral data. T.O. and J.H.M. analyzed the data.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Glassbrain projection of the group L30 > L960 functional localizer contrast.
The results are shown at a statistical significance threshold of p < 0.001 (uncorrected for multiple comparisons).
Supplementary Figure 2 Frequency power spectra for speech and modulation control sounds.
Frequency power spectra for speech (solid) and modulation control sounds (dashed) quilted with 30 ms (red) and 960 ms (blue) segment durations.
Supplementary Figure 3 Replicability across scanning sessions.
Replicability across scanning sessions for the 12 participants who were scanned between two and four times. The graphs plot the BOLD response, normalized by the response to the 960 ms functional localizer condition, in each participant's right- and left-hemisphere individual fROIs (red and blue, respectively) for each scanning session (dashed, dash-dotted, dotted lines) that included the original speech quilts (i.e., not compressed speech quilts). The solid line plots the average across scanning sessions. Note that the majority of participants exhibit a plateau at around the 480 ms segment duration.
Supplementary Figure 4 Clustering analysis of voxel response profiles (ref. 30).
a) A permutation test revealed that only the top-ranked cluster (out of nine) was statistically significant (red line segment). b) Response profile of the top-ranked cluster for the eight experimental conditions (L30, L960, S30 to S960). Note that the data were mean-centered, which is why the response profile is negative for conditions yielding a low response. c) Rendering of this cluster on coronal cross-sections of our participants' average structural image (y = -38, -30, -22, -14, -6, 2, 10), thresholded at a voxel-to-functional-system assignment probability of r ≥ 0.7.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–4 and Supplementary Tables 1 and 2 (PDF 3438 kb)
Rights and permissions
About this article
Cite this article
Overath, T., McDermott, J., Zarate, J. et al. The cortical analysis of speech-specific temporal structure revealed by responses to sound quilts. Nat Neurosci 18, 903–911 (2015). https://doi.org/10.1038/nn.4021
This article is cited by
- Auditory-motor synchronization and perception suggest partially distinct time scales in speech and music. Communications Psychology (2024)
- Spontaneous emergence of rudimentary music detectors in deep neural networks. Nature Communications (2024)
- A modality-independent proto-organization of human multisensory areas. Nature Human Behaviour (2023)
- Model metamers reveal divergent invariances between biological and artificial neural networks. Nature Neuroscience (2023)
- Joint, distributed and hierarchically organized encoding of linguistic features in the human auditory cortex. Nature Human Behaviour (2023)