Teaching and learning science presuppose conversational forms—there are no science lessons, no teacher training, and no in-service workshops without talk. Conversations in and oriented toward the concrete realization of cultural-historically developed forms of activities constitute the crucial location where individual members contribute to producing and reproducing society as we know and experience it in our everyday lives. That is, society comes to life in and through talk in institutions and institutional talk. Face-to-face encounters and the conversations they give rise to are important, because they are the sites that mediate between individual and collective emotions that serve as the fuel for social life. In co-presence, participants in face-to-face encounters not only provide each other with information but “any message that an individual sends is likely to be qualified and modified by much additional information that others glean from him simultaneously, often unbeknownst to him” (Goffman 1963, p. 15). This other information includes facial and other bodily displays and, important here, a variety of prosodic means, such as prosodic orientation (Szczepek Reed 2006). These additional means allow the co-articulation and co-communication of emotions because face-to-face encounters incorporate interaction rituals “in which participants develop a mutual focus of attention and become entrained in each other’s bodily micro-rhythms and emotions” (Collins 2004, p. 47). In this study, we understand face-to-face conversations in science lessons as special interaction rituals that lead to the production and reproduction of macrolevel structures (e.g., schooling). In these rituals, participants draw on microlevel resources that mediate the outcomes of these rituals, including emotions/emotional energy and solidarity.

Whereas there is a considerable number of studies of face-to-face transactions and prosody, the literature on the pragmatic use of prosody for structuring social life in everyday encounters is limited. In particular, there is a dearth of research concerning the use of prosody in situations characterized by institutional differences in power crossed with other differences of sociological import such as gender and culture. The purpose of the present study is to report on the role of prosody in intra- and intercultural communication in science classes taught in low-performing (as per state assessment), inner-city high schools. In this study we provide evidence from the analysis of ethnographic data and prosody that successful and unsuccessful lessons (as indicated by outcome measures such as quizzes and tests) are associated with the production and reproduction of prosodic alignment and misalignment. Ethnographic evidence shows that in classes and teacher arrangements where we observe alignment in prosody, participants report feeling a sense of solidarity, whereas where the prosodies of speakers are misaligned, participants report the presence of conflict. Ultimately, therefore, our study shows not only how school science is reproduced but also how, at the heart of an often-dysfunctional educational system, positive emotional climates, which are requisites for the occurrence of successful science teaching and learning, are produced.

The sociology of face-to-face encounters

In this study, we take as the relevant unit of analysis the sequential organization of actions that reproduce and transform social life. Early anthropological work suggests that when persons engage each other, they tend to regulate, synchronize, and adapt their conduct—that is, they act consciously in relation to one another. Sociologists, too, have held that transaction participants tend to harmonize their actions both at macroscopic (observable) and microscopic (unconscious) levels: “established by the reciprocal sharing of the Other’s flux of experience in inner time, by living through a vivid present together” (Schutz 1971, p. 173), which leads to an experiencing of the togetherness as a “We.” In his classic article on making music together, Schutz establishes a “mutual tuning-in relationship” as musicians warm up, establishing a beat and rhythmic entrainment. Human emotions regulate the “focus and flow of encounters with others, while developing commitments to the culture and structure of corporate and categoric units and the macrolevel institutional systems built from these mesolevel structures” (Turner 2002, p. 76). We show that specific prosodic features in face-to-face encounters—alignment and misalignment—are associated with the production of solidarity and conflict, which in turn are associated with successful and unsuccessful lessons. They are also associated with different degrees of solidarity and emotional energy that participants in science classrooms experienced. In the following three subsections, we articulate the relevant background of our study that situates itself at the intersections of different domains of interests: (a) microsociology and emotions, (b) the relationship of emotion and prosody, and (c) the relation of prosody, power, and control.

Microsociology and emotions

Interaction rituals are understood to have bodily co-presence, barriers to outsiders, mutual focus of attention, and shared mood (transient emotional stimulus) as their ingredients. Two of these components in particular, mutual focus of attention and transient emotions, are intensified through feedback and rhythmic entrainment. The ritual outcomes include solidarity, emotional energy in individuals, symbols of social relationships, and standards of morality. Because interaction rituals are linked in chains, we understand the outcomes of previous interaction rituals to be resources of agency in subsequent interaction rituals. In this study, we focus on prosody, the variations in the sound stream speakers produce in talking, as information they give off, that is, information they provide generally without choosing. It turns out that as information, prosody constitutes a resource in and for face-to-face communication, where it is an expression of, and has specific correlations with, transient emotions (Scherer 1989). Our study provides evidence for the existence of feedback and entrainment phenomena, which we suggest to be possible sources for the solidarity and conflict that our ethnographic observations and interviews with participants reveal.

When verbal exchanges occur in a cultural field such as a school classroom, a great deal can be at stake and participants likely have multiple goals that are salient concurrently. Accordingly, if and when, as happened in this study, a teacher and a student offer competing models of explaining something, they may both take a stand on their respective points of view. Because speaking also gives off information, teachers and students communicate their transient emotions associated with taking a stand and with respect to being challenged. Much of this additional information comes from prosody, by means of which speakers make transient emotions available to one another (Roth and Middleton 2006). Therefore, if an African American student chooses to describe to the whole class a model for chemical valence, she is probably concerned with more than testing whether or not her model is robust. In this student’s African American culture, earning and maintaining the respect of her peers is usually a central concern, as are self-conceptions and forms of identity more generally. Hence, face-to-face transactions, in which oppositional cultures and identities are expressed in linguistic codes, are sites for the production of social capital in that material and schematic structures dynamically unfold. Participants therefore accrue capital consciously and unconsciously: they may lose face, or they may simultaneously gain social capital among their friends and lose it with respect to the school and schooling.

Emotions are the key to understanding the unfolding of interaction rituals during which not only individual lessons but also schooling are continuously produced and reproduced. From a phenomenological perspective of the individual social actor, emotions are integral to the way in which human beings orient to the (social, natural) world, where the emotional relief never is given a priori but always is the result of a continually renewed appreciation of the situation. Emotions are the very foundation of otherness, defining at once some of the most fundamental properties of social systems and some of the differences that define social structure (Scheler 1933). Both transient emotions and long-term emotional valences are expressed in and through the sound stream produced in talking. Here, “the phonetic ‘gesture’ brings about, both for the speaking subject and for his hearers, a certain structural co-ordination of experience, a certain modulation of existence, exactly as a pattern of my bodily behavior endows the objects around me with a certain significance both for me and others” (Merleau-Ponty 1945, p. 225). But speakers always assume a position in a world that already is shared with others, so that the force of speech is something real that comes to speakers from the audiences to which they are oriented and whose alignment therefore is of a “mechanical sort.” Various non-conscious and unconscious aspects of communicative actions are possible carriers of social alignment. Historically, scholars have attended to the visually provided information participants give off, but, as we show in this article, auditory aspects may have an important and yet-to-be-studied role in the pragmatics of interaction rituals.

Emotions are understood as complex response dispositions for engaging in certain classes of adaptive conduct. The different emotions are characterized by distinctive states of physiological arousal—negative and positive feelings associated with affective states, stimulation, and patterns of expressive reactions. For example, fear and anger are said to energize persons to engage in urgent actions when they are facing physical dangers and bodily harm or threats from other people; and satisfaction not only allows people to rest but also to strive for important goals (Kemper 1987). In both these examples, emotions (a) have a short-term, momentary regulative function of facilitating the unconscious elements that constitute goal-oriented actions and (b) are cultural resources—which we understand in terms of their role in a structure | agency dialectic—that mediate the selection of long-term goals and participation in particular forms of societal activities. Thus, face-to-face encounters not only presuppose emotions but also produce and transform them. For example, in power/status models of emotional mediation of micro-transactions, the activation of negative or positive emotions is related to whether anticipated relative status or power relations are realized (Kemper and Collins 1990). A person may become defensive to negative responses from others, and activate a defense mechanism because the self is threatened. These emotions not only drive but also exhibit themselves in face-to-face encounters and thereby constitute structural resources and constraints available to the agency of any or all participants in an encounter. That is, we (the authors) see in emotion an aspect of human culture that squarely fits within existing approaches to cultural sociology.

Emotions and prosody

There is considerable psychological research on the recognition of emotion. One meta-analytic study (involving 87 articles, 97 studies, and 182 independent samples) shows that emotions are recognized above chance within and across cultures (Elfenbein and Ambady 2002) and with levels of accuracy that differ across channels (e.g., face and voice) and across emotions (e.g., happiness and anger). An important channel for communicating emotion is the sound stream produced in speaking. (There are other channels that sometimes are more important for conveying information. For example, happiness is most accurately identified in facial expression, and anger is the emotion most accurately recognized in the voice.) It is now well established that speech carries, in addition to semantic content, information about the speaker’s intentions and emotional state and that listeners are capable of perceiving this information without having to stop and reflect about it.

The fundamental frequency of phonation (in the literature referred to as pitch or F0), speech intensity, and speech rate individually and in combination constitute information about the current emotional state of the speaker. Pitch-related parameters, including short-term perturbations, long-term variability, and mean value, are among the measures often reported to correlate with levels of speaker emotional stress, either task-induced or in real-life emergencies (e.g., Roth 2007a). Intact pitch information, including gross changes and fine temporal structure, is crucial for the accurate identification of emotions. There are indications that mean and maximum pitch levels and pitch contours are the most salient pitch-related measures that convey emotional information. Anger in particular appears to be encoded through the fundamental frequency (pitch), though speech intensity often correlates with anger (Pittam and Scherer 1993), such as in shouting. The research suggests that anger also is associated with increases in the incidence of high-frequency speech components, downward-directed F0 contours, and increasing articulatory rates.
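The pitch-related measures named above (mean value, maximum, long-term variability, short-term perturbation) can be made concrete with a small sketch. It assumes that an F0 contour has already been extracted by some pitch tracker as a list of values in hertz, with unvoiced frames marked as `None`. The function name `summarize_f0` and the relative-perturbation formula (mean absolute frame-to-frame difference divided by mean F0) are illustrative assumptions, not the measures used in the studies cited.

```python
from statistics import mean, pstdev


def summarize_f0(contour_hz):
    """Summarize an F0 contour given in Hz, with None for unvoiced frames."""
    voiced = [f for f in contour_hz if f is not None]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    # Short-term perturbation: compare only frames that are adjacent
    # in the original contour and that are both voiced.
    steps = [abs(b - a) for a, b in zip(contour_hz, contour_hz[1:])
             if a is not None and b is not None]
    mean_f0 = mean(voiced)
    return {
        "mean_hz": mean_f0,
        "max_hz": max(voiced),
        "sd_hz": pstdev(voiced),  # long-term variability
        "perturbation": (mean(steps) / mean_f0) if steps else 0.0,
    }


# Hypothetical contour with one unvoiced frame in the middle.
example = summarize_f0([200.0, 210.0, None, 190.0, 200.0])
```

Summaries of this kind describe a single speaker. Comparing them across speakers or across turns is a separate analytic step.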

From our perspective, extant studies of the correlation of prosody and emotion have two major shortcomings: First, they are largely laboratory studies and therefore do not address the irremediably contingent, ongoing pragmatic deployment of prosody in the conduct of everyday encounters; and second, these studies consider prosody and the emotions it conveys independent of the particulars of the social action, stance, stake, and so on that are relevant to the participants. Studies limited in these ways fail to acknowledge that since the early part of the last century emotions have been recognized as constitutive of societal, motive-driven activities, individual goal-directed actions, and unconscious, contextually conditioned operations. In this study we are interested in how prosody is pragmatically deployed in face-to-face interactional work that brings about successful and unsuccessful lessons.

Prosody, power, and control

Prosody is a transactional resource for arguing and negotiating differences. In a study of children playing hopscotch, prosodic features such as increased loudness and significant increases in pitch were resources for articulating opposition and difference (Goodwin et al. 2002). Opposition was marked by extended vowel length, raised pitch on negatives, and distinctive pitch contours. There were differences between African American and Latina girls, whereby the latter used dramatic intonation contours and the former used less extreme pitch maxima, less pronounced pitch contours, and fewer durational expansions.

Particular phenomena of societal transactions are associated with prosodic features of communication. Thus, abrupt-joins—i.e., within-turn changes in the trajectory of conversational topics—are associated with increased pitch, increased speech rate (temporal compression), continuity in articulation and close temporal proximity, and increased loudness across the pre-abrupt-join and post-join talk (Local and Walker 2004). Pragmatically, the abrupt-join constitutes work, the outcomes of which are resources that secure a speaker space for more talk beyond the completion of the unit with the pre-abrupt-join utterance. As a result, a speaker can achieve a multi-unit turn at talk. The prosodic features also can contribute to securing the shift in the transition to another topic within the same turn at talk.

One study shows how during the production and reproduction of interviews (involving a variety of celebrities and CNN’s journalist Larry King) differences in the power and status of the interviewee are associated strongly with convergent features of their pitches (F0) (Gregory 1999). Interviews rated (by undergraduate students) as high in power and potency, on the basis of such factors as loudness, dominance, toughness, aggressiveness, hardness, activity, strength, and intensity, also show the highest convergence values with respect to the pitch levels of the participants’ speech utterances. These results suggest that as the interviewer and interviewee adjust to relative power or status differences, their F0 (pitch) levels converge. It appears that pitch levels and contours contain important social information that transaction participants use as resources for conducting activities and for accommodating to status differences (Szczepek Reed 2006). In a similar vein, the analyses of tape-recorded interviews with enlisted airmen show not only alignment of prosodic parameters, most notably pitch levels, but also an increase in the quality of the interviews, which generally was associated with the dominance of one interview partner (Gregory et al. 1993).
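One simple way to picture the convergence of pitch levels described above is to track the distance between two speakers' mean F0 values over successive stretches of talk. The sketch below is an illustration under stated assumptions, not the procedure used in the studies cited. It assumes per-segment mean F0 values in hertz for each speaker, expresses their distance in semitones, and treats a negative least-squares slope of that distance over segment index as a sign of convergence. The names `semitone_distance` and `convergence_slope` and all numeric values are hypothetical.

```python
import math


def semitone_distance(f0_a, f0_b):
    """Absolute distance between two F0 values (Hz), in semitones."""
    return abs(12.0 * math.log2(f0_a / f0_b))


def convergence_slope(speaker_a_hz, speaker_b_hz):
    """Least-squares slope of semitone distance over segment index.

    A negative slope means the two pitch levels move closer together.
    """
    if len(speaker_a_hz) != len(speaker_b_hz) or len(speaker_a_hz) < 2:
        raise ValueError("need two equally long series of at least two segments")
    dist = [semitone_distance(a, b) for a, b in zip(speaker_a_hz, speaker_b_hz)]
    n = len(dist)
    x_mean = (n - 1) / 2.0
    y_mean = sum(dist) / n
    num = sum((i - x_mean) * (d - y_mean) for i, d in enumerate(dist))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Hypothetical values: speaker A drifts toward speaker B over four segments.
slope = convergence_slope([240.0, 230.0, 220.0, 210.0],
                          [200.0, 200.0, 200.0, 200.0])
```

Semitones are used here so that the same ratio between two pitch levels counts as the same distance regardless of how high or low the voices are. Whether a given slope is meaningful would still have to be judged against the ethnographic record.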

Methods

In this study, our aim is to understand how certain social events in science lessons of differing degrees of success arise from face-to-face transactions and interaction ritual chains. Like others interested in the emergence of sociological features from mundane face-to-face transactions, we combine careful ethnographic work with conversation analysis, which we extend through the computer-aided analysis of prosodic features. The data for our study derive from a 7-year ethnographic effort to understand teaching, learning, and learning to teach in an innovative approach of preparing teachers for working in inner-city neighborhood schools. In the particular situation, we investigated coteaching, an organization of the usually solitary teaching job such that two or more teachers take full responsibility for all aspects of teaching a particular course. In our case, coteaching was used at an Ivy League school as the model for science teacher preparation. New teachers, as the teacher interns are referred to, were assigned to not-too-distant inner-city schools, and in especially large numbers to City High, an urban school located near our university in a large city of the northeastern US, where they taught in configurations of two to four, sometimes on their own, but often in direct arrangements with a resident teacher.

Institutional and ethnographic context

City High has more than 2,000 students, 98 percent of whom are of African American descent and more than 90 percent of whom are from poverty-stricken or working-class families. City High is organized into small learning communities, schools within the school, each including about 200 students and built around a different core idea that organizes the curriculum (e.g., health, sports, or science and technology). The average daily attendance rate is 72 percent, that is, on any given day more than one-quarter of the students are not at school for a variety of reasons often directly related to poverty and other home conditions. Achievement on state-level examinations is consistently below average.

The research at City High, which is part of a larger study of science education in urban high schools, commenced in 1998 when a relatively small group consisting of the authors of this paper and several graduate students began to study the teaching and learning of science in urban high schools. The research group was dynamic, consisting of a core of university- and school-based researchers and others who joined the group for periods ranging from weeks to as long as a year. Steadily the size of the research group grew to include different categories of investigators—university professors, visiting scholars, post-doctoral scholars, doctoral students, and numerous school-based researchers including resident teachers, new teachers, high school students, and, in one instance, a school administrator. The methods employed included ethnography and microanalysis using digitized video.

Learning to teach through coteaching

Coteaching is a method to induct new teachers into teaching by working at the elbow of a more senior teacher. This study was designed to understand the microprocesses that lead to successful and unsuccessful coteaching, where we generally observe fluidity after the participants have worked closely together for several months.

Data sources

In this interdisciplinary study, we draw on a variety of qualitative research methods appropriate in school contexts, including ethnography, conversation analysis, and microsociological approaches to studying social practices. In addition to writing the usual observational, methodological, and theoretical field notes, we videotape lessons and dialogue with the teachers and students, interview students and (new) teachers, audiotape interviews conducted by high school student research assistants among their peers, and collect the teaching-related discussions new teachers held using an online forum. Teachers often are equipped with recorders to ensure that their talk is captured as they move about in the classroom, and we also place recorders on various student desks to allow us to record as many contributions to whole-class conversations as possible.

We draw our specific examples from the studies of two sets of new teachers assigned in consecutive years to coteach at City High with the same resident science teacher who, in this paper, is referred to by the pseudonym Alex. Of Cuban–African origin, Alex was in his second and third years of teaching at City High—years in which he was regarded by students and colleagues as an effective science teacher. However, he was not always effective in this way. During his first year at City High Alex experienced significant difficulties in teaching his classes despite 5 years of successful science teaching experience in urban schools in the Southeast of the US, and his cultural history that included living in large metropolitan areas in the Northeast.

Chris (white), the new teacher involved in the first ethnographic study featured below, was seeking certification in biology and chemistry. He had a B.Sc. in biology and after 2 years of graduate work in biology, decided to pursue a teaching career. The class involved in this ethnography was a general chemistry class for tenth-grade students. The class was situated in a small learning community in which most students were low performing and not seeking university admission. Passing the course satisfied one of four science credits required by the school district for high school graduation. The class had 24 students (12 males and 12 females), but the actual numbers on any particular day varied considerably given the high rates of absenteeism. These students had already had science with Alex during the previous year. Perhaps for this reason, a minimal number of behavior problems were observed in this class. Nevertheless—likely mediated by poor attendance related to home and other out-of-school situations—student grades generally were poor, and most students were quite weak in mathematics and reading, as indicated by their very low reading and mathematics scores on state standardized tests. The excerpts presented in this paper derive from a lesson that occurred about 3 months into the coteaching of Chris and Alex. They had planned to have a brief preparation period, in which they would prepare students for laboratory activities in which they would construct and conduct measurements with electrochemical cells.

The second ethnography involves Victoria (Filipina-American) and Jessica (white), both with strong backgrounds in science and seeking certification to teach chemistry. This research occurred in the year following the ethnography with Alex and Chris. The small learning community in this study was for students with an interest in college admission after their completion of high school. Coteaching occurred in the chemistry class Alex taught during the first period of the day, which, scheduled in a block, met daily for 96 min during the fall semester. The class of 29 students had approximately equal numbers of males (15) and females (14), and consisted mostly of African American students (24) in their junior year. The classroom was spacious, the result of combining two formerly separate classrooms. The students’ individual desks were arranged centrally to face what was originally the rear of both rooms during instruction time and laboratory tables were situated to the left and right of that central area. A demonstration desk and chalkboards were behind the students and inaccessible as instructional tools. At the new front of the class, there were two large rectangular whiteboards adjacent to one another and next to them Alex strategically positioned an equally large periodic table. An overhead projector and a pull-down screen were situated between the two whiteboards. This structural arrangement of resources afforded a particular style of whole class transactions that invited the coteachers to coordinate their verbal utterances and gestures to what they wrote on the whiteboards, projected onto the screen, and found salient on the periodic table.

Data analyses: general

Each researcher of the larger research team had access to a basic tool kit that included a computer loaded with video editing software, video and audio recorder, digitizer to convert analogue video to digital files, and external hard drives on which to store large files. All researchers had access to data resources from a databank that was maintained by a project manager in a central research office. A password-protected website was maintained for storing video- and audiotapes, transcripts, analytic memoranda, and drafts of papers. Transcripts of video vignettes and audio files were shared upon request. To facilitate cross-school comparisons, the full research group met for 2 h once a week. As data resources were produced they were made available to all members of the research group. For example, Alex has a hard drive with copies of all videotapes produced in his classroom and he uses these resources in his own ongoing research (conducted to obtain a doctoral degree).

Any member of the research group could identify excerpts from the videotape that were worthy of further investigation. These segments were produced as individuals and groups watched tapes and saved salient vignettes that varied in length from approximately 30 s to 3 min. The resources used in this study were identified as salient in different ways. Those involving Chris and Alex arose from ongoing ethnography in which it was observed that when teachers cotaught they became like one another by being with one another. Some of the common features included verve and the way the teachers spoke. Ethnographically it was evident that Chris and Alex were becoming like the other (Roth et al. 2005) and we endeavored to produce compelling data through microanalysis to test the assertion that Chris and Alex had similar prosodic features. From our field notes we had observed that Alex used terms such as “really really” and “right right” in distinctive ways. We also observed that Chris began to use these terms in what sounded like the same way. Accordingly, we searched the database for segments in which Alex and Chris used these terms, while teaching together and separately.

In the analyses below, an episode involving the student Mirabelle and the new teacher Victoria features prominently. We selected it because two student researchers identified the vignette as an example of what they considered to be effective student learning. The two students extracted the vignette and made it available to the research group, which undertook intensive analyses. The reasons offered by the student researchers to support their selection were that Mirabelle had a “trick” to work out chemical valence using the periodic table. By sticking to her guns, she showed the class that her theory of valence did not always work. The teacher and the class collaborated to produce opportunities for all students to learn more about chemical valence and periodicity.

Until 2003, the researchers from City High, including Alex, a school administrator, several student researchers, the university professors, and graduate students, met at least 3 days a week at City High and in the research office. Since then, because the research at the City High site is ongoing, the researchers (including several of those who were student researchers) communicate as necessary using email, telephone, and videoconferencing. Face-to-face meetings also continue. Because Alex and the school administrator from City High are completing doctoral dissertations based on their research on the teaching and learning of science at City High, they meet several times a week to collaborate on their research and they also meet regularly with the second author of this paper, who is their supervisor.

Data analyses specific to this study

In this study, we used Interaction Analysis, “an interdisciplinary method for the empirical investigation of the interaction of human beings with each other and with objects in their environment. The research investigates human activities such as talk, nonverbal interaction, and the use of artifacts and technologies, identifying routine practices and problems and the resources for their solution” (Jordan and Henderson 1995, p. 39). We begin by analyzing data collectively, in sessions lasting about 3 h, working second by second through the tapes. Our analyses proceed in the following way. The researcher currently at the controls runs the video until someone (we allow others to participate in such sessions for the purpose of learning to analyze data sources) requests halting the video to talk about a feature or episode; the episodes usually have a duration of somewhere between several hundred milliseconds and the order of 10 s. The person requesting the halt points out what he finds salient, describes and interprets the episode, and generates hypotheses to be tested in the remainder of the same tape and in the remainder of the database as a whole. The other author also provides his description and interpretation. The episode is discussed until both analysts feel that there is nothing more to say about it at that moment, though subsequent periods of writing often turn up additional features, which are discussed during the next meeting on the following day. In this way, we work image-by-image through the video and, correspondingly, line-by-line through the transcript. When appropriate (e.g., when there is transactional trouble), we hypothesize what might happen next before moving on to confirm or disconfirm our hypothesis. We then spend the following hours writing individual analyses, which we share with one another, comment upon, and confirm or disconfirm in the remainder of the database. On the following day, we continue our collective analysis, both generating new hypotheses and categories and, simultaneously, (dis-)confirming existing ones. Once we have constructed hypotheses about possible features, we test them in randomly selected materials from the database as a whole. If there is contradictory evidence, then we revise the hypothesis and test it again in the data source materials.
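The step of testing hypotheses in randomly selected materials can be pictured as drawing episodes at random from the database, so that a hypothesis is not checked only against the material that suggested it. The sketch below assumes a hypothetical catalogue of episode identifiers and a fixed random seed so that a draw can be repeated. Neither the identifiers nor the function `draw_test_episodes` comes from the study itself.

```python
import random


def draw_test_episodes(episode_ids, source_episode, k, seed=0):
    """Randomly pick k episodes, excluding the one that suggested the hypothesis."""
    candidates = [e for e in episode_ids if e != source_episode]
    if k > len(candidates):
        raise ValueError("not enough episodes to sample from")
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(candidates, k)


# Hypothetical catalogue: five tapes with three episodes each.
catalogue = [f"tape{t:02d}_ep{e}" for t in range(1, 6) for e in range(1, 4)]
picked = draw_test_episodes(catalogue, source_episode="tape01_ep1", k=4)
```

Excluding the source episode keeps the test material independent of the episode in which the hypothesis was first formulated. A revised hypothesis would be checked against a fresh draw, for example by changing the seed.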

Consistent with hermeneutic theories concerning the interpretation of texts and actions, we know that our understandings gained by living in the situations (as ethnographers, teachers) enable our interpretive efforts; and our interpretive efforts lead to explanations that enhance and elaborate our (intuitive) understandings. The two processes enacted in interpretation (i.e., understanding and explaining) stand in a dialectical relationship, each presupposing the other. Ethnomethodological studies in particular suggest that without an understanding of the situation from which data are culled, their classification and measurement are impossible.

Conversation analysis

All relevant videos are digitized to make them available for analysis using iMovie HD and the professional version of QuickTime Player (Macintosh OS X). The software allows us to slow down and speed up the recording, which we interpret image by image to capture phenomena at the microlevel, where we often observe patterned actions that the speed of everyday activity does not allow us to observe and become conscious of in real time. The recorded events are transcribed and augmented with selected salient video frames. The audiotapes of classroom events, interview sessions, and debriefing meetings are also transcribed and made available for analysis. In the school setting, the first transcriptions are often completed by the high school student research assistants, because of the high fidelity with which they capture student contributions to the conversations in the science lessons.

Given the central role of face-to-face transactions as sites of society-producing interpersonal behavior, conversation analysis is a method of choice for sociologists to study the pragmatics of the production and reproduction of societal formations such as school science classes. Conversation analysis as a method of inquiry has had a considerable history; the inclusion of prosody in the analysis of transactions, however, is more recent and much less common. Face-to-face transactions are ritualistic in nature and integrally tied up with the emotional states of the participants; this ritualistic nature and the attendant emotions appear to be produced and reproduced by means of rhythmic synchronization that can occur at different temporal scales. Within-group synchrony and between-group asynchrony may in fact constitute those resources other researchers seek in building theoretical models to understand how affect mediates cohesion within a group of people and their differentiation from the outside. Because prosody, together with facial and other bodily displays, constitutes an important avenue for articulating and making available emotions to others (Roth 2007a), the analytic method that integrates traditional conversation analysis with precise measurements of prosody has become an important tool for sociologists who study the pragmatics of the production and reproduction of culture.

Data analyses special: prosody

The analysis of speech parameters is a powerful tool in the hands of ethnographers who study naturally occurring situations including the emotions and changes thereof that transaction participants make available to one another. To conduct the quantitative/qualitative analysis of speech (and its relation to other perceptual features in the setting), the transcripts of selected lessons are enhanced to contain information regarding sequencing (overlap/latching), timed intervals, characteristics of speech production (stress, lengthening of phoneme, intonation, loudness), and comments. At this stage, the software packages Peak DV 3.21 and PRAAT (www.praat.org) are used to work with the soundtrack, to amplify the volume to improve the hearing of doubtful words, to measure the pauses using the waveform display of the sound, and to establish pitch levels (F0), means of the first formant (F1 mean), and pitch contours, which all are clearly visible in the PRAAT display of a particular sound. When errors in the determination of pitch by the PRAAT algorithm are apparent, we obtain the correct pitch by zooming into the sound spectrum until the pitch can be determined by means of visual inspection that identifies the repeating structure in the waveform and by making a length measurement by hand.
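The manual fallback described above amounts to measuring how long one or more repetitions of the waveform pattern last and taking the reciprocal. The following is a minimal sketch in Python, assuming that the analyst reads off the start and end times (in seconds) of a span covering a whole number of cycles; the function name and the example numbers are illustrative only and do not come from the study.

```python
def f0_from_cycles(t_start: float, t_end: float, n_cycles: int = 1) -> float:
    """Estimate fundamental frequency (Hz) from a hand-measured span.

    t_start, t_end: times in seconds bounding n_cycles complete repetitions
    of the waveform pattern (hypothetical hand measurements).
    """
    if n_cycles < 1:
        raise ValueError("n_cycles must be at least 1")
    duration = t_end - t_start
    if duration <= 0:
        raise ValueError("t_end must be later than t_start")
    # One cycle lasts duration / n_cycles seconds; F0 is its reciprocal.
    return n_cycles / duration


# Illustrative values only: five cycles read off between 0.1000 s and 0.1347 s.
print(round(f0_from_cycles(0.1000, 0.1347, n_cycles=5), 1))
```

Measuring across several cycles rather than one reduces the relative error of a hand measurement, which is why the sketch accepts a cycle count.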

To understand pitch, consider that each sound is produced by a superposition of a range of frequencies. The fundamental pitch (F0) is the lowest frequency produced by the vocal cords. It is the strongest correlate with how a listener perceives the intonation and stress given to words by the speaker; it also is the strongest correlate with psychological and sociological features of a situation. Formants, that is, vocal resonances, are the result of constriction and length variation in the vocal tract to differentiate sounds, in particular diphthongs and vowels. F1 mean is the frequency of the first formant (constituting significant energy concentration in the spectrum) averaged over an utterance. Although formants are largely produced to differentiate sounds, a review of theoretical models shows that changes in F1 means are generally expected to correlate with specific modal emotions including irritation and anger (Scherer 2003), which prominently feature in one of our analyses. Following ’t Hart et al. (1990), a perceptual approach was chosen to the study of intonation both with respect to categorizations of natural speech and visual characterizations of pitch contours. Not unlike experimental studies, the speech in the situations of interest is categorized by judges (often university students) in terms of a set of emotions that have been explored in psychophysiological studies. Quantitative information about speech parameters is obtained independently through the analyses using PRAAT. Our observations and categorizations then are compared with the results of existing research made available in literature review studies. Interpretative convergence means that the quantitative and categorical assessments in this study are consistent with those provided by experimental (psychophysiological) studies.

To conduct a direct comparison of the pitch contours of two speakers, several occurrences of the two forms of the same word for both speakers were normalized for time of articulation and, in the few instances where the beginning pitch level differed (speakers always adjust it to context), adjusted for that difference. The values for each person and each form of the word were averaged and plotted. Normalization is necessary for both time and pitch to be able to compare pitch contours across contexts and speakers. Time-normalization is necessary to adjust for the different speech (delivery) rates of different speakers or the same speaker in different contexts. Normalization in terms of beginning pitch is necessary because absolute pitch levels change according to the context, for example, when speakers adjust their absolute pitch to that of the previous speaker. For statistical comparison of two pitch contours, the resulting curves are sampled at or near the points on the temporal axis where PRAAT had extracted pitch from the sound spectrum.
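The normalization steps just described can be sketched in Python with NumPy. This is an illustration under stated assumptions rather than the procedure actually used: the function names, the number of resampling points, the use of linear interpolation, and the choice to remove the beginning pitch by subtraction (rather than, say, by a ratio) are all hypothetical.

```python
import numpy as np


def normalize_contour(times, pitches, n_points=50):
    """Time- and pitch-normalize one pitch contour.

    times: sample times in seconds (increasing); pitches: F0 in Hz.
    Returns n_points values on a 0..1 time axis, expressed relative
    to the beginning pitch (assumed here: simple subtraction).
    """
    times = np.asarray(times, dtype=float)
    pitches = np.asarray(pitches, dtype=float)
    # Rescale time so every token runs from 0 to 1, regardless of duration.
    rel_time = (times - times[0]) / (times[-1] - times[0])
    grid = np.linspace(0.0, 1.0, n_points)
    resampled = np.interp(grid, rel_time, pitches)
    # Remove the speaker- and context-dependent starting level.
    return resampled - resampled[0]


def mean_contour(tokens, n_points=50):
    """Average several normalized tokens, given as (times, pitches) pairs."""
    return np.mean([normalize_contour(t, p, n_points) for t, p in tokens], axis=0)


# Made-up tokens: same shape, different duration and starting pitch.
token_a = ([0.0, 0.1, 0.2], [110.0, 130.0, 144.0])
token_b = ([0.0, 0.2, 0.4], [120.0, 140.0, 154.0])
print(mean_contour([token_a, token_b], n_points=3))
```

After normalization the two made-up tokens coincide, which is the property that makes contours from different speakers and contexts comparable.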

To conduct comparisons of speech articulation rates of the same speaker within and across utterances and situations, the standard measure of syllables per second was used. Information about speech intensity was culled from PRAAT, with special attention to changes in speaker orientations and distances, which would change the intensities that reach the camera microphone, making impossible any inter-situation and inter-word comparisons. Speech energy and speech power are measures that the software package provides.

Production and reproduction of prosodic alignment

With Émile Durkheim and the beginning of their field, sociologists have recognized the central role that individual and collective emotions play in producing and reproducing society. Fundamental to understanding emotion (affect) as a central moment in the making of society is the double role prosody plays in face-to-face transactions: it is both a resource for acting in a situation and an outcome of actions produced by others and oneself. It is, among other bodily expressions, through prosody that transaction participants communicate their orientations and the effect that another’s prior action had on the Self; and for sociologically oriented (Marxist) linguists it is through prosody (intonation) that participants make available to one another social evaluations and ideology (Bakhtine 1977). In this section, we exemplify our general findings drawing on a small number of cases reflective of the universe of data analyzed. The results show how, after working together for several months, participants in coteaching produce and reproduce similar prosodic patterns, which are new features at least for one of the participants (both new and resident teacher). Our ethnographic and interview data show that when the individuals involved are asked about their relationship with their co-participants, they generally express high levels of agreement and solidarity. On the other hand, in those situations where prosodic alignment does not occur, we also find participating teachers talking about the conflict-laden relationships they have with their coteaching partner.

Two main features with different temporalities were observed in the reproduction of prosodic patterns: inter-individual reproduction of prosodic patterns over time (of working together) and alignment of prosodic patterns within speaking turn pairs. Both features play significant roles in the communication of emotions but do so at different temporal scales. The first phenomenon was observed as part of the reproduction of (classroom) culture, as newcomers to the field of teaching come to speak like those who have been teaching in a particular school for longer periods of time. The second phenomenon is characteristic for the process of producing specific societal formations by means of transactional face-to-face work, and therefore, for the moment-to-moment production of conflict and alignment.

Inter-individual reproduction of utterance/prosody features over time

As part of our ethnographic observations, we noted that new teachers working for some time with experienced teachers tend to “pick up” expressions, grammatical features, and intonations. Our interviews reveal that participants are not normally conscious that these parameters of their communicative actions are changing. These observations are particularly striking when we note that different new teachers working with the same teacher (same or subsequent year) change in similar ways, becoming more like the experienced resident teacher. But simultaneously we observed that the resident teachers also changed in the direction of the new teachers they were working with. Here we provide two typical cases selected from a chemistry lesson to show that, after working together for 3 months, teachers exhibit similarities in their ways of speaking. These cases stand in an exemplary way for observations we have made across participating master teachers and the new teachers coteaching with them.

First, Alex uses certain phrases and an oral grammar that our ethnographic studies found to be characteristic of the students’ African American culture but which the new teachers in our database generally do not use when they arrive at the school. However, as they coteach with Alex, the new teachers, independent of their cultural background, begin to speak using these phrases and grammar. For example, in communicating with his students, Alex frequently uses “really, really” instead of “extremely” (e.g., “really, really hot” when referring to the temperature of a copper coin being heated with a blow torch). Using these speech features as tracers, we find that in any given lesson Alex uses the phrase frequently and it is customary for coteachers in the class to begin using the term with increasing frequency in their transactions with the students. That is, although they are not aware of the process, new teachers “pick up” this way of speaking from Alex during their coteaching experience: the coteachers’ oral texts become similar to one another and are of a type that encapsulates efforts to communicate, educate, and maintain the energy and rhythm of their teaching. This is precisely what we might have expected from the perspective of a Marxist philosophy of language, where each utterance in “verbal interaction” is considered to be social and ideological through and through, including its structure and intonations.

A concrete example in the following two transcripts shows how Chris, who is white, somewhat overweight and laid back, acts like Alex, who is dark skinned, lithe, athletic, very bubbly, and energetic, exhibiting similar levels of verve to many of the students. Here, the speaker makes an assertion, pauses and seeks affirmation from the class with a rising and then horizontal inflection and rising volume (’rIGHT-). The pitch rises sharply, and the utterance stops suddenly and sharply. Pauses, generally longer before than after, surround the utterance. This pattern can easily be discerned (a) on the graphically displayed pitch spectrum of the sound and (b) in the speech intensity display, where higher amplitudes characterize louder talk (Fig. 1a). This figure also shows that the surrounding context (a longer pause preceding the word, a shorter pause following the word, and a jump of the pitch for beginning the subsequent talk) is nearly identical.

figure a
Fig. 1 After working with Alex for a while, Chris has “picked up” some of the utterance features characteristic of Alex’s speech. (a) Alex utters “right” preceded by a pause and followed by a brief pause and a jump in pitch (black circle) with the next utterance; Chris exhibits the same speech-pause-speech and pitch contour pattern (“right”). (b) A direct comparison of time- and pitch-normalized articulations of the two forms of “right” shows how similar the speech productions of the two have become.

The direct comparison with Chris provides evidence for the pausing immediately prior and subsequent to the utterance of “’rIGHT-”; and there are nearly identical pitch and intensity contours (Fig. 1b). In most instances, the pitch tends to begin between 107 and 112 Hz (which is considerably below Alex’s normal pitch range of 150–180 Hz), rises rather rapidly, and then plateaus for both speakers near 144 Hz. To obtain a measure of the similarity of the contours, we averaged for each person 30 incidences of the target utterance, norming the absolute pitch height and the length as indicated in the methods section. The resulting two profiles were correlated. The statistical correlation of the two contours provides evidence for their high degree of similarity, r(29) = 0.96, p < 0.0001. As part of our ethnographic work, we have heard this pattern of speaking in Alex’s teaching prior to his coteaching with Chris. It is apparent that in Alex’s successful coteaching experiences, new teachers adopted this pattern of speaking while coteaching with him. In this lesson, Chris exhibits a similar way of teaching and exhorting students to think about what was asserted.
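The correlation of two averaged profiles can be sketched as follows. This is a minimal illustration in Python, assuming that both contours have already been normalized and sampled at the same points; the function name and the synthetic rise-and-plateau shapes are hypothetical and are not the study’s data.

```python
import numpy as np


def contour_correlation(contour_a, contour_b):
    """Pearson r between two equally sampled contours, plus degrees of freedom."""
    a = np.asarray(contour_a, dtype=float)
    b = np.asarray(contour_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("contours must be sampled at the same points")
    # Off-diagonal entry of the 2x2 correlation matrix is Pearson's r.
    r = float(np.corrcoef(a, b)[0, 1])
    # For a Pearson correlation, df = number of paired points minus two.
    return r, a.size - 2


# Synthetic example: two rise-then-plateau shapes that differ slightly.
grid = np.linspace(0.0, 1.0, 20)
speaker_1 = 35.0 * np.minimum(grid / 0.4, 1.0)
speaker_2 = 33.0 * np.minimum(grid / 0.5, 1.0)
r, df = contour_correlation(speaker_1, speaker_2)
print(f"r({df}) = {r:.2f}")
```

Because Pearson’s r is insensitive to overall level and scale, it captures similarity of contour shape, which is the property of interest once beginning pitch and duration have been normalized.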

There is a second type of context in which the word “right” appears. In the following two transcripts, characterized by a less sharp rise and less sharp stop, the utterance either immediately follows a statement or immediately is followed by a continuation of the sentence. Following the previously explained method of processing and comparing, we find again a high degree of similarity between the two pitch contours, r(30) = 0.86, p < 0.0001 (Fig. 1b). The pitch begins much higher than for the first type of articulation of “right,” rises less sharply, but generally ends at the same pitch level (about 144 Hz). This pattern marks the end of a declarative statement (“you see this and that’s an element all by itself (0.12)’right”) and is an indication that the utterance is going to continue even though there may be an extended pause (e.g., “’right (1.52) and anything else”). This pattern, too, is one that has characterized Alex’s speech for many years (though it has become accessible to us quantitatively only recently after finding suitable software), and Chris uses this pattern now without, as he indicated repeatedly during interviews, having been aware of acquiring this manner of speaking. The prevalence of this feature is evident in the following two extended turns at talk that each teacher had in a chemistry lesson concerning elements, chemicals, and compounds.

figure b

Our ethnographic study, which has followed many coteaching configurations in the course of the 7-year project, shows general convergence in these and other patterns of speech in other coteaching combinations as well. This convergence can be observed even in the case of cultural differences between participant teachers, as evidenced in the present examples. In the few instances where we observed conflict-laden relations between coteachers—often instances where we have had to reassign the new teacher to work in a different class—such convergence did not occur.

These findings are consistent with earlier research, according to which individuals with less institutional power and status adjust their speech parameters generally and their prosody in particular to the person with more power and status. The results also are consistent with the research that shows a lower perceived potency of person–person transactions when pitch convergence was not observed (Gregory 1999). However, in the case of coteaching, the situation is more complex as there are teacher–teacher, teacher–student, and teacher–teacher–student transactions. Accordingly, a research hypothesis we pursued in this study was that new teachers become like their coteachers because they are entrained into the same rhythmic patterns that are characteristic of the exchanges between the regular classroom teacher (here Alex) and his or her students. We had already reported the entrainment into movement patterns and the use of physical resources that occurred in coteaching teams (e.g., Roth et al. 2004a, 2005). A likely candidate explanation for the entrainment into vocal patterns to be explored by further research is that the students and regular teachers are and have been part of the school culture for longer periods of time, whereas the new teachers newly arrive. It is more likely that the newly arrived will lock into existing frequencies, because, to use an analogy from physics, his or her “inertial mass” is much smaller than that of the existing group (class, school culture). (Two coupled harmonic oscillators, like two pendulums, experience frequency shifts that take into account their relative masses, where the lighter oscillator moves more than the heavier one.) We explore this hypothesis in the next subsection.

Alignment of utterance/prosody features within speaking turn pairs

When new teachers first come to the inner-city schools—even if they are African American like the students in this study but from a different area in the US—they frequently find themselves in conflictual situations with students. Numerous studies undertaken in Philadelphia and New York City show that teachers and students experience difficulties in producing environments that are conducive to learning science. A challenge for all participants is to anticipate the culture of the other and produce forms of practice that expand collective agency. Understanding others’ culture appears to take time and effort and in some instances classes can become dysfunctional as teachers and students give up on trying to produce interstitial culture that might otherwise produce success. Even Alex struggled during his first semester at City High. He described himself as an “Anglo New Yorker” who had developed “amalgams of himself,” the person he considers truly to be, and the Black, Hispanic teacher that he became while teaching in an urban school in Miami. Although he had initially experienced difficulties (“With the Black kids from the ‘hood, my ‘black man’ was a total fiasco”), Alex eventually learned ways of teaching and disciplining effectively. As Alex prepared to teach at City High he considered himself ready for his new assignment. He did not anticipate the struggle that eventual success would entail. However, the presence of new teachers in his classes proved beneficial to Alex, who commented that:

The presence of the two new teachers from the university saved me as I had colleagues with whom I could coteach. They gave me insight into the teaching, planning, and structure of the classes. They gave me breathing space. With them teaching, I could look at the class and attempt to solve the discipline and teaching problems. I was able to forge relationships with the students during the class; I would take students aside and ask them why they were behaving in a given manner, or if they were learning. I was able to see the weaknesses in our approach and in our content. I would not have survived without the new teachers.

Our mesolevel ethnographic descriptions in problematic classrooms often contain the everyday terms “out of sync,” “disharmony,” and “head butting.” It turns out that the pitch levels of these teachers are different from those that their students display; one does not observe the continuity or similarity in other prosodic parameters that is observed once more harmonious relations are established. For example, Fig. 2 displays the comparison of the pitch levels involving Victoria, an immigrant new teacher of Hawaiian origin, and Mirabelle, an African American student. Typically (and depending on the base pitch of the two speakers), a mean difference of greater than 100 Hz is observed between the two speakers. Such differences also are observed in a harmonious classroom when individuals articulate differences of fact or opinion (see below). In the present instance, there is no convergence, even though, based on research linking power and status to prosodic features, there is a possibility that the lower power and status students ought to align with the teacher, who institutionally has more power (e.g., getting a student suspended). In a functioning, “harmonious” coteaching relationship, however, we do observe different patterns of development.

Fig. 2 Victoria, an immigrant new teacher, is in a conflict-laden relationship with her mentor teacher Alex (Cuban African American). Her pitch (open square box) does not align with his, nor does she “pick up” features of his speech; the mean difference between the two speakers exceeds 100 Hz. The differences are also observable in interactions with students, here Mirabelle (black circle), where the pitch continuity observable in successful classes does not appear.

The combined evidence from our ethnographic work and the analysis of speech shows that in functioning and low-conflict or conflict-free coteaching configurations, coteachers tend to become like one another, including, as we show here, in their prosody. This is particularly observable in the way a seasoned teacher accommodates students by enacting pitch continuation to match the students’ pitch rather than the other way around, as one would have expected from earlier research (Gregory 1994). When difference occurs in this type of situation, the pitch remains about the same from one speaker to the next, or moves into a lower register.

In teacher-student transactions, both within and between lessons, Alex (nearly) always seems to find the “right tone” in diverse situations. In addition to the semantic (word choices) and grammatical features of his language, which align with those of the students’ culture, an analysis of pitch in consecutive turns shows that Alex (unconsciously) matches prosodic features, including speech intensity and pitch, to the preceding turn produced by a student. An example of such matching is provided in the following episode, which pertains to the conversation about the relative hotness of candle and welding-torch flames.

figure c

Following a student utterance, Alex queries to understand the student contribution, “when we did the candle?” Alex’s pitch rises from 150 Hz to end at 230 Hz. However, the student answers at a much lower level, moving from 110 to 130 Hz in the course of uttering “yea.” When Alex continues after the student, he (likely unconsciously) first matches his own pitch to that of the student at 130 Hz before returning to his own preferred pitch level in the present context, between 180 and 190 Hz. Furthermore, the intensity levels show a similar pattern. The peaks of Alex’s intensities move from 64 dB to 74 dB but, after the student responds with a much softer voice (60 dB), Alex decreases his speech volume to 64 dB and moves to 67.5 dB for each of the three syllables of the utterance “ca = andle.”
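The comparison at turn boundaries in this episode can be made explicit with a small sketch. The Python code below is an illustration only: the `Turn` structure and the `boundary_gaps` function are hypothetical names, the pitch values are those reported in the paragraph above, and the end value of Alex’s second turn (185 Hz) is an assumed midpoint of the reported 180–190 Hz range.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    start_hz: float  # pitch at the onset of the turn
    end_hz: float    # pitch at the end of the turn


def boundary_gaps(turns):
    """Return (previous speaker, next speaker, pitch gap in Hz) per turn transition."""
    return [
        (prev.speaker, nxt.speaker, abs(nxt.start_hz - prev.end_hz))
        for prev, nxt in zip(turns, turns[1:])
    ]


# Values from the candle episode; 185 Hz is an assumed midpoint of 180-190 Hz.
episode = [
    Turn("Alex", 150.0, 230.0),     # "when we did the candle?"
    Turn("Student", 110.0, 130.0),  # "yea"
    Turn("Alex", 130.0, 185.0),     # resumes at the student's 130 Hz
]
for prev, nxt, gap in boundary_gaps(episode):
    print(f"{prev} -> {nxt}: {gap:.0f} Hz")
# prints:
# Alex -> Student: 120 Hz
# Student -> Alex: 0 Hz
```

A gap near zero at a boundary corresponds to what the text calls pitch matching or continuity; the large gap at the first boundary reflects the student answering well below the end of Alex’s question.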

In our database, the same patterns are observed for Chris after having cotaught at least one lesson per day with Alex for over 2 months. In episode 6 Chris engages a student about heating a penny. The student suggests that holding the penny in the flame for a long time would lead to a different result. Chris’s utterance of “right” (turn 02) matches the student’s pitch at 216 Hz both during the overlapping word “right” (an acknowledgment of understanding and an indication by the recipient that he or she is listening, while indicating that the speaker may continue) and in the subsequent response, where Chris suggests that the flame has a specific temperature and that, however long one holds the penny, it will not exceed the temperature of the flame. With respect to intensity, he picks up at about the mean intensity of the student’s last four words (71 dB) and finishes at about 63 dB.

figure d

Here, both teachers adjust their volume and pitch to match the levels of the preceding student; such adjustment occurs even in those cases where students speak at a base or context-related pitch that differs by as much as 100 Hz from that of either teacher. This contrasts with the observation exemplified in the case of Victoria, whose pitch never aligned with either that of the students or her fellow teachers. In the case of Chris, this ultimately leads to pitch continuity across multiple turns involving teachers and students; this phenomenon of multi-turn pitch matching is exemplified in episode 7 and the associated representation of pitch in Fig. 3a. In this situation, Chris affirms the statement that a student has made, matching his pitch to the pitch with which the student ends. Chris descends with the pitch nearer to his normal range (turn 02), followed by Alex, whose pitch rises toward the end of the utterance (turn 03). Chris picks up from Alex, initially matching his pitch and then increasing the pitch as he emphasizes the terms “really” and “hot” (turn 04).

figure e
Fig. 3 (a) In the course of coteaching with Alex, Chris came to align pitch levels with students and Alex alike. (b) Here, pitch discontinuity signals disagreement on matters of fact and opinion against a background of pitch levels always adjusted to the preceding speaker.

In this situation, we observe prosodic complementation (Szczepek 2001), which, according to Beatrice Szczepek Reed, describes:

another way in which participants have been found to collaborate intonationally: a first speaker has produced a contour which in itself is complete, but we expect it to be followed by a particular contour from the next speaker. Both contributions constitute complete turns respectively. However, although the first participant’s turn signals turn completion prosodically, syntactically and pragmatically, the second contour seems to complement the first so that the two together form a prosodic pair. (p. 36)

In this example (Fig. 3a), three speakers (a student, Chris, and Alex) exhibit prosodic complementation (turns 1–4 with a duration of less than 3 s). Although each contribution could be considered a complete turn, successive speakers anticipated the previous speaker’s prosody and produced complementary utterances. These are examples of entrainment that included alignments in pitch, intensity, and content, which are observed clearly in the prosody of the end-start pairs (S–C; C–A; A–C). Furthermore, when Alex chimes in again, he confirms the content of Chris’s utterance by repeating it, and signals and produces alignment and agreement by continuing the pitch contour. Discord and difference, on the other hand, would have been indicated and produced by significant pitch and intensity differences with the previous speaker.

This matching process contrasts, for example, with the earlier mentioned rapid rise in pitch when Chris changes from an initial agreement with a statement to a disagreement. That is, the sudden rise in pitch signals an opposition between two ideas. Alex and, being entrained to him in this feature, Chris exhibit patterns of pitch matching unless they express displeasure and opposition to the current situation. Thus, at the beginning of the lesson, Alex called the class to order, “Excuse me! (0.20) Hey!” In this utterance, the pitch rises from about 90 Hz on “ex” to 290–315 Hz on “cuse.” This enormous rise in pitch clearly dramatizes the utterance, emotionally coloring the social situation and thereby evaluating and foreshadowing the kinds of transactions that may follow. The pitch then rises from 137 to 215 Hz on “me,” which can be heard as an exclamation. The subsequent “Hey!” rises from 275 to 348 Hz. In both situations, the high pitch (when compared with Alex’s normal pitch level between 150 and 180 Hz) is a resource for understanding him to express disapproval.

A final example of Chris speaking like Alex is seen in turns involving both teachers and some students. The pitches are matched to provide continuity between speakers. When difference is articulated semantically, the corresponding pitch levels tend to be much lower, in or near the speaker’s normal range. In the following episode, which is almost entirely represented in Fig. 3b in terms of pitch and speech intensities, Alex engages in an exchange with students about whether the different elements in a compound can be separated. Several students call out words and fragments, some apparently negating, others affirming the possibility of separating the elements by chemical means. Alex expresses reservation and opposition twice (turns 05, 07). In both instances, Alex’s pitch is much lower than that displayed by the student (Fig. 3b).

figure f

When Alex eventually provides the correct answer to his question (turn 09), his pitch moves well out of his normal pitch range to over 200 Hz, which is where the last student speaker (S2, Fig. 3b) has ended. The figure shows that after producing a pitch continuation above 200 Hz, Alex’s pitch returns to his normal range, with occasional spikes while producing emphasized syllables. It is this production of pitch continuity that new teachers, with few exceptions, are beginning to reproduce after coteaching with Alex for some time. Moreover, Alex not only matches the pitch levels of students but also reproduces pitch contours that previous speakers produce. This, too, is evident from Fig. 3b, where the student contours on “but you can” and “uh you know” find their correspondence in Alex’s production of “I know I can” and “I can break it.” Thus, in these data generally, there is convergence not only between the two teachers but also convergence and difference between students and the new teacher, which reproduces the continuation and differences in pitch levels the regular (experienced) teacher displays. Most importantly, therefore, teachers match the cultural patterns in prosody. New teachers, independent of their culture of origin, learn to produce these patterns in the course of coteaching with someone else already exhibiting this cultural feature. They thereby come to reproduce existing forms of transactional patterns and institutional relations.

Reviews of the literature point out three things: that cultural similarity has a statistically reliable effect on emotion recognition accuracy; that there are cross-cultural differences in the predictive accuracy of displayed emotions, which are highly but negatively correlated with the physical distance between the cultural groups expressing and perceiving the emotions; and that there may be cultural differences in the prosodic realization of disagreement. The evidence from the present work suggests that culturally different individuals not separated in space and participating in the same activity system for longer periods of time may not be disadvantaged in the recognition of emotion. They learn to produce and reproduce prosodic patterns and, with these, the attendant affective orientations to other participants co-constitutive of the setting and its social and societal structure.

The findings reported here differ from the studies concerning the relation between prosodic (pitch) features and power and status: the supposedly more powerful teachers aligned themselves with the supposedly less powerful students. But these results are consistent with an unpublished study following a clinical psychologist through her work with 24 schizophrenic patients. The results showed that with training and experience, her pitch values increasingly converged with those of her patients (Gregory 1999). That is, as the quality of the doctor–patient transactions increased, so did the convergence of the pitch (F0) values. On the other hand, for doctors in the course of their training in specialties, the distances increased, which has been interpreted as a decreased focus on interpersonal relations and an increased focus on the technical aspects of the case at hand. A candidate explanation may be that in the service professions (teaching, healthcare), providers may increase their effectiveness when they attune themselves to the recipients of the service, who, having the sense that the service provider really listens (exhibiting empathy or solidarity), respond better to the service. The present results are consistent with the hypothesis that periodic features (pitch, pitch contours, and rhythm) support the phenomenon of social entrainment. Thus, when one speaker uses a certain pitch, pitch contour, or speech rhythm, theories of social entrainment predict that the subsequent speaker will “tune in,” at least in situations of agreement, and will perhaps differ in situations of disagreement (e.g., Goodwin et al. 2002) or lack of (cultural) attunement. As Szczepek Reed noted, “Prosodic orientation thus seems to create a structural bridge between two turns which could not be achieved by verbal material alone” (Szczepek Reed 2006, p. 90). That is, prosodic orientation, like the semantic, syntactic, and contextual features of a situation, signals a common understanding and can be considered a bridge between two turns, a resonance event that is part of a dynamically evolving structural flux.

Heating up and cooling down classroom environments

There have been suggestions that convergence of long-term average spectra in general and those of pitch in particular is characteristic of conversations involving participants institutionally located such that they are said to have different degrees of power. In the previous section, we present evidence that working together over longer periods of time (about 3 months) leads to a convergence of speech patterns and prosodic displays. Prosody, like other nonverbal communicative means, is a transactional resource deployed for pragmatically dealing with issues at hand. If this is the case, then the divergence and convergence of prosody should be associated with practical action in more complex ways than a consideration of differences in institutional position alone would suggest. The event analyzed in this section exhibits and exemplifies precisely this feature: when differences appear in the content of talk and pitch levels increase, rather than decrease as we observed in the previous section, a “heating up” of the situation is paralleled by rising pitch levels, increased speech volume, and increased speech rates. When speakers increase their pitch levels, speech intensities, and speech rates over those of the previous speaker, they “heat up” the situation and “up the ante,” literally trumping the commitment made before with higher prosody values. On the other hand, speakers calm the situation when their conversational contribution is produced with lower speech volume, pitch, and speech rates. In fact, the changing nature of face-to-face transactions involving periods of conflict and periods of agreement within the same conversation is a natural laboratory for studying (i.e., in a single case) the role of emotions expressed by prosodic and other nonverbal means in the production and reproduction of the specific situation at hand and of society more generally. The changes in prosody with respect to conflict and its resolution are articulated and exemplified in the following episode, one of numerous episodes we could have selected to exhibit the same patterns, in which Victoria enters into a conflict with another person (sometimes the regular teacher, sometimes students). The sense of solidarity that exists among students, documented in our ethnographic and interview data, is shown here as being exhibited in rhythmic alignments of the listeners’ body movements and speaker’s voice.

In a nutshell, a student (Mirabelle)—always intending to be involved, checking on homework, wanting to pass the course, and avoiding detentions that would keep her off the basketball team—has an idea and attempts to explain it. Victoria, the new teacher, challenges the explanation and the student enacts an argument ritual that is similar to how she might argue if challenged by someone outside of the classroom. We therefore have a mutual focus of attention but the transient emotions (mood) are not shared. Continued prosodic misalignment and the correlative expression of anger feed back and aggravate the situation thereby producing conflict. Although Alex is coteaching and in the classroom at the time of this episode, he does not centrally participate in this series of episodes from the same event. The transactions involving students and Victoria are most salient in this vignette, especially transactions between Victoria, Mirabelle, and three of her peers. Mirabelle sits near the back so that most students cannot see her without turning around (Fig. 4).

Fig. 4 Seating arrangement of some of the key players in the episode

Just prior to the conflict beginning to articulate itself, Victoria has presented a “trick” to use a periodic table of elements to figure out the number of valence electrons possessed by an atom of a specific element. Mirabelle then announces that she has figured out a systematic way to remember the valence. Victoria reiterates that the placement in the table determines the valence, but Mirabelle counters that this is not what she is talking about. There is another turn pair, in which Victoria points out that what she has said “is the trick.” The transcript in episode 9 picks up at this point (the underlined speech elements are represented in terms of pitch and speech intensity over time in Fig. 5, each alphabetical label refers to the panel of the same letter).

figure g
Fig. 5 Classroom conflict and resolution are correlated with rising (“heating up”) and falling pitch levels (“cooling down”)

In a quieter voice than she has used before, Mirabelle restates that there is another way to figure “it” (valence) out (turn 01). After a considerable pause—nearing the 1-s mark (turn 02)—Victoria utters in a determined fashion, “this IS the way to do this” (turn 03). As Fig. 5b shows, the pitch rises by almost 100 Hz during the utterance of “is” and is also accompanied by considerably higher than normal speech intensity. There is another stressed word: rather than falling to the end of the turn, the pitch rises again dramatically to 270 Hz on “this,” thereby reinforcing that her (Victoria’s) way, the one she just explained, is the only way to remember valence, that is, the topic at hand. Far in the background, Alex comments calmly and with very low, almost inaudible, speech intensity, “I hope so” (turn 04), thereby both sustaining Victoria’s claim but also providing a potential resource for avoiding a heating up of the conflict that appears to announce itself. First Tracy and Sasha, then Tasha turn their bodies and heads while Mirabelle is in the process of orienting and situating herself for the account of her method to be produced. She announces the intention to articulate her method uttering “awright” (turn 07), and then explains how subtracting the number “2” from the atomic number of the first row of elements generates the chemical valences of the associated atoms. With the utterance of “awright,” Mirabelle moves her body right to left and her hand into a forward position, and then erects the body again as if taking a position from which to launch the articulation of her method of recalling and remembering chemical valences.

In turning their bodies and heads to look squarely at Mirabelle, the three peers exhibit their expectation that something is to be forthcoming—explicitly articulating encouragement for this something to begin. That is, what has happened so far and how it has happened has led to the expectation that the situation requires some conversational action on the part of Mirabelle. The resources for such an expectation clearly have been produced in the immediately preceding exchange. Within the culture of these students in this school, the teacher perhaps has articulated a challenge, which Mirabelle, to preserve her social capital with her peers, has to take up. Up to this point, however, Mirabelle has stayed calm despite the determination and irritation that Victoria displays in her voice. Contrasting Fig. 5a and c reveals that following the low pitch and low volume intervention on the part of the regular teacher, Mirabelle’s pitch has decreased to below 200 Hz.

Mirabelle has been subtracting two from the atomic numbers in the first full row of the periodic table of elements (lithium, beryllium, boron, carbon, nitrogen, and oxygen). Tracy is the first to ask Mirabelle, in a low-pitched and low-intensity voice, where the two is coming from. Episode 10 begins as Victoria repeats the question but, as shown in Fig. 5d, her pitch rises to near 300 Hz and speech intensity increases above normal levels, especially as she utters “two” (turn 01). There is a brief pause, interrupted by Tasha’s “from” uttered at low speech intensity. At this point, Mirabelle launches her body forward and raises her voice, and her pitch repeatedly moves to 500 Hz and beyond as she utters, “I’m just saying, just do the number two” (Fig. 5e). (Correlatively, and confirming this interpretation of irritation/frustration, the F1 means, that is, the means of the first vocal resonance, which is associated with the differential pronunciation of sounds, move from 2,727 Hz during the calm presentation to 3,488 Hz.) Mirabelle co-articulates her frustration in her voice (prosody), and the production requires energy. In the present situation, the power in the air (energy in the air per unit time) increases more than tenfold (to 8.4 μW/m²) at the onset of turn 05 (episode 10) over the power normally used in her speech (less than 1 μW/m²). That is, the expression of her frustration in fact drains energy, here used in the production of her utterances.
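The size of this increase can be checked with a few lines of arithmetic. The sketch below is only an illustration: the outburst value of 8.4 μW/m² comes from the text, whereas the baseline of 0.8 μW/m² is a hypothetical stand-in for "less than 1 μW/m²" (a more-than-tenfold increase implies a baseline below 0.84 μW/m²). The decibel difference is the conventional way of expressing a power ratio and is not a figure reported in the study.

```python
import math


def power_ratio(event_uw_m2: float, baseline_uw_m2: float) -> float:
    """Return how many times larger the event power is than the baseline."""
    if event_uw_m2 <= 0 or baseline_uw_m2 <= 0:
        raise ValueError("power values must be positive")
    return event_uw_m2 / baseline_uw_m2


def ratio_to_db(ratio: float) -> float:
    """Express a power ratio as a level difference in decibels."""
    return 10.0 * math.log10(ratio)


OUTBURST_UW_M2 = 8.4   # reported in the text
BASELINE_UW_M2 = 0.8   # assumption: illustrative value below 1 μW/m²

ratio = power_ratio(OUTBURST_UW_M2, BASELINE_UW_M2)
print(f"ratio: {ratio:.1f}x, level difference: {ratio_to_db(ratio):.1f} dB")
```

A tenfold power ratio corresponds to a 10 dB level difference, so any baseline below 0.84 μW/m² yields a difference just above 10 dB.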

figure h

Mirabelle continues in an accusatory way saying that her teacher did not want her to copy text, with the possible implication that what she (Mirabelle) searches for is understanding and having a surefire way to remember valence. The frustration also is apparent from the way she uses her body to direct others’ attention, for example, to a particular place in the periodic table (Fig. 6). At this stage, her speech rate has increased from an average of 5 syllables per second to over 8 syllables per second, consistent with the generally observed large increases of this parameter with (hot) anger. As Durkheim (1893) already noted, when a belief is at stake, others cannot question it without bringing about an emotional reaction (more or less violent) against the offender. And this emotional reaction, as we show here, is co-expressed and communicated not semantically or syntactically but prosodically. Appreciation and social evaluation are expressed by means of expressive intonation.

Fig. 6 Mirabelle’s emotional engagement can be read, among others, from the way she uses her body, arms and hands, to direct the attention of others

Following the outburst, Tracy, who has turned around to face Mirabelle, suggests in a very low speech intensity that this way does not work for “all of them”; her pitch is more than 200 Hz lower than the final syllable Mirabelle has uttered (Fig. 5f). Victoria restates that Mirabelle’s method does not work for all instances. The pitch returns to 230 Hz and then descends to 185 Hz at the end of turn 09. That is, Tracy’s turn can be heard as a resource that has had a calming and defusing effect. Mirabelle, too, returns to a lower pitch level (near 300 Hz). Her speech rate has decreased to 3.5 syllables per second. The question “well, what?” already appears to be more conciliatory (consistent with an F1 mean of 1,561 Hz), and this possibility for a resolution can be heard in the prosody. Yet irritation can be observed again toward the end of the utterance (turn 10), where F0 moves beyond 500 Hz when Mirabelle comes to the point of the number “two,” associated with an increase in speech intensity to above 80 dB (Fig. 5g) and an increase in F1 mean to 2,626 Hz.
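Pitch changes stated in hertz are hard to compare across speakers with different ranges. One common convention, which is an assumption here and not a measure the study reports, is to express the interval between two pitch values in semitones, 12·log2(f_to/f_from). The sketch below applies it to two pairs of values taken from the text: Mirabelle's move from about 500 Hz to near 300 Hz, and the descent from 230 Hz to 185 Hz at the end of turn 09.

```python
import math


def semitone_interval(f_from_hz: float, f_to_hz: float) -> float:
    """Signed interval in semitones from one pitch to another.

    Negative values indicate a fall in pitch, positive values a rise.
    """
    if f_from_hz <= 0 or f_to_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f_to_hz / f_from_hz)


# Values taken from the text; treating them as point estimates is a simplification.
mirabelle_fall = semitone_interval(500.0, 300.0)   # roughly -8.8 semitones
turn_09_fall = semitone_interval(230.0, 185.0)     # roughly -3.8 semitones

print(f"Mirabelle: {mirabelle_fall:.1f} st, turn 09: {turn_09_fall:.1f} st")
```

On this scale a doubling of frequency is always 12 semitones, so a fall of 200 Hz from 500 Hz counts as a larger perceptual step than a fall of 45 Hz from 230 Hz.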

The voice intensity and sharply rising pitch at the end of the previous episode 10 are resources for understanding (interpreting) that the conflict has not yet been resolved for Mirabelle. Further contributions, uttered at lower pitch levels and speech intensities, appear to add to the calming effect. Thus, Mirabelle still begins turn 04 of episode 11 above 300 Hz but progressively descends, following the pitch of Gavin (moving considerably above his normal pitch range), which overlaps hers and then drops progressively to a value below 200 Hz (Fig. 5h).

figure i

Following two brief, partially overlapping turns, Victoria begins to explain again that she has provided an explanation to assist Mirabelle in the best way to learn about valence electrons. In this, Victoria begins with a higher than normal pitch (just below 300 Hz) and high speech intensity (83 dB), but descends to near 200 Hz for the second part of the utterance and lower speech intensity (around 74 dB) (Fig. 5i). As if she had realized during the brief 0.40 s pause the potential of her prosody to rekindle the conflict, Victoria’s prosody parameters change to take on lower values that—on a mesolevel—are heard as less aggressive and therefore are more conducive to conflict resolution. Mirabelle has not yet given up on her alternative explanation, as seen from the fact that she proposes a way of using it for the second row in the periodic table of elements, the row that begins with sodium (Na) (turn 13). The pitch level remains higher than normal, between 250 and 300 Hz and the speech intensity repeatedly goes to 80 dB and beyond (Fig. 5j).

Additional students “come to help,” as is apparent from the final episode depicted here. Thus, sitting in front of Mirabelle, first Stacy (turn 01) and later Ivory (turn 03) and Gavin (turn 05) together with Victoria (turn 04) suggest that Mirabelle has to “skip the middle,” meaning the section of the periodic table where the transition elements of metallic nature are listed. Victoria answers Mirabelle’s question about why this is the case by explaining that the elements in the middle constitute a special class (i.e., they are transition elements), about which the course is not concerned at the moment (turn 08), and that they would return to these elements in more detail later in the course (turn 10). Mirabelle, though still not giving up on her method in the content of her talk (turn 13), has returned to around 200 Hz in her pitch range and between 70 and 75 dB for the speech intensity (Fig. 5k). In this turn, her speech rate has returned to her normal level at around 5 syllables per second. At this point, the discussion continues for another minute, taking as its content the fact that Mirabelle’s method works for a particular section of the table and why only for this and not other sections. The tension, which has been apparent previously, has been resolved as seen in the return to normal levels of pitch level, speech intensity, and speech rates on the part of all participants.

figure j

The previous section already provides evidence that institutional position, normally said to wield control and power, and status do not determine the deployment of prosodic features. In fact, in a non-deterministic approach, position and status cannot determine social and societal phenomena, thereby allowing for successes or failures to emerge contingently from the dialectic of agency and structure. This has been shown in an interesting study where an undergraduate physics student interviewed physics professors about graphs: the dynamics of the transactions could not be predicted, as knowledge (“who is in the know?”) and power came to be the products of the events as much as being resources for the agency of the participants (Roth and Middleton 2006). Quite to the contrary, the evidence provided here indicates that prosody generally and pitch in particular are resources in the process of managing interaction rituals and the social orientations they involve. Among these social orientations is resistance to complying with the standard question–response sequence or exhibiting a stance that indicates the particular effort made to accommodate the other not only by responding but also by aligning the pitch or repeating the pitch contour. Sociological theories concerned with transactions and interaction rituals suggest that the preferred state in transactions is the alignment of rhythmic and prosodic features, because such alignment produces and intensifies emotional contagion, positive emotional energy, and, consequently, solidarity (Collins 2004). Considerable changes in prosodic parameters (e.g., pitch, beat) on the part of one participant with respect to others mean breaking out of the rhythmic repetitions and therefore out of attunement at the microlevel; they indicate that the person is not responding to the cues of others, who are frustrated in the process. Thus, “the failure of solidarity, down to the minute aspects of coordinating mutual participation in a conversation, is felt as a deep uneasiness or affront… as a feeling of shame” (Collins 2004, p. 110). In this study, we articulate the different degrees of coordination at the prosodic level.

These analyses, featuring the emergence and resolution of conflict, provide further support to the contention that prosody generally and pitch in particular are transactional resources available in face-to-face communication and for pragmatic purposes. Conflict is associated with increasing pitch levels over the previous speaker, whereas conflict avoidance and conflict resolution are associated with lowering pitch levels. Thus, when Alex disagrees with a student statement, his pitch remains in his normal pitch range or, if outside, descends into a lower register. On the other hand, in conflict situations, pitch (F0) levels (as well as F1 means) characteristically move into higher registers. This is apparent here, as Victoria initially and Mirabelle subsequently both fuel the articulation of difference as their pitch levels move higher and higher as if they were pushing each other on a swing that moves higher and higher. Conversely, our study (exemplified here in the conflict between Victoria and Mirabelle) shows that repeated contributions at lower pitch levels and speech intensity are correlated with a cooling effect that led to the ultimate defusing of conflict.
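The pattern summarized above can be restated as a simple rule that compares each turn with the previous one. The following is a minimal sketch, not the analytic procedure used in the study: the class `TurnProsody`, the function `classify_shift`, and all numeric values are hypothetical, and requiring all three parameters to move in the same direction is an assumption made for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnProsody:
    """Summary prosodic values for one conversational turn (hypothetical)."""
    speaker: str
    pitch_hz: float          # representative pitch (F0) of the turn
    intensity_db: float      # representative speech intensity
    rate_syll_per_s: float   # speech rate in syllables per second


def classify_shift(prev: TurnProsody, curr: TurnProsody) -> str:
    """Label a turn relative to the previous one.

    "heating up": pitch, intensity, and rate all exceed the previous turn.
    "cooling down": all three fall below the previous turn.
    "mixed": anything else.
    """
    deltas = (
        curr.pitch_hz - prev.pitch_hz,
        curr.intensity_db - prev.intensity_db,
        curr.rate_syll_per_s - prev.rate_syll_per_s,
    )
    if all(d > 0 for d in deltas):
        return "heating up"
    if all(d < 0 for d in deltas):
        return "cooling down"
    return "mixed"


# Hypothetical turns, loosely in the ranges mentioned in the text.
calm = TurnProsody("A", pitch_hz=200.0, intensity_db=72.0, rate_syll_per_s=5.0)
outburst = TurnProsody("B", pitch_hz=500.0, intensity_db=82.0, rate_syll_per_s=8.0)
soft_reply = TurnProsody("C", pitch_hz=280.0, intensity_db=68.0, rate_syll_per_s=4.0)

print(classify_shift(calm, outburst))        # heating up
print(classify_shift(outburst, soft_reply))  # cooling down
```

A rule of this kind ignores differences in speakers' normal ranges; a fuller treatment would first normalize each value against the speaker's own baseline.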

Transactions between Victoria and Mirabelle produce emotions with negative valence, that is, negative emotions that result from the emotion contests and conflict situations typical of face-work or dominance contests. It is apparent from the inflection of the voice, in addition to pitch and speech intensity, that Victoria is increasingly annoyed with what appears to be Mirabelle’s resistance to accepting the articulated trick as the best way to learn chemical valences. The annoyance (irritation, anger) is available to others in and through her expression that there is one correct solution and that the one Victoria has articulated is it. The unfolding transaction shows that Mirabelle takes the teacher’s stand as an outright rejection of her method, even before she has had the time to fully articulate it. This may be heard as a lack of attention to her ideas, which would constitute “a breach in the mutual commitment of the participants” (Goffman 1963, p. 90). When Mirabelle hears her method being questioned, she bursts out, her pitch tripling in value, the speech intensity quadrupling, her body movements vigorously oriented toward the teacher, the arm and hand aggressively moving forward and pointing toward the front. In all of these productions, which are resources for subsequent actions, Mirabelle articulates for others her emotional stance, which incorporates anger and readiness to defend herself against the experienced danger, whatever it might be. These productions require energy, which, as our measurements show, increases tenfold in Mirabelle’s prosodic articulation of frustration and irritation (anger). All of the productions require higher than normal levels of energy, so that one legitimately articulates the situation as charged and characterized by high levels of energy. 
Anger, in fact, is the capacity to mobilize the energy to overcome barriers to an ongoing effort; and in its most intense form, Mirabelle’s anger is an explosive reaction against the frustrations experienced in gaining acceptance for her method for remembering and recalling chemical valences. Mirabelle can be seen and heard as producing high levels of emotional energy with negative valence; and negative emotional energy is expressed largely by vocal means and gesture.

In this situation that drains significant amounts of emotional energy, the quiet, low-intensity and low-pitched contributions that several classmates produce are followed by Mirabelle’s lowering of the values of the same parameters. These productions, therefore, can be considered resources that oriented the class as a whole toward the current state of the conflict and toward the continued effort of defusing it. Allowing the participants to come out of the loggerheads by providing resources may be a form of commitment to the group and the group processes, that is, a form of solidarity that allows both parties in the conflict to cool down and resolve issues about understanding chemistry. In this school, conflictual situations often end differently, leading students to storm out or drive an altercation to such a level that they come to be sent from the classroom and may be suspended from school for a period of time (Roth et al. 2004b). In such situations, the prosody parameters stay at their elevated levels until a student’s removal from class abruptly ends the situation.

Prosody, entrainment, and alignment

A major question about how humans are able to make happen and bring off the specifically human forms of transaction that produce and reproduce everyday ordinary society has been answered in terms of the emotional valence of jointly lived situations and the emotional valence tied to the setting of goals and the likelihood of achieving them. Accordingly, theorists in the sociology of emotion and face-to-face transaction suggest that emotions, articulated and therefore communicated by prosodic and other observable verbal and nonverbal means, constitute the essential feature that binds individuals into a collective. Social binding, cohesion, and alignment derive from resemblances and “social similitude.” “Gluing” and cohesion and the alignment that results from it are said to be possible because multiple pitches and other rhythmic phenomena have the tendency to align themselves when the different instances are not too far apart. Physicists, who observed as early as the seventeenth century that two pendulum clocks close to each other on the same wall become aligned in their swings, use the concept of entrainment to explain such alignments. Sociologists, social psychologists, and anthropologists concerned with time and temporality of human transactions have borrowed this notion to describe the synchronization of human behavior. In the two preceding sections, we show how speakers align their pitch levels and pitch contours to come into tune with other speakers. During our ongoing ethnographic work we observe other rhythmic phenomena in classrooms, for example, in the movement of various body parts, that align themselves with rhythmic patterns articulated and exhibited by the speaker(s). This is the case even when a recipient does not or cannot see the speaker and therefore has no access to the gestures and facial expressions that often constitute the primary resources for accessing the emotions of others. 
To exemplify the presence of collectively enacted rhythmic patterns that align with those of the current speaker, we return to the classroom cotaught by Victoria and Alex.

After Mirabelle and Victoria have had their initial exchange about the method for remembering the valence associated with the atoms of a particular element, Mirabelle begins to articulate her method. (In terms of the overall event, episode 13 immediately follows episode 9.) Mirabelle orients toward the periodic chart of elements in the front of the classroom and moves her eyes from Victoria to the periodic table as she counts out the atomic numbers from three to six, from each of which she proposes to subtract the number two (Fig. 7). The speech intensity for the entire episode is depicted in Fig. 8, which also includes the words and numbers on which the spikes in speech intensity occur.

figure k
Fig. 7 Mirabelle produces a beat gesture ending in the forward position precisely with the utterance of the result of each calculation

Fig. 8 Mirabelle vocally produces a rhythm that she also produces gesturally; Gavin, who cannot see her, precisely reproduces the same rhythm

In this episode, Mirabelle presents her method of arriving at valence, orienting and staking out her own ground. She does so in a rhythmic way, bounded by the rises, with an initially increasing involvement (production of intensity) and then decreasing involvement as she gets into the zone on the periodic table where her method no longer works. At that time she reduces the power of the speech in the air (0.42 μW/m²) to less than one-twentieth of what it was during the outburst (9.6 μW/m²), the sound fades away, and the pitch drops. When the stressed and unstressed syllables are represented using a meter notation from poetry, the rhythmicity of the speech production is clearly evident:

figure l

In this situation, the ingredients for a focused encounter and the production of alignment are all present: bodily copresence, barrier to the outside, mutual focus of attention (periodic table of elements), and shared mood. There is a basic rhythm that underlies the production, coordination, and reproduction of social alignment (i.e., social similitude) within the classroom. Mutual focus and shared mood are linked through a feedback of intensification through rhythmic entrainment, a rhythm set as shown by vocal means. But Mirabelle produces rhythmic patterns by other nonverbal means, too. These features therefore become available for those who are in a position to see her—in addition to expressing the basic rhythm characterizing her emotional investment. Thus, the hand moves forward and reaches the foremost position exactly at the stressed utterance or vowel in the numbers, falling together with the intensity and pitch peaks (Fig. 8). That is, the cyclical hand movement visible in Fig. 7 is patterned such that the foremost position and the stressed syllables fall together.

Throughout the classroom, there are signs that show how others are in synchrony with Mirabelle, even when students are seated and oriented such that they cannot see her. This “beat” of her verbal production is available to others in audible form. With each number in the series three (lithium) to six (carbon), Mirabelle briefly glances to the periodic table, as if verifying what the next number is. Other students (e.g., Gavin, Tasha, Shawn [for seating see Fig. 4]) also move their heads and gazes to focus on the periodic table of elements. They move their gaze simultaneously with Mirabelle although they do not see her and although there is no indication in the speech content itself that suggests others ought to look at the table of elements. That is, the resource for producing this synchrony in orienting gaze is made available in one or more ways other than through visual coordination.

There are other signs of synchrony. For example, Gavin rocks his head slightly back and forth. As Fig. 8 shows, even though Gavin cannot see Mirabelle (Fig. 4), his rhythm perfectly (within the measurement error imposed by the 1/30th of a second video frame rate) reproduces the forward position of Mirabelle’s hand, which itself is aligned with the rhythm with which the account of her method is produced. Gavin also produces the identical rhythmic pattern with his right leg, which swings in a left-to-right motion matching in its extreme left position the foremost position of his head. Thus, when we mark the point in time when his chin reaches the foremost position, these temporal positions coincide with the forward position of Mirabelle’s hand, and the peaks in her speech intensity and pitch. When Mirabelle arrives at “and so on and so on,” Gavin stops rocking his head, the last coincident movement having been a very slight movement with a coincident closing of eyes following an upward movement of the head to direct his gaze to the periodic table. As there are no visual means that could have provided him with the resources for the coordination, prosody is the likely candidate that allows synchrony to emerge. We hypothesize that the students as a whole are aligned with Mirabelle in the production of her alternative and that they can empathize with Mirabelle who experiences Victoria’s actions as an affront.
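The claim that two rhythms coincide "within the measurement error" of the video can be made concrete as a tolerance check on event times. The sketch below is an illustration under stated assumptions: the function `coincide` and both lists of timestamps are hypothetical, the tolerance of one frame (1/30 s) follows the frame rate given in the text, and pairing events by their order in each list is a simplification.

```python
FRAME_S = 1.0 / 30.0  # duration of one video frame, per the frame rate in the text


def coincide(times_a, times_b, tol_s=FRAME_S):
    """Return True if paired events in the two series differ by at most tol_s.

    Events are paired by position, so both series must have the same length.
    """
    if len(times_a) != len(times_b):
        return False
    return all(abs(a - b) <= tol_s for a, b in zip(times_a, times_b))


# Hypothetical timestamps in seconds (not measurements from the study).
speech_peaks = [0.50, 1.10, 1.70, 2.30]    # peaks in speech intensity
head_forward = [0.52, 1.08, 1.73, 2.30]    # head reaching its foremost position
late_by_100ms = [t + 0.10 for t in speech_peaks]

print(coincide(speech_peaks, head_forward))    # True: every pair within one frame
print(coincide(speech_peaks, late_by_100ms))   # False: offset of three frames
```

A check of this kind cannot distinguish anticipation from reaction when both fall inside one frame, which is the limit of resolution the text acknowledges.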

In the following episode, the production of patterns of synchrony is evident when the students are entrained into Victoria’s re-articulation of her “short-cut trick.” The episode follows a first presentation by Victoria about how to find out the valence electrons from the columns of the periodic table. Mirabelle has proposed to continue counting after the end of the first row, attributing to sodium (Na) the number 9. Here, Victoria then reiterates that the maximum number for calculating the valence of an element is 8. In the course of episode 14, two students and Victoria suggest to Mirabelle that for the rows where there are metal elements in the center of the periodic table (for example, scandium to zinc), she has to “skip the middle,” meaning that in this part of the periodic table, counting is suspended. The students, who assist Victoria in articulating an explanation for dealing with Mirabelle’s problem, are in fact aligned with her semantically; and this alignment is articulated also at the nonverbal level where rhythms are enacted in unison with others.

figure m

As the teacher utters, “This column is one,” Tasha, who already has nodded repeatedly affirming Victoria’s statements immediately prior to the beginning of episode 14, moves slightly backwards then forwards, her lips silently forming “one” as her head reaches the foremost position (Fig. 9). Similarly, her head movement and silent lip formation parallel the utterance of “two.” Her hand then moves away from the body, fingers stretch out and point forward toward the periodic table. In synchrony with the teacher’s hitting the chalkboard with the chalk, Tasha verbalizes “three” aloud. In this one instance, the teacher’s utterance of “three” actually is uncoordinated with her beat of the chalk against the chalkboard, but the student is coordinated with it precisely. As the teacher counts out “four,” “five,” “six,” “seven,” and “eight,” Tasha moves her lips forming the words and simultaneously enacts a beat gesture, the hand reaching the down position precisely (within the 1/30th second of accuracy that the video allows) with the words. Behind Tasha, Ivory and another student begin to count at “three” and Ivory sharply nods her head in unison with each uttered word, the chin reaching the forward position precisely at the emphasized syllable. Other students in the class also fall into unison counting beginning with the number three. It is as if the two initial utterances had provided the resources for students to capture the rhythm so that they could produce their articulations of the number in unison with Victoria. Here, too, there is a mutual focus object, the periodic table of elements. The beat may serve to communicate the attention to the common object and by communicating this common attention, teacher and students become mutually aware of their common focus. 
Apart from social alignment and positive emotional energy, the situation produces shared cognitive experience, that is, understanding of a chemical concept as a sacred object (i.e., symbol of social relationship). It leads to the kinds of episodes the student researchers in our group selected as successful teaching and learning.

Fig. 9 Tasha enacts a beat gesture in synchrony with the teacher’s counting and her action of hitting the chalkboard with chalk

In these episodes and many other similar ones in our database, alignment and synchrony are observed beyond the pitch levels and contours presented in the earlier sections. Rhythmic patterns of beat gestures with the hand, rocking movements of legs, rocking head movements, beat of chalk against the chalkboard, and stressed syllables are produced and reproduced by members across the classroom who are in synchrony in the same way as are Schutz’s musicians. Not only members who have the speaker in view produce and reproduce the synchrony (e.g., students counting aloud in synchrony with the teacher) but also members who do not have the current speaker in view. It is important to note that this synchrony therefore is not mere cognitive alignment (something that is processed consciously, interpreted, and applied) but in fact is an embodied and unconscious phenomenon. The body that reproduces the beat has to act in a way that anticipates the occurrence of the driving source frequency. Such rhythmic patterns can be interpreted as baseline patterns of interaction rituals that lead, among other outcomes, to solidarity, especially when considering the service-type relations between teachers and students, in which teachers are in a position to help students (their clients) to learn.

In these analyses, synchrony is a resource for all co-present individuals to experience and recognize alignment and agreement. Thus, the synchrony between Tasha’s counting and movement and Victoria’s rhythmic patterns is consistent with Tasha’s repeated head nods following Victoria’s statements about how to remember and recall atomic valences. Social alignment, that is, being aligned in some respect with others, common and communal understanding, here is articulated publicly, available for all to use as a resource in subsequent actions. Social alignment, then, not only is communicated by means of agreements in the cognitive content of statements but also, and perhaps more importantly, by a variety of rhythmic bodily means. From an analytic perspective, therefore, individual participants orient themselves and each other to existing public exhibits of (collective) alignment but also concretely realize individual exhibits of alignment. This phenomenon has positive emotional valence, which in turn is expressed and re-expressed in public exhibits involving prosodic means and a variety of body movements. It is out of this alignment that other students are positioned to produce resources that allow Mirabelle, for example, to calm down prior to producing an altercation of the kind so prevalent in inner-city schools, altercations that frequently result in the suspension of students and their exclusion from schooling.

Emotions, prosody, and power/status

Emotions are central to the ways in which human beings orient to situations; and they are modified by the articulation of emotions of other participants in face-to-face transactions. Besides facial expression, prosody is, as we show here, a major pragmatic resource for displaying and experiencing emotions in a setting. There have been suggestions that participants with less power and status converge in their prosodic parameters to align with those of more power and status; and differences and conflict between transaction participants having roughly equivalent power and status (e.g., children participating in hopscotch games) are characterized by very high values and differences in the pitch levels of participants. The present study augments the existing literature by contributing descriptions of prosody as a transactional resource that is deployed pragmatically (rather than deterministically and causally) and subject to the purposes at hand. Following others, we view prosody as but one aspect of the production of communication all subordinated to the same activity at hand and, therefore, being different expressions of the same societal-psychological or ideological unit. That is, sounds (words), prosody, body position, hand gestures, and other communicative resources articulated at some point in time are different expressions of the same underlying orientation, emotional valence, and meaning unit (Roth and Pozzer-Ardenghi 2006). The meaning of vocal gestures therefore does not lie behind them but is “intermingled with the structure of the world outlined by the gesture” so that “the smile, the relaxed face, gaiety of gesture really have in them the rhythm of action, the mode of being in the world which are joy itself” (Merleau-Ponty 1945, p. 216).

Summarizing our ethnographic evidence, we construct four major empirically grounded claims that can be used as hypotheses to be tested further in cross-site and cross-setting “confirmatory ethnography” or by quasi-experimental studies. First, independent of the power and status differentials, speakers in non-conflictual situations use lowered pitch registers to express difference in content and reluctance to submit to the normal turn-taking routines (e.g., the question–answer sequence). This effect is especially visible when the normal pitch ranges of the two interlocutors are sufficiently different and do not overlap. On the other hand, accommodation and acceding to requests are associated with pitch matching. Second, when two or more teachers coteach the same classes over time, the new teachers tend to “pick up” speech forms, grammatical patterns, and prosodic features from the regular classroom teacher. Initial differences between the pitches produced by new teachers (who take over a speaking turn from students) tend to become continuous in the same way that they are for the (often-seasoned) regular classroom teacher. In non-conflict-laden classrooms generally associated with feelings of solidarity, teachers tend to express differences in terms of lowered pitch levels. Inexperienced teachers and teachers in conflictual relations with their classes, on the other hand, display differences in terms of raised pitch levels. Third, raised pitch levels and speech intensity, when differences in the cognitive (semantic) content of utterances are apparent, tend to “heat up” the classroom atmosphere, whereas contributions at lower pitch levels and speech intensities are associated with a “cooling down” of the situation, that is, they are resources for defusing conflict. Fourth, even in the absence of visual cues, rhythmic patterns of a speaker are reproduced by other participants in the classroom and are displayed prosodically and in a variety of body movements (e.g., rocking head, hand beat, rocking leg, pen hitting desk, chalk hitting chalkboard).

The results of this study suggest the need to theorize the production and reproduction of prosody differently than in previous research. Rather than viewing power and status as factors that determine pitch levels and convergence, we find the production of pitch levels, pitch continuation, pitch level repetitions, and so on to be associated with difference/resistance and accommodation. The concept of social alignment denotes high levels of unity or agreement. Convergence in prosodic parameters among two or more participants is an expression of, and serves as a resource for, the further production of alignment, itself an expression of the emotional synchrony of participants. In cases of conflict, some members of the collective may, through the production of lower pitch levels and speech intensities, assist an angered, excited, or animated member to “cool” or calm down.

Rather than students matching the pitch of their teacher, which would have been expected based on existing power/status theory, teachers in the non-conflict classrooms of our study tend to produce pitch continuations to match the pitches of their students prior to returning to their own, normal levels during extended turns at talk. New teachers in such classrooms, that is, those in non-conflict relations with their master teacher, fall in line with the pitch matching pattern, frequently leading to long exchanges in which the ending pitch of one speaker flows into the opening pitch of the subsequent speaker. This usually unconscious move on the part of the teacher may find its explanation in the same concept of social alignment, here exhibited by the regular teachers with extended experience in the school, who tend to adapt their ways of speaking to align with the ways in which their students speak (i.e., taking account of home and street cultures).

We find social alignment—as a phenomenon that participants continuously produce and reproduce—also within the student body. This alignment occurs in a variety of ways, including prosody. We understand these alignments as possible sources for solidarity that the students experience within their peer group; we also understand the alignments as sources for the sense of solidarity that we observed among some teachers and their students. Thus, the synchronous rhythmic features simultaneously found at various places in the classroom suggest that the students are “in tune” or “in sync” with one another. They also express anticipation of particular events, such as when numerous students turn around to face Mirabelle as if they had predicted a likely pattern of response, that is, as if they “saw something coming.” Anticipation inherently means actively understanding the situation in a phenomenological sense, that is, an intuitive lived sense of what is happening and what dangers might loom ahead rather than a reflective understanding. At the same time, these various signs, observable from Mirabelle’s position, may have been resources encouraging her to take up the challenge and propose her alternative description of remembering and recalling valences.

Prosody is an important phenomenon for producing entrainment, a driving force within the set of interaction ritual ingredients that have solidarity and positive emotional energy as part of their ritual outcomes. But in the dialectic of agency and structure, we understand such outcomes also to be resources in and for the ingredient side where feedback intensification through rhythmic entrainment plays a crucial role in the production and reproduction of mutual focus and transient emotions. Entrainment is not describable in terms of a causal link, as the production of synchrony, other than in a mechanical system, requires anticipation. Thus, in one instance, Mirabelle produces a particular rhythm, which Gavin also displays. It is not that Mirabelle’s rhythm causes Gavin to rock in the same rhythm, because Gavin, looking toward the front of the classroom, has no other resource than Mirabelle’s voice. If he had to consciously attend to making his rocking coincide with her activity peaks (intensity, pitch), he would be out of synchrony by something on the order of a second or two. Even if there were non-conscious ways of causing synchrony, he would still be behind her, for he could only know when Mirabelle’s pitch peaks after having heard it. This means that Gavin has to anticipate peaks, which is a production of his own, requiring that he already is in tune. The synchrony of his movement with Mirabelle’s prosody is a consequence of being in tune. Such an anticipation clearly is observable at the instance when Tasha utters “three,” just as Victoria hits the board with the chalk but prior to the latter’s utterance of the same number word, in contrast to all other instances, where number word and the hitting of the board fell together. That is, Tasha anticipates the correct placement of the count with respect to the noise from the chalk, but Victoria, who produces the chalk noise, is out of alignment with her speech, which only follows the rhythm of the chalk beats.
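The timing argument above can be made concrete with a minimal sketch. It assumes that peak times of a speaker's prosody and of a listener's movement have already been extracted, in seconds, from audio and video. The function names, the example times, and the 0.2 s delay are hypothetical illustrations and are not data or procedures from this study; only the 1/30 s video resolution comes from the text. A listener who merely reacted to each peak would trail it by more than the frame resolution, whereas an anticipating listener falls within it.

```python
# Hypothetical sketch: compare movement peaks against prosodic peaks.
# All times below are invented for illustration, not measured data.

FRAME_S = 1 / 30  # temporal resolution of the video, as stated in the text


def lags(source_peaks, follower_peaks):
    """For each source peak, return the signed offset in seconds of the
    nearest follower peak (positive means the follower comes later)."""
    return [
        min(follower_peaks, key=lambda t: abs(t - s)) - s
        for s in source_peaks
    ]


def in_synchrony(source_peaks, follower_peaks, resolution=FRAME_S):
    """True if every follower peak falls within the given resolution
    of its source peak, i.e. no lag is detectable at that resolution."""
    return all(abs(d) <= resolution for d in lags(source_peaks, follower_peaks))


# A speaker's stressed syllables, one every half second (invented values).
speaker = [0.0, 0.5, 1.0, 1.5]

# A listener whose movement peaks coincide with the speaker's peaks.
anticipating = [0.01, 0.52, 0.99, 1.5]

# A listener who responds only after hearing each peak
# (the 0.2 s delay is an assumed, illustrative value).
reacting = [t + 0.2 for t in speaker]

print(in_synchrony(speaker, anticipating))
print(in_synchrony(speaker, reacting))
```

The sketch pairs each source peak with its nearest follower peak, a simplification that is adequate only when the lag is small relative to the interval between beats.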

Beyond conventional wisdom

In this study, we use different data sources (ethnographic observation, speech analysis on different prosody parameters, videotape) to provide convergent evidence for the use of prosody as a resource in the production and reproduction of teaching and learning and therefore in the production and reproduction of a particular aspect of society. Clearly, given the conflictual situations we observed, our approach to social and societal phenomena is non-deterministic, allowing us to understand that fields and structures survive the occasional disturbances: one student–teacher argument over valence does not destroy institutional relations or the activity system but it produces modifications. In schooling, these modifications may be sufficient to lead to success and failure for both student (failing a course; dropping out of school) and teacher (attrition; stress with costs to the health-care system). The ethnographic, video-graphic, and sound-analytic evidence supports assertions that prosody and other rhythmic phenomena are a means for signaling alignment, difference, and emotional states across seeming boundaries of cultural origin, gender, and institutional position and associated control/power dimensions (teacher/student and experienced/novice teacher). The phenomena are not merely events of interest to microsociologists, but in fact produce and reproduce societal forms of activity (here schooling generally and school science specifically), which we show in their emergence at micro- and mesolevel grains of analysis. We suppose that the phenomena we describe here can in fact be used for studying other settings of interest to sociologists, such as the making and remaking of managerial authority or the continuous production and reproduction of various forms of institutional talk through talk in institutions. 
The methods we use work particularly well in the context of other microsociological approaches that draw on conversation analysis, because the different prosodic means constitute resources and constraints for making societally motivated forms of activity in and through face-to-face encounters. We therefore do not think of ourselves as microsociologists, but rather, we see microsociology generally and conversation analysis particularly as but one aspect of an approach that integrates micro-, meso-, and macrolevels in social analysis.

The conventional wisdom about teaching and much of the research in teacher education and policy initiatives are consistent with the cliché “don’t smile until Christmas.” In essence teachers are exhorted and often required to establish control over their students and maintain a quiet class—assumed to be effective. However, especially in classrooms in which diverse forms of culture are enacted by students and their teachers, the alignments needed to produce productive learning environments might be difficult to attain if teachers are expected to exercise power over students, providing structures to afford optimal forms of participation. In fact, students may feel that a teacher’s practices “shut them down” and frustrate their efforts to meet their goals. As our ethnographic work shows, teaching practices that attempt to control students often are perceived by the latter as disrespectful and thereby become resources for resistance, including disruption and refusal to engage. Perhaps the fact that teachers are unsuccessful in their efforts to attain control over students is a reason so many studies in urban classrooms present portraits of dysfunction—students respond by showing their disrespect for the teachers who seek to control them.

In this study we show how successful transactions and the alignment of successive speakers’ prosody go hand in hand. Salient in this study were pitch (F0), amplitude, and power in the air; and, to a lesser extent, there also is convergent evidence from the less-researched measure, F1 mean. Although what we have learned is no doubt contingent on larger segments of time, at a microlevel successful transactions appear to be related to a given speaker either aligning with or being less than the previous speaker in the magnitude of pitch, amplitude, and power in the air. We observe that when speakers speak with higher magnitudes on these prosody indicators (voice is much louder, faster, higher-pitched) the environment generally becomes heated and provides resources for asynchrony in terms of emotion and cultural enactment. Hence, counter to the conventional wisdom, teachers might be encouraged not to raise their voice to assert power over students. On the contrary, they might be encouraged to “speak under” their students and in like manner, students might become conscious of the role of prosody in creating and maintaining an emotional climate and associated environments in which learning can occur most readily.
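The comparison between successive speakers described above can be sketched in a few lines. This is an illustration only: it assumes that mean pitch and mean intensity have already been measured for each turn, it uses two of the indicators named in the text rather than all of them, and the `Turn` type, the tolerance values, and the example numbers are hypothetical rather than taken from the analyses reported here.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    pitch_hz: float      # mean F0 over the turn
    intensity_db: float  # mean speech intensity over the turn


def relation(prev, curr, pitch_tol=0.10, db_tol=3.0):
    """Classify the current turn relative to the previous one.

    pitch_tol is a relative tolerance and db_tol an absolute one; both
    are assumed, illustrative thresholds, not values from the study.
    Returns 'aligned', 'under', 'over', or 'mixed'.
    """
    def sign(delta, tol):
        # 0 when within tolerance, otherwise the direction of change
        if abs(delta) <= tol:
            return 0
        return 1 if delta > 0 else -1

    p = sign((curr.pitch_hz - prev.pitch_hz) / prev.pitch_hz, pitch_tol)
    i = sign(curr.intensity_db - prev.intensity_db, db_tol)

    if p == 0 and i == 0:
        return "aligned"
    if p <= 0 and i <= 0:
        return "under"
    if p >= 0 and i >= 0:
        return "over"
    return "mixed"


# Invented example values for a student turn followed by a teacher turn.
student = Turn("student", pitch_hz=300.0, intensity_db=75.0)
print(relation(student, Turn("teacher", 220.0, 68.0)))
print(relation(student, Turn("teacher", 310.0, 76.0)))
print(relation(student, Turn("teacher", 380.0, 82.0)))
```

In this sketch a turn counts as “speaking under” when no indicator rises and at least one falls beyond the tolerance. Other operationalizations are equally plausible, and the choice of thresholds would need to be justified against actual recordings.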

In summarizing what we have learned from this study we do not assert that speakers always should align their prosody or “speak under” the previous speaker. There have been numerous situations in which Alex, for example, used prosody and body movement to establish a rhythm to the classroom that participants experienced to be of a high-energy type. In so doing he created structures for students who seemed to easily resonate with high-energy teaching, while making it difficult for those who seemed to prefer less energetic forms of participation. We offer this nuance as a hedge against overgeneralizing what we have learned and to argue for a need for additional research into what might be thought of as classroom rhythms. Perhaps a preference for tuning into classroom rhythms will be yet another form of diversity that teachers and students need to be aware of as educators seek ways to improve the quality of teaching and learning.

Finally, it is clear that teaching, as a caring profession, necessitates forms of teacher–student transactions that afford participation and learning. Ultimately, we might expect more learning to occur when students and teachers find themselves in, and collaboratively construct and control, environments in which they have a sense that they are “in all of this together,” that is, when there is a sense of solidarity (Roth 2007b). The classroom is a field quite different from the hospital ward, for example. Those of us who have been patients in a hospital ward may have experienced doctors with a good bedside manner and others who are more interested in dispensing their expertise and moving on than in taking the time to engage with patients. It would not be surprising if the prosody of doctor–patient transactions in a ward were quite different from that of teacher–student transactions in a classroom or of a TV personality interviewing the President’s press secretary. What seems likely, however, is that prosody alignments are associated with the extent to which synchrony emerges within groups and the subsequent entrainment that can produce solidarity, identity changes, positive emotions, and increased success in meeting collective goals.