Main

The world presents itself to us as a series of sensations arising from our own actions, which in turn elicit further actions in an intricate sensorimotor loop. Orofacial sensorimotor control is essential for exploration, communication and survival, and is exquisitely orchestrated11,12,13,14. To investigate the cortical control of complex orofacial movements, we trained head-fixed mice to use sequences of directed licks to advance a motorized port through seven consecutive positions, either from left to right or right to left, after an auditory cue (15 kHz for 0.1 s) that signalled the start of a trial (Fig. 1a, Supplementary Video 1). Each transition from one position to the next was driven in a closed loop by a single lick touching the port. Thus, if a lick missed the port, it would remain at the same position until the tongue eventually made contact. The port was no longer moveable after the mouse finished the seven positions and a water droplet was delivered as a reward after a short delay (0.25 s, or 0.5 s in two mice). The next trial then started with a sequence in the opposite direction after a random inter-trial interval (mean duration of 6 s).

Fig. 1: Sequence licking task.

a, Schematic of the (standard) sequence licking task. ITI, inter-trial interval. b, Schematic of the contact force measurement and high-speed (400 Hz) videography in relation to a head-fixed mouse. c, Schematics of the bottom view of a mouse licking at the water port (top). Zoomed-in views (5 × 5 mm) of example high-speed video frames are also shown (bottom). Vectors overlaid in red are outputs from the regression deep neural network and point from the base to the tip of the tongue. Tongue length (L) is defined by the vector length. Tongue angle (θ) is the rotation of the vector from the midline. The red shading depicts tongue shape. d, Time series of task events and behavioural variables during an example trial. Variables recorded from the force sensors include the vertical lick force (Fvert; positive acts to lift the port up) and the lateral lick force (Flat; positive acts to push the port to the right). Kinematic variables including L, its rate of change (L′) and θ were derived from high-speed video. Periods of tongue–port contact are shaded in grey and are numbered sequentially. R3, R2, R1, Mid, L1, L2 and L3 indicate the seven port positions from the rightmost to the leftmost. Missed licks are indicated at bottom by up-arrows. e, Transition diagram depicting the two standard sequences. Darker arrows from right to left correspond to the example trial in d. f, Transition diagrams depicting sequences with backtracking (green arrows). Darker arrows in each diagram correspond to the  row-matched example trials in g. g, Example trials of a left-to-right sequence (top) and a right-to-left sequence (bottom) where the port backtracked (green arrows) when a mouse touched Mid. Licks including both touches and misses are indexed with respect to the lick at Mid. Missed licks are indicated at top by down-arrows.

We measured instantaneous tongue angle (θ), tongue length (L), vertical and lateral components of contact force (Fvert and Flat), and contact duration during sequences (see Methods; Fig. 1b–d, Extended Data Fig. 1a–d). In addition to the continuous θ measurement, we use a scalar angle value, θshoot, to denote the angle at which the tongue shoots out in each lick (see Methods), and capital Θ to represent unified tongue angles after the sign in right-to-left sequences is flipped so that data from both sequence directions can be pooled.
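The definitions of L, θ and Θ can be illustrated with a minimal sketch. It assumes that the base and tip of the tongue are given as (x, y) points with the midline along +y and positive angles to the right; this coordinate convention and the function names are illustrative assumptions, not the pipeline described in Methods.

```python
import math

def tongue_length_angle(base, tip):
    """Return (L, theta_deg) for the vector from tongue base to tongue tip.

    Assumed convention: +y is the midline, +x is to the right, so
    theta > 0 means the tongue points to the right of the midline.
    """
    dx = tip[0] - base[0]
    dy = tip[1] - base[1]
    length = math.hypot(dx, dy)
    theta = math.degrees(math.atan2(dx, dy))  # rotation away from the midline
    return length, theta

def unify_angle(theta_deg, direction):
    """Flip the sign of theta in right-to-left sequences to obtain capital Theta."""
    if direction not in ("LR", "RL"):
        raise ValueError("direction must be 'LR' or 'RL'")
    return -theta_deg if direction == "RL" else theta_deg
```

With this convention, a lick at 30° in a right-to-left trial is pooled as Θ = −30°, so that matching steps of the two sequence directions share the same sign.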

Mice modulated each lick to reach different target locations (Extended Data Fig. 1e, f). In addition to stereotypic licking kinematics, expert mice showed remarkable sequence execution speed, with the seven positions completed in about 1 s (Extended Data Fig. 1h). Mice performed the task in darkness with no visual cues to guide the licks. Control experiments (Methods) showed that mice did not rely on auditory (Extended Data Fig. 1i) or olfactory (Extended Data Fig. 1j) cues, but did require tactile feedback from the tongue (Extended Data Fig. 1k). Mice reached proficiency in standard sequences (Fig. 1e) after approximately 1,500 trials of training (Methods; Extended Data Fig. 1l–n).

To determine whether sequence generation was ‘ballistic’ or capable of flexible reconfiguration based on sensory feedback, we varied the task by introducing unexpected port transitions after mice learned standard sequences (Fig. 1f, Supplementary Video 2). On a randomly interleaved subset (one-third or one-quarter) of trials, when a mouse licked at the middle position, the port would backtrack two steps rather than continue to the anticipated position. Mice previously trained only with standard sequences (Methods; Extended Data Fig. 2a, b) learned to detect the change of port transition, branch out to the new position and finish the sequence (Fig. 1g, Extended Data Fig. 2c, d). On average, it took one to two missed licks before mice quickly relocated the port (Extended Data Fig. 2e). Head-fixed mice can thus learn to perform complex and flexible licking sequences guided by sensory feedback.

Optogenetic inhibition screen

To determine which brain regions contribute to the performance of our sequence licking task, and at which points during execution, we performed systematic optogenetic silencing6. In different sessions, bilateral inhibition was centred at each of five regions (Fig. 2a, Extended Data Fig. 3a): the anterolateral motor (ALM)15 cortex (also including part of the M1TJ cortex (ALM–M1TJ hereafter)), the body region of the primary motor (M1B) cortex16,17, the S1TJ cortex2,18, the barrel field of the primary somatosensory (S1BF) cortex and the trunk subregion of the primary somatosensory (S1Tr, including part of the posterior parietal cortex) cortex. For each region, inhibition was triggered with equal probability (10%) at sequence initiation, mid-sequence or at the start of water consumption (Extended Data Fig. 3b). Stimulation at mid-sequence and at consumption was triggered in closed loop by the middle touch and by the first touch after water delivery, respectively.

Fig. 2: Optogenetic inhibition and single-unit activity survey across cortical regions during sequence execution.

a, Schematic showing the dorsal view of a mouse brain. Overlaid spots in blue shading depict the five bilateral pairs of sites for illumination of the target cortical regions. Bregma is marked by a 1 × 1 mm crosshair. PPC, posterior parietal cortex. b, Summary of changes in licking kinematics resulting from bilateral photoinhibition of each area, quantified across all three inhibition periods (Methods). Plots summarize the quantifications shown in Extended Data Fig. 3j. The dot colour depicts the amount of increase (red) or decrease (blue) in the indicated behavioural variable for trials with photoinhibition compared with those without. The dot size represents the level of statistical significance. Changes with P > 0.05 are not plotted. Two-tailed hierarchical bootstrap test with Bonferroni correction for 15 comparisons. n = 7 mice. c, Summary of changes in lick rate resulting from bilateral photoinhibition of each area during either sequence initiation (left) or sequence termination (right). Plots summarize the quantifications shown in Extended Data Fig. 3k (top and bottom rows). Conventions and statistical tests are as in b, but with Bonferroni correction for 30 comparisons. d, Silicon probe recording during the sequence licking task (left). Histologically verified locations of the silicon probe recordings are also shown (right). S1L, limb region of primary somatosensory cortex. e, Normalized PETHs of all S1TJ neurons (n = 141 neurons) plotted as heatmaps, aligned to three periods in each sequence direction. Neurons are grouped by functional clusters (see main text) and labelled by colour bands. f, Same as e, but for all M1TJ neurons (n = 233 neurons). g, Same as e, but for all ALM neurons (n = 329 neurons). h, Same as e, but for all S1BF neurons (n = 55 neurons). 3D brains in a and d were produced from the Allen Mouse Common Coordinate Framework in the Brain Explorer 2 software, Allen Institute for Brain Science.

Somatosensory inputs both provide information about external objects and enable proprioceptive sensing of the position of the body19 for motor control20,21. Missing sensory feedback can make effortless manipulations surprisingly difficult despite unchanged motor capability22. Normally executed sequences were stereotyped across trials. Therefore, in a given time bin during the sequence, across-trial variability in lick angle (quantified by the standard deviation of Θshoot, or SD(Θshoot)) was relatively low. When the S1TJ cortex was inhibited, however, sequences became disorganized and no longer stereotyped (see examples in Extended Data Fig. 3g, Supplementary Video 3). As a result, SD(Θshoot) increased significantly compared with no inhibition (Fig. 2b, left). Despite disorganized targeting, the ability to direct licks to the sides (that is, |Θshoot|) was uncompromised (Fig. 2b, middle). Inhibition of the S1TJ cortex also did not shorten the length of licks (Fig. 2b, right), although slight but statistically significant increases were observed. Full quantifications of data summarized in Fig. 2b appear in Extended Data Fig. 3j. Together, these data suggest that inhibition of the S1TJ cortex left the core motor capabilities that are required for tongue protrusions and licking intact, but corrupted their proper targeting, possibly due to missing sensory feedback.

By contrast, when inhibiting the ALM–M1TJ cortex, mice had reduced ability to direct licks to the sides (Fig. 2b, middle; see example in Extended Data Fig. 3h), and showed decreased length of lick (Fig. 2b, right). Inhibition of the M1B cortex caused only minor increases in lick angle variability with no decrease in angle deviation or lick length. Inhibition of the S1BF or S1Tr cortex changed no aspects of lick control.

The ALM cortex has been shown to be important in motor preparation of directed single licks to obtain water reward10,15,23. Here we found that inhibiting the ALM–M1TJ cortex at sequence initiation strongly suppressed production of licking sequences (Fig. 2c, left, Supplementary Video 4). In four of seven mice, licks were largely absent (Extended Data Fig. 3k, top panel under ALM–M1TJ). Inhibition of the S1TJ cortex caused more moderate suppression, with no obvious change from inhibiting other regions. When applied at mid-sequence, inhibition of the ALM–M1TJ cortex also suppressed the production of licks, although less strongly. Inhibition of other regions at mid-sequence showed little or no effect. Full quantifications appear in Extended Data Fig. 3k (top and middle rows).

When a sensorimotor sequence reaches its normal stopping point, one might intuitively expect movement to cease in a passive rather than active manner. To our surprise, when inhibiting the S1TJ or M1B cortex at water consumption, mice were impaired at stopping ongoing sequences (Fig. 2c, right, Extended Data Fig. 3k, bottom row; see example in Supplementary Video 5). This prolonged licking was not due to additional attempts to reach the port for water, as mice continuously made successful contacts, nor did we inhibit the water-responsive gustatory cortex24,25.

To test the possibility that inhibition of the S1TJ or M1B cortex caused persistent lick bouts due simply to spread of inhibition to other regions, we repeated the above experiments with half the illumination power (2 mW) (Extended Data Fig. 3l, m). The effects of ALM–M1TJ inhibition on sequence initiation, tongue length and angle control, and of S1TJ inhibition on angle control, remained largely consistent with, although weaker than, our previous results using higher power (4 mW). At consumption, inhibition of the S1TJ or M1B cortex resulted in similarly strong deficits in terminating ongoing sequences (Extended Data Fig. 3m, bottom row). Therefore, the observed deficit in sequence termination was not due to spread of inhibition. Rather, our results indicate that sequence termination is an active process26 mediated collectively by the S1TJ and M1B cortices.

Sequence tiling of single-unit responses

We used silicon probes to record from multiple brain regions from both hemispheres (Fig. 2d) during the task, obtaining 1,537 single units and 303 multiple units (Methods; Extended Data Fig. 4a–e) from 57 recording sessions. Perievent time histograms (PETHs) of single-unit spiking (Fig. 2e–h; example neurons are shown in Extended Data Fig. 4f–h) exhibited a wide variety of patterns before, during and after sequence execution. Spiking that gave rise to the PETHs was consistent across trials (Methods; Extended Data Fig. 4i). To present these PETHs in a way that reflects the main themes observed in the population activity, we pooled neurons from all brain regions and clustered their PETHs using non-negative matrix factorization (Methods).

We observed that single-neuron responses tile the sequence progression (Fig. 2e–h, Extended Data Fig. 4k), with more ALM neurons tuned to sequence initiation (Extended Data Fig. 4l). The S1TJ and M1TJ cortices contained more neurons (for example, cluster 7; Extended Data Fig. 4k, m) that showed greater modulation by individual licks (Extended Data Fig. 4n, o). Patterns of activity arising from these single-unit responses might encode behavioural variables that are important for sequence control.

Hierarchical population coding

Our sequence licking task requires the brain to encode instantaneous tongue length (L) and angle (θ), presumably both for motor output and sensory feedback. Encoding of velocity (L′) could also be used to indirectly control tongue position. Sequence identity (I) and relative sequence time (τ) can be used to represent the sequence-level organization of individual licks beyond instantaneous control. The variable τ can also serve as a proxy for sequence progress or ‘distance to goal’. The five behavioural variables, L, L′, θ, I and τ, were measured (or derived) at 2.5-ms resolution (Fig. 3a). Conveniently, any pair of these variables was uncorrelated (Extended Data Fig. 5b). Therefore, being able to encode one is of little or no help with encoding any other.

Fig. 3: Neuronal populations code with increasing levels of abstraction across cortical areas.

a, Time series of the behavioural variables (mean ± 99% bootstrap confidence interval; n = 2,684 trials) for different sequence types. Time points in which more than 80% of trials had no observations are not plotted. LR, left to right; RL, right to left. Vertical dashed lines indicate times 0 s and 0.5 s on the x axis. b, Decoding of the five behavioural variables (rows) from populations recorded in the S1TJ, M1TJ, ALM and S1BF cortices (columns). Cross-validated R2 values for each region and variable are given (mean ± s.d.). n = 8 sessions (S1TJ), n = 9 sessions (M1TJ), n = 13 sessions (ALM) and n = 5 sessions (S1BF) for all graphs in Fig. 3 unless otherwise noted. Same plotting conventions as in a. c, Bars show the means of R2 values from b. Circles show R2 values for individual sessions. d, Neural trajectories from the ALM cortex (mean) during standard sequences (linked by dashed lines). The arrows indicate the direction of time. Decoded trajectories (darker thick curves) are overlaid with trajectories (lighter thick curves) in the space of the top three principal components (PCs), after a linear transformation. A projection into the I–θ plane is depicted with thinner and lighter curves. e, Mean canonical correlation coefficients (r) for each neural population (grey traces) across three conditions. The average mean r values for each condition are shown in black. *P < 0.001, not significant P > 0.05 otherwise, paired two-tailed permutation test. f, Classification of standard versus backtracking sequences from population activity. Accuracy is the fraction of trials correctly classified (mean ± 95% hierarchical bootstrap confidence interval; n = 6 sessions (ALM)). Coloured traces and error shadings are from original data; black traces and shadings are from data with randomly shuffled trial labels. The average time series of tongue length are overlaid (grey traces) to show the concurrent behaviour. The dashed grey traces indicate licks that unexpectedly missed the port as a result of the port backtracking.

For each recording session, we performed separate linear regressions (Methods) to obtain unit weights (and a constant) for each of the five behavioural variables, such that a weighted sum of instantaneous spike rates from simultaneously recorded units (32 ± 13 units; mean ± s.d.) plus the constant best predicted the value of a behavioural variable. We used cross-validated R2 values to quantify how well the recorded population of neurons encoded each behavioural variable27.
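The decoding scheme described above, a weighted sum of unit spike rates plus a constant scored by cross-validated R2, can be sketched as follows. This is a minimal illustration using ordinary least squares and interleaved folds on made-up data; the fold structure, any regularization and all function names are assumptions, and the actual procedure is the one given in Methods.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit(X, y):
    """Least-squares unit weights; the last entry is the constant."""
    Xa = [row + [1.0] for row in X]
    p = len(Xa[0])
    A = [[sum(r[i] * r[j] for r in Xa) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(Xa, y)) for i in range(p)]
    return solve(A, b)

def predict(w, X):
    """Weighted sum of rates plus the constant, for each time bin."""
    return [sum(wi * xi for wi, xi in zip(w, row)) + w[-1] for row in X]

def cross_validated_r2(X, y, k=5):
    """R^2 computed on predictions pooled from k held-out folds."""
    preds = [0.0] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        test = [i for i in range(len(y)) if i % k == fold]
        w = fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(test, predict(w, [X[i] for i in test])):
            preds[i] = p
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def synthetic_session(n_bins=60, n_units=3, seed=0):
    """Made-up rates and a variable that is an exact linear readout of them."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(n_units)] for _ in range(n_bins)]
    y = [2.0 * row[0] - row[1] + 0.5 for row in X]
    return X, y
```

Because R2 is computed on held-out bins, a variable that the population does not encode yields a value near or below zero rather than one inflated by overfitting.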

The five behavioural variables were decoded from population activity on a single-trial basis (see examples in Extended Data Fig. 5c, d). Overall, the S1TJ and M1TJ cortices had stronger coding of L and L′ than the ALM cortex and the control region S1BF (Fig. 3b, c). S1TJ, M1TJ and ALM cortices, but not the S1BF cortex, all showed comparable encoding of θ (Fig. 3b, c). However, the traces of decoded θ in the S1TJ and M1TJ cortices contained rhythmic fluctuations that were absent in the ALM cortex, despite similar overall levels of encoding of θ (R2 values). These fluctuations indicate that the M1TJ and S1TJ cortices encoded θ in a more instantaneous manner, whereas the ALM cortex encoded θ in a continuously modulated manner that may provide a control signal for the intended lick angle or represent the position of the target port.

Higher-level cortical regions are in part defined by the presence of more abstract (or latent) representations of sensory, motor and cognitive variables28. Compared with L, L′ and θ, which describe the kinematics of individual licks, I and τ describe more abstract motor variables. In the ALM cortex, we found the strongest encoding of both I and τ (Fig. 3b, c). Encoding of I and τ became progressively weaker in the M1TJ, S1TJ, and S1BF cortices, respectively. Overall, these results reveal a neural coding scheme with increasing levels of abstraction across the S1TJ, M1TJ and ALM cortices during the execution of flexible sensorimotor sequences.

Good decoding may come from a small fraction of informative units or from dominant activity patterns across a population. Distinguishing these requires comparing the similarity between activity patterns captured by the coding axes (defined by the vector direction of regression weights), as shown above, and the dominant patterns in population activity identified in an unsupervised manner. In each recording session, we obtained neural trajectories in the coding subspaces (the subspaces spanned by coding axes) via linear decoding and trajectories in principal component subspaces (the subspaces spanned by the first few principal components) via principal component analysis. Trajectories in principal component subspaces depict dominant patterns in population activity, but the principal components per se need not have any behavioural relevance. To see whether neural trajectories in coding and principal component subspaces were the same except for a change (rotation and/or scaling) in the reference frame, we used canonical correlation analysis (Methods) to find the linear transformation of the two trajectories such that they were maximally correlated29.

After transformation, trajectories of the ALM population in the subspace of the top three principal components aligned (Fig. 3d) and correlated (Fig. 3e; group 2 in the ALM cortex) well with the trajectories in the subspace encoding θ, I and τ. This indicates that the dominant neural activity patterns in the ALM population encoded θ, I and τ. As the ALM cortex minimally encoded L and L′, including these in the coding subspaces decreased the correlation with principal component trajectories (Fig. 3e; groups 1 and 3 in the ALM cortex). The decoded trajectories and principal component trajectories in the M1TJ and S1TJ cortices also showed a strong correlation but only when the coding subspaces included L and L′.

Across regions, the sum of variance explained by the five coding axes reached about half that of the top five principal components (Methods; Extended Data Fig. 5e). The five coding axes were largely orthogonal with each other (Extended Data Fig. 5f), indicating that they not only captured dominant neural dynamics but also did so efficiently with little redundancy.
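One simple way to express how orthogonal two coding axes are is the cosine of the angle between their regression weight vectors, where a value near zero indicates near-orthogonality. The sketch below is illustrative only: the measure used for Extended Data Fig. 5f is not specified in this section, and the function names and example vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two weight vectors (0 means orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosines(axes):
    """Map each pair of named coding axes to the cosine between them."""
    names = sorted(axes)
    return {(a, b): cosine(axes[a], axes[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
```

For five coding axes this yields ten pairwise values, which can be inspected together to judge redundancy between the axes.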

Sequence branching signals in ALM–M1TJ

In backtracking sequences, mice licked back to a previous angle to relocate the port and then progressed through the rest of the sequence. The opposing deflections in the decoded θ from backtracking trials matched this behaviour (Fig. 3a, b, dashed curves for θ). This is not surprising as the M1TJ and ALM cortices are expected to encode the changed motor program, and the S1TJ cortex to signal the resulting proprioceptive or reafferent feedback. However, the motor cortical mechanisms that allow sensory feedback to integrate with unfolding motor programs30,31,32,33,34,35 could involve a movement-nonspecific signal to indicate sequence branching.

We used a linear support-vector machine to classify trials into either backtracking or standard sequences based on population activity at each time bin (Methods). Within each class, about equal numbers of left-to-right and right-to-left sequences were pooled so classifiers could not rely on the coding of specific licking movements. ALM and M1TJ activity started to predict the presence or absence of backtracking during the initial missed lick (Fig. 3f). We randomly shuffled class labels to determine chance-level classification accuracy. S1TJ populations showed only a statistically insignificant trend towards being able to distinguish backtracking from standard sequences (Fig. 3f), at much later time points (Extended Data Fig. 5g). As expected, S1BF populations showed no prediction.
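The logic of comparing classification accuracy against a label-shuffled baseline can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier with leave-one-out cross-validation stands in for the linear support-vector machine; this substitution, the function names and the cross-validation scheme are assumptions rather than the procedure in Methods.

```python
import random

def centroid(rows):
    """Mean activity vector of a list of trials."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def loo_accuracy(X, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i in range(len(X)):
        groups = {}
        for j, (row, lab) in enumerate(zip(X, labels)):
            if j != i:  # hold out trial i
                groups.setdefault(lab, []).append(row)
        cents = {lab: centroid(rows) for lab, rows in groups.items()}
        guess = min(cents, key=lambda lab: sum((a - b) ** 2
                                               for a, b in zip(X[i], cents[lab])))
        correct += guess == labels[i]
    return correct / len(X)

def shuffled_accuracies(X, labels, n_shuffles=100, seed=0):
    """Accuracies after randomly permuting labels, to estimate chance level."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_shuffles):
        perm = labels[:]
        rng.shuffle(perm)  # breaks the link between activity and trial type
        out.append(loo_accuracy(X, perm))
    return out
```

Applied at each time bin, the distribution of shuffled accuracies provides an empirical chance level against which the accuracy from the original labels can be compared.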

Context-dependent coding of subsequences

Complex sequences can be composed of different combinations of subsequences. The same subsequence can be used in multiple complex sequences, and it is crucial for the brain to keep track of the context in which a subsequence is executed36,37,38,39. To search for such sequence context signals, we trained mice on two new sequences where the port steps in a ‘zigzag’ manner from one side to the other, then steps back, and then again steps to the other side (Fig. 4a, b, Supplementary Video 6). The two sequences have symmetrical movements. By fixing one and shifting the other forwards or backwards in time, it is possible to find subsequences that have the same licking movements but different sequence contexts (Fig. 4c). There are in total four ways to shift and match subsequences, and we focused on the three licks in the middle (Fig. 4d) for analysis.

Fig. 4: Context-dependent coding of subsequences in ALM.

a, Transition diagrams depicting the two zigzag sequences, which contain symmetrical transitions. b, Example trials showing patterns of tongue angle in the two zigzag sequences. c, The four ways to shift and match subsequences. Coloured traces show tongue angles from an example session (mean ± s.d.). The arrow colours indicate the sequence to be shifted. The arrow lengths and the number in milliseconds show how much the chosen sequence must be shifted to match the other. LRLR, left–right–left–right; RLRL, right–left–right–left. d, Zoomed-in plots of c showing the three licks in the middle of matched subsequences. e, Example rasters and PETHs for three simultaneously recorded neurons. PETHs are normalized to the maximum spike rate for each neuron across the four shifts. f, Classification accuracy (black trace) for sequence identity based on population activity for the session in c–e. Chance accuracy (grey shading) was determined by randomly shuffling sequence labels. g, Similar to f, but showing mean ± 95% hierarchical bootstrap confidence interval across sessions (n = 6) and mice (n = 3). The two grey vertical bars in d–g are gridlines to aid visualization of matching time points across plots.

Three simultaneously recorded ALM neurons illustrate three types of response (Fig. 4e). The first neuron preferentially fired during blue-coloured sequences, and the second neuron fired during red-coloured sequences, whereas the third neuron responded faithfully to the physical movements with no clear sequence preference (Fig. 4e, neurons 1–3, respectively). Using population activity as a predictor, linear support-vector machine classifiers (Methods) were able to predict the sequence identity, or context, in the example session (Fig. 4f) and across sessions and mice (Fig. 4g). Chance-level classification accuracy was determined by shuffling the sequence labels.

Our results provide strong evidence that ALM neurons in mice encode complex sequences with combined information about both physical movements and the latent sequence context.

Reward modulation in ALM

In the decoding analysis for standard sequences, the τ coding axis was identified by fitting models to link neural activity and relative sequence time. We performed the same decoding analysis with zigzag sequences and found a similar ramping pattern of τ (Extended Data Fig. 5h). The monotonic coding of τ therefore does not require a constant sequence direction. However, if τ faithfully represents time, the downward deflection of traces from backtracking sequences (Fig. 3b) should not appear, as time advances regardless of what the animals do. This suggests representation of a distance to goal40, which might correspond to arrival at the last port position, water delivery, finishing water consumption, and so on.

The ALM cortex contained single neurons (Extended Data Fig. 6a) that fired actively during sequence execution but abruptly decreased firing upon tongue contact with water, even though mice continued with approximately five consummatory licks (Extended Data Fig. 6b) of similar or more strongly modulated kinematics and force (Extended Data Fig. 6c). The τ decoded from ALM populations showed similar time courses (Extended Data Fig. 6d, top left).

ALM activity was thus modulated by reward41 so as to signal reward expectation in a manner that smoothly increased as mice approached water delivery, regardless of sequence direction or lick angle, that was suppressed by the delay of progress upon backtracking, and that terminated at water delivery despite continued licking. Coding of I and θ followed more complex time courses than τ (Extended Data Fig. 6d, e).

ALM encodes upcoming sequences

In our task, sequences alternated direction across trials (Extended Data Fig. 7a). Before each trial, there was no cue to indicate the starting side. Expert mice nevertheless usually initiated sequences from the correct side without exploring the other (Extended Data Fig. 7b), suggesting internal maintenance of information about target position during inter-trial intervals. Brain regions maintaining such information may contribute to organizing higher-level sequences across trials.

In the ALM cortex, we found simultaneously recorded units that fired persistently to specific target position values during the inter-trial interval (Extended Data Fig. 7c). A linear model fitted using data from the second before cue onset showed smooth population decoding of target position across the span of many trials (Extended Data Fig. 7d). On average, ALM populations showed stronger encoding of target position (Extended Data Fig. 7e, f) than other regions. When using this model to decode during sequence execution, the resulting traces from two sequence directions crossed at mid-sequence (Extended Data Fig. 7g), showing similar structure as θ. None of the regions, including the ALM cortex, encoded time or a distance to trial start (Extended Data Fig. 7h), perhaps because our inter-trial interval contained an exponential portion (Methods) that made the time to trial start unpredictable7.

Together, our results from behaviour analysis, population electrophysiology and optogenetics define key sensory and motor cortices in mice that govern hierarchical execution of flexible, feedback-driven sensorimotor sequences.

Methods

Mice

All procedures were in accordance with protocols approved by the Johns Hopkins University Animal Care and Use Committee (protocols: MO18M187 and MO21M195). Mice were housed in a room on a reverse light–dark cycle, with each phase lasting 12 h, and maintained at 20–25 °C and 30–70% humidity. Before surgery, mice were housed in groups of up to five, but afterwards were housed individually. Fifteen mice (12 male and 3 female) were obtained by crossing VGAT-IRES-Cre (Jackson Labs: 028862; B6J.129S6(FVB)-Slc32a1tm2(cre)Lowl/MwarJ)42 with Ai32 (Jackson Labs: 012569; B6;129S-Gt(ROSA)26Sortm32(CAG-COP4*H134R/EYFP)Hze/J)43 lines. Two (one male and one female) were heterozygous VGAT-ChR2-EYFP (Jackson Labs: 014548; B6.Cg-Tg(Slc32a1-COP4*H134R/EYFP)8Gfng/J)44 mice. Twelve (nine male and three female) were wild-type mice, including nine C57BL/6J (Jackson Labs: 000664) mice, one wild-type littermate for each of VGAT-ChR2-EYFP, TH-Cre (Jackson Labs: 008601; B6.Cg-7630403G23RikTg(Th-cre)1Tmd/J)45, and Etv1-Cre−/− (Jackson Labs: 013048)46. Two were male TH-Cre mice. Two (one male and one female) were Advillin-Cre (Jackson Labs: 032536; B6.129P2-Aviltm2(cre)Fawa/J)47 mice. Mice ranged in age from approximately 2 to 9 months at the start of training. A set of behavioural testing sessions typically lasted approximately 1 month (Supplementary Table 1).

Surgery

Before behavioural testing, mice underwent implantation of a metal headpost. For surgical procedures, mice were anaesthetized with isoflurane (1–2%) and kept on a heating blanket (Harvard Apparatus). Lidocaine or bupivacaine was used as a local analgesic and injected under the scalp at the start of surgery. Ketoprofen was injected intraperitoneally to reduce inflammation. All skin and periosteum above the dorsal surface of the skull were removed. The temporal muscle was detached from the lateral edges of the skull on either side and the bone ridge at the temporal–parietal junction was thinned using a dental drill to create a wider accessible region. Metabond (C & B Metabond) was used to cover the entirety of the skull surface in a thin layer, seal the skin at the edges and cement the headpost onto the skull over the lambda suture.

To make the skull transparent, a layer of cyanoacrylate adhesive was then dropped over the entirety of the Metabond-coated skull and left to dry. A silicone elastomer (Kwik-Cast) was then applied over the surface to prevent deterioration of skull transparency before photostimulation. Buprenorphine was used as a post-operative analgesic and the mice were allowed to recover over 5–7 days following surgery with free access to water.

For silicon probe recording, a small craniotomy of about 600 μm in diameter was made for implantation of a ground screw. The skull was thinned using a dental bur until the remaining bone could be carefully removed with a tungsten needle and forceps. Following this, one or more craniotomies of about 1 mm in diameter were made over the sites of interest for silicon probe recording. Craniotomies were protected with a layer of silicone elastomer (Kwik-Cast) on top. Additional craniotomies were usually made in new locations after finishing recordings in previous ones.

Task control

Task control was implemented with an Arduino-based system (Teensy 3.2 and Teensyduino), including the generation of audio (Teensy Audio Shield). Custom MATLAB-based software with a graphical user interface was developed to log task events and change task parameters. Touches between the tongue and the port were registered by a conductive lick detector (Svoboda lab, HHMI Janelia Research Campus), in which the mouse acted as a mechanical switch that opened (no touch) or closed (with touch) the circuit. Any mechanical switch has electrical bouncing issues when a contact is weak and unstable. To handle bouncing during loose touches, we merged any contact signals with intervals less than 60 ms.
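The merging of bouncing contact signals can be sketched as below. The example assumes that "interval" refers to the gap between the offset of one contact and the onset of the next, and that contacts are available offline as (onset, offset) pairs in seconds; the function name is hypothetical and the real-time implementation on the Teensy board may differ.

```python
def merge_contacts(contacts, max_gap=0.060):
    """Merge touch intervals separated by less than max_gap seconds.

    contacts: list of (onset, offset) times in seconds, in any order.
    Returns a new list of merged (onset, offset) tuples sorted by onset.
    """
    merged = []
    for onset, offset in sorted(contacts):
        if merged and onset - merged[-1][1] < max_gap:
            # Gap shorter than the threshold: treat as one bouncing contact.
            last_onset, last_offset = merged[-1]
            merged[-1] = (last_onset, max(last_offset, offset))
        else:
            merged.append((onset, offset))
    return merged
```

For example, contacts at 0–20 ms and 50–80 ms are separated by 30 ms and become a single touch lasting 0–80 ms, whereas a contact starting at 200 ms remains separate.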

The auditory cue that signalled the beginning of each trial was a 0.1 s long, 65 dB SPL and 15 kHz pure tone. Touches that occurred during the auditory cue were not used to trigger port movement as they were probably due to impulsive licking rather than a reaction to the cue.

The lick port was motorized in the horizontal plane by two perpendicular linear stages (LSM050B-T4 and LSM025B-T4, Zaber Technologies), one for anterior and posterior movement and the other for left and right. A manual linear stage (MT1/M, Thorlabs) installed in the vertical direction controlled the height of the lick port. The motors were driven by a controller (X-MCB2, Zaber Technologies), which was in turn commanded by the Teensy board via serial interface communication. Although the linear stages were set up in cartesian coordinates, we specified the movement of the port using a polar coordinate system. For a chosen origin of the polar coordinates, the seven port positions were arranged in an arc symmetrical to the midline with equal spacing (in arc length) between adjacent positions (Fig. 1a).
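
As an illustration of the polar specification, the sketch below converts equally spaced angles on an arc into cartesian targets for the two linear stages. The 5 mm radius matches the maximum used for male mice; the 90° arc span, the axis convention and the function name `port_positions` are assumptions made for the example only.

```python
import math


def port_positions(radius_mm, arc_span_deg, n_positions=7, origin=(0.0, 0.0)):
    """Return (x, y) targets in mm, equally spaced on an arc symmetric about the midline.

    x is lateral (positive to the right) and y is anterior; equal angular steps
    give equal arc length between adjacent positions.
    """
    half = math.radians(arc_span_deg) / 2
    step = 2 * half / (n_positions - 1)
    positions = []
    for i in range(n_positions):
        angle = -half + i * step  # angle from the midline
        positions.append((origin[0] + radius_mm * math.sin(angle),
                          origin[1] + radius_mm * math.cos(angle)))
    return positions


positions = port_positions(radius_mm=5.0, arc_span_deg=90.0)  # span is hypothetical
```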

A movement of the lick port was triggered by the onset of a touch during sequence performance. A second port movement could not be triggered within a refractory period of 80 ms, which prevented mice from driving a sequence by constantly holding the tongue on the port (although we never observed such behaviour). When a movement was triggered, the port first accelerated (477 or 715 mm s−2) until the maximal speed (39.3 mm s−1) was reached, then maintained the maximal velocity, and decelerated until it stopped at the end position. The acceleration and deceleration phases were always symmetrical, such that the maximal velocity might not be reached if the distance of travel was short.
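
The resulting velocity profile is trapezoidal, or triangular for short moves. The sketch below computes the travel time from the stated acceleration and maximal speed, assuming an ideal profile with no controller latency; `travel_time` is a hypothetical helper, not part of the motor controller interface.

```python
import math


def travel_time(distance_mm, accel=477.0, v_max=39.3):
    """Duration (s) of a move with symmetric acceleration and deceleration phases."""
    d_ramp = v_max ** 2 / accel  # distance covered while ramping up to and down from v_max
    if distance_mm < d_ramp:
        # Triangular profile: the port starts decelerating before reaching v_max
        return 2 * math.sqrt(distance_mm / accel)
    # Trapezoidal profile: two ramps plus a constant-speed segment
    return 2 * v_max / accel + (distance_mm - d_ramp) / v_max


short_move = travel_time(1.0)   # maximal speed is not reached
long_move = travel_time(10.0)   # includes a constant-speed segment
```

With these parameters, the maximal speed is reached only for moves longer than about 3.2 mm.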

The movement was typically in a straight line. For four of the nine mice, when the two positions were not adjacent (for example, at backtracking and the following transition), the port would move in an outward half circle whose diameter was the linear distance separating the two positions. This arc motion minimized the chance of mice occasionally catching the port prematurely before the port stopped. Nevertheless, catching the port prematurely did not trigger the next transition in a sequence because, in this case, the port movement could only be triggered again after 200 ms from the start of backtracking (and 300 ms after the following touch). As a result, mice always needed to touch the port at the fully backtracked position to continue progress in a sequence.

The control of port movement was similar for zigzag sequences except that five port positions were used instead of seven, the refractory period before the next trigger was 100 ms, the acceleration was 2,000 mm s−2, the maximal speed was 75 mm s−1 and every port movement travelled along an outward half circle.

Mice performed the task in darkness with no visual cues about the position of the port. To prevent mice from using sounds emitted by the motor to guide their behaviour, we played two types of noise throughout a session. The first was a constant white noise (cut-off at 40 kHz; 80 dB SPL) and the second was a random playback (with 150–300-ms interval) of previously recorded motor sounds during 12 different transitions.

Two-axis optical force sensors

A stainless steel lick tube was fixed on one end to form a cantilever. Mice licked the free end, producing a small displacement of the tube (less than approximately 0.1 mm at the tip for 5 mN). Two photointerrupters (GP1S094HCZ0F, Sharp) placed along the tube (Extended Data Fig. 1c, d) were used to convert the vertical and horizontal components of displacement into voltage signals. Specifically, the cantilever normally blocked about half of the light passing through, outputting a voltage value in the middle of the measurement range. Pushing the tip down caused the cantilever to block more light at the vertical sensor and thereby decreased the output voltage; conversely, less force applied at the tip resulted in increased voltage. For the horizontal sensor, pushing the tube to the left or right decreased or increased the voltage output, respectively. Output was amplified by an op-amp and then recorded via an RHD2000 Recording System (Intan Technologies).

By design (the circuit diagram and the displacement–response curve are available in the GP1S094HCZ0F datasheet), the force applied at the tip of the lick tube and the output voltage of the sensor follow a near linear relationship within a range of forces. To find this range, we measured the voltages (relative to baseline) with different weights added to the tip. Excellent linearity (R² = 0.9999) was achieved up to more than 20 mN (Extended Data Fig. 1d). By contrast, the maximal force of a lick was on average about 4 mN (Extended Data Fig. 1f).

The motorization of the lick tube introduced mechanical noise to the force signals. The spectral components of these noises were mainly at 300 Hz and its higher harmonics, presumably due to the resonance frequency of the tube, whereas the force signal induced by licking occupied much lower frequencies. Therefore, we low-pass (at 100 Hz) filtered the original signal (sampled at 30 kHz) to remove the motor noise. Additional interference came from the 850-nm illumination light used for high-speed video, which leaked into the optical sensors (mainly in early experiments with two mice) and caused slow fluctuations in the baseline over seconds. To mitigate this slow drift, we used a baseline estimated separately for each individual lick as follows. We first masked out the parts of the signal when the tongue was touching the port, then linearly interpolated to fill in these masked out lick portions using the neighbouring (that is, no touch) values. These interpolated time series served as the baseline for each lick. As the lick force was only a function of voltage change compared to baseline, the above procedure would at most negligibly affect the force estimation. Owing to the dependency of this procedure on complete touch detection, we excluded eight sessions from behavioural quantifications in Fig. 1 and Extended Data Figs. 1, 2 in which only touch onsets were correctly registered.
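
The per-lick baseline estimate can be sketched as below. This is an illustrative reimplementation under simple assumptions (a sampled voltage trace and a boolean touch mask of equal length); the name `interpolate_baseline` and the example values are hypothetical.

```python
def interpolate_baseline(signal, touching):
    """Replace samples recorded during touch by linear interpolation between no-touch neighbours."""
    n = len(signal)
    baseline = list(signal)
    i = 0
    while i < n:
        if not touching[i]:
            i += 1
            continue
        start = i
        while i < n and touching[i]:
            i += 1
        left, right = start - 1, i  # nearest no-touch samples on either side
        for k in range(start, i):
            if left < 0 and right >= n:
                continue  # entire trace is masked; nothing to interpolate from
            if left < 0:
                baseline[k] = signal[right]
            elif right >= n:
                baseline[k] = signal[left]
            else:
                frac = (k - left) / (right - left)
                baseline[k] = signal[left] + frac * (signal[right] - signal[left])
    return baseline


voltage = [1.0, 1.0, 0.4, 0.3, 0.5, 1.3, 1.3]
touching = [False, False, True, True, True, False, False]
baseline = interpolate_baseline(voltage, touching)
# Voltage change relative to the drifting baseline, from which force is derived
delta_v = [v - b for v, b in zip(voltage, baseline)]
```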

High-speed videography and tongue tracking

High-speed video (400 Hz, 0.6-ms exposure time, 32 µm per pixel, 800 × 320 pixels) providing side and bottom views of the mouth region was acquired using a ×0.25 telecentric lens (55–349, Edmund Optics), a PhotonFocus DR1-D1312-200-G2-8 camera and Streampix 7 software (Norpix). Illumination was via an 850-nm LED (LED850-66-60, Roithner Laser) passed through a condenser lens (Thorlabs).

Three deep convolutional neural networks were constructed (MATLAB 2017b, Neural Network Toolbox v11.0) to extract tongue kinematics and shape from these videos. The first network classified each frame as ‘tongue-out’ if a tongue was present, or ‘tongue-in’ otherwise. This network was based on ResNet-50 (ref. 48) (pretrained for ImageNet), but the final layers were redefined to classify the two categories using a softmax layer and a classification layer that computes cross-entropy loss. A total of 37,658 frames were manually labelled, of which 1,611 frames were set aside as testing data. Image augmentation was performed to expand the training dataset. A standard training scheme was used with a mini-batch size of 32 and a learning rate of 1 × 10⁻⁴ to 1 × 10⁻⁵. The fully trained network achieved a high accuracy in classifying the validation data (Extended Data Fig. 1a).

The second network assigned a vector from the base to the tip of the tongue in each frame classified as tongue-out. L and θ were derived from this vector (Fig. 1c). A total of 12,095 frames were manually labelled in which 643 frames were used only for testing. The architecture and training parameters of this network are similar to those of the classification network except that the final layers were redefined to output the x and y image coordinates of the base, tip and two bottom corners (not used in analysis) of the tongue with mean absolute error loss. The regression error of the fully trained network in testing data was 3.1 ± 5.4° for θ and 0.00 ± 0.13 mm for L (mean ± s.d.). This performance was comparable to human level (Extended Data Fig. 1b). Specifically, a subset of frames (separate from testing data) was labelled by each of the five human labellers. The variability in human judgement was quantified by the differences between L and θ from individual humans and the human mean for each frame. We also computed the differences between L and θ from the network and the human mean for each frame. The two distributions showed a comparable variability, although the network showed small biases (L: humans 0 ± 0.11 mm, network −0.05 ± 0.10 mm; θ: humans 0 ± 5.7°, network 3.3 ± 5.5°; mean ± s.d.).
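
For illustration, L and θ can be derived from the predicted base and tip coordinates as sketched below. The pixel size (32 µm) comes from the video specification; the assumption that the midline runs along the image y axis, the sign convention for θ and the function name are hypothetical.

```python
import math


def tongue_length_angle(base_xy, tip_xy, mm_per_pixel=0.032):
    """Return tongue length L (mm) and angle theta (degrees from the midline)."""
    dx = (tip_xy[0] - base_xy[0]) * mm_per_pixel  # lateral component
    dy = (tip_xy[1] - base_xy[1]) * mm_per_pixel  # component along the assumed midline
    length = math.hypot(dx, dy)
    theta = math.degrees(math.atan2(dx, dy))
    return length, theta


L_straight, theta_straight = tongue_length_angle((100, 100), (100, 200))
L_oblique, theta_oblique = tongue_length_angle((100, 100), (200, 200))
```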

In a subset of trials and in frames classified as tongue-out, the third network, a VGG13-based SegNet (ref. 49), extracted the shape of the tongue by semantic image segmentation, that is, classifying each pixel as belonging to a tongue or not. Human labellers used a 10-vertex polygon to encompass the area of the tongue in a total of 3,856 frames. The training parameters were similar to the other networks except for a mini-batch size of eight and a learning rate of 1 × 10⁻³.

Behavioural training

Behavioural sessions occurred once per day during the dark phase and lasted for approximately 1 h or until the mouse stopped performing, whichever came earlier. Mice would receive all of their water from these sessions, unless it was necessary to supply additional water to maintain a stable body weight. The amount of water consumed during behaviour was measured by subtracting the post-session volume of water in the dispenser from the pre-session volume. On days in which their behaviour was not tested, they received 1 ml of water. Mice were water restricted (1 ml daily) for at least 7 days before beginning training. Whiskers and hairs around the mouth were trimmed frequently to avoid contact with the port.

The precise position of the implanted headpost varied across mice, so each mouse required an initial setup of the positions of the lick port. The lick port moved in an arc with respect to a chosen origin (see ‘Task control’). The origin was initially set at the midline of the animal and 2 mm posterior from the posterior face of the upper incisors. If there was any yaw of the head, the whole arc was rotationally shifted accordingly. The height of the lick port was manually adjusted until it was approximately 1 mm below the interface between the upper and lower lips when the mouth was closed.

In initial training sessions, the distance between the leftmost (L3) and the rightmost (R3) lick port position was reduced, the radius of the arc was shortened and the water reward was larger. As mice learned the task, both the L3 to R3 distance and the radius of the arc were gradually increased over a few days of training (Extended Data Fig. 1m). The difficulty of the task was increased whenever the mouse showed improvements in performing the task at the current port distance, radius and reward size. The difficulty remained constant in two conditions: either when the maximum set of parameters had been met (a radius of 5 mm for male mice and 4.5 mm for female mice) or if the mouse appeared demotivated (typically indicated by a notable decrease in the number of trials and licks). During the initial training sessions, water was occasionally supplemented at other points during the sequence to encourage licking behaviour. The amount of water reward per trial was eventually lowered to approximately 3 μl. For 3 of the 33 mice included in this study, we first trained them to lick in response to the auditory cue with the lick port staying at fixed positions. After mice responded consistently to the go cue, we shifted to the complete task with gradually increased difficulty. Although the three mice performed similarly to others when well trained, this procedure proved to be less efficient than beginning with the complete task.

Once mice had become adept at standard sequences, they were trained on the backtracking sequences. The first nine fully trained mice were used in backtracking-related analyses; mice used later for other purposes were not always fully trained in backtracking. For five of the nine mice, we first trained them with backtracking trials in only one direction and added the other direction once they mastered the first. For three of the nine mice, backtracking trials and standard trials were organized into separate blocks of 30 trials each. Two of these three mice continued to perform the block-based backtracking trials during recording sessions. In developing this task, we tested subtle variations in the detailed organization of trial types, such as varying the percentage of backtracking trials in a block, or different forms of jumps in the port position. Details appear in Supplementary Table 1. All nine mice eventually learned backtracking sequences but showed mixed learning curves (Extended Data Fig. 2a, b). About three mice were more biased towards previously learned standard sequences and tended to miss the port many times before relocating the lick port through exploration. The other six mice more readily made changes.

The shaping processes for zigzag sequences in a total of four mice all differed. Empirically, however, training on standard sequences first until proficiency and then on zigzag sequences could produce desirable performance.

Hearing loss

Hearing loss experiments were performed to exclude the possibility that mice used sounds produced by the motors to localize the motion of the lick port during sequence performance. To induce temporary hearing loss (approximately 27.5 dB attenuation)50, we inserted two earplugs made of malleable putty (BlueStik Adhesive Putty, DAP Products Inc.) into the openings of the ear canal bilaterally under microscopic guidance. Earplugs were shaped like balls and then formed appropriately to cover the unique curvature of each ear canal. When necessary, the positioning of the earplugs was readjusted, or larger balls were inserted. Five well-trained mice performed one ‘earplug’ session and one control session. Mice did not have experience with earplugs before the earplug session. In earplug sessions, mice were first anaesthetized under isoflurane to implant earplugs (taking 11–12.5 min), then were put back in the homecage to recover from anaesthesia (taking 10–11.5 min), and performed the task after recovery. In control sessions, mice were anaesthetized for the same duration and allowed to recover for the same duration before performing the task.

Odour masking

Odour masking experiments were performed to exclude the possibility that mice used potential odours emanating from the lick port to localize its position during sequence performance. A fresh air outlet (1.59 mm in diameter) was placed in front of the mouse and aimed at the nose from approximately 2 cm away with an approximately 45° downward angle. We checked the coverage of air flow (2 LPM) by testing whether a water droplet (approximately 3 μl) would vigorously wobble in the flow at various locations, and confirmed that both the nose and all seven port positions were covered. Before the test session, head-fixed mice were habituated to occasional air flows when they were not performing sequences. In the test session, the air flow was turned off first and turned on continuously after the one-hundredth trial (in four mice) until the end of the session, or turned on first and turned off after the one-hundredth trial (in two mice). The air-off period served as the control condition for the air-on period.

Tongue numbing

Tongue numbing experiments were performed to directly test whether proper sequence execution depended on tactile feedback from the tongue. The sodium channel blocker lidocaine is used clinically to block signals from somatosensory afferents in the periphery. Before a behavioural session, mice were anaesthetized under isoflurane, and a cotton ball soaked with 2% lidocaine (for numbing) or saline (as control) was inserted into the oral cavity, covering the tongue. After 10 min, the cotton ball was removed, the anaesthesia was terminated and the mice woke up in a behavioural setup to perform standard sequences. As lidocaine has a relatively short half-life, we limited the analysis to trials performed within approximately 30 min after removing the cotton ball. One of the six mice was excluded from analysis as it was unable to perform the task within approximately 30 min after its tongue was numbed.

Electrophysiology

Two types of silicon probe were used to record extracellular potentials. One (H3, Cambridge Neurotech) had a single shank with 64 electrodes evenly spaced at 20-µm intervals. The other (H2, Cambridge Neurotech) had two shanks separated by 250 µm, where each shank had 32 electrodes evenly spaced at 25-µm intervals. Before each insertion, the tips of the silicon probe were dipped in either DiI (saturated), CM-DiI (1 mg ml−1) or DiD (5–10 mg ml−1) ethanol solution and allowed to dry. Probe insertions were either vertical or at 40° from the vertical line depending on the anatomy of the recorded region and surgical accessibility. Once the probe was fully inserted, the brain was covered with a layer of 1.5% agarose and ACSF, and the probe was left to settle for approximately 10 min before recording. On the basis of the depth of the probe tip, the angle of penetration and the position of these sites, the location of units could be determined. Units recorded outside the target structure were excluded from analysis.

Extracellular voltages were amplified and digitized at 30 kHz via an RHD2164 amplifier board and acquired by an RHD2000 system (Intan Technologies). No filtering was performed at the data acquisition stage. Kilosort51 was used for initial spike clustering. We configured Kilosort to high-pass filter the input voltage time series at 300 Hz. The automatic clustering results were manually curated in Phy for putative single-unit isolation. We noticed a previously reported issue of Phy double counting a small fraction of spikes (with exactly the same timestamps) after manually merging certain clusters. We therefore removed duplicated spike times within a cluster post hoc, keeping only one copy.

Cluster quality was quantified using two metrics (Extended Data Fig. 4a–c, e). The first was the percentage of inter-spike intervals violating the refractory period (RPV). We set 2.5 ms as the duration of the refractory period and used 1% as the RPV threshold above which clusters were regarded as multi-units. It has been argued that RPV does not represent an estimate of false alarm rate of contaminated spikes52,53 as units with low spike rates tend to have lower RPV, whereas units with high spike rates tend to show higher RPV even if they are contaminated with the same percentage of false-positive spikes. Therefore, we estimated the contamination rate based on a reported method52. A modification was that we computed the mean spike rate of a cluster from periods during which the spike rate was at least 0.5 spikes per second rather than from an entire recording session. As a result, the mean spike rate reflected more about neuronal excitability than task involvement. Any clusters with more than 15% contamination rate were regarded as multi-units. Combining these two criteria in fact classified fewer single units than using a single, although more stringent, RPV of 0.5%. A low RPV can fail potentially well-isolated fast-spiking interneurons whose inter-spike intervals can frequently be shorter than the set threshold.
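
The RPV metric itself is straightforward to compute; a minimal sketch is given below, assuming sorted spike times in seconds. It covers only the RPV criterion, not the contamination-rate estimate of ref. 52, and the function name is hypothetical.

```python
def rpv_percent(spike_times, refractory=0.0025):
    """Percentage of inter-spike intervals shorter than the refractory period."""
    isis = [b - a for a, b in zip(spike_times, spike_times[1:])]
    if not isis:
        return 0.0
    violations = sum(1 for isi in isis if isi < refractory)
    return 100.0 * violations / len(isis)


# One of four intervals (1 ms) violates the 2.5-ms refractory period
rpv = rpv_percent([0.0, 0.001, 0.1, 0.2, 0.3])
is_multi_unit = rpv > 1.0  # 1% threshold stated in the text
```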

Photostimulation

We used the ‘clear-skull’ preparation6, a method that greatly improves the optical transparency of intact skull (see the ‘Surgery’ section), to non-invasively photoactivate channelrhodopsin-expressing GABAergic neurons and thus indirectly inhibit nearby excitatory neurons (Extended Data Fig. 3a).

Bilateral stimulation of the brain was achieved using a pair of optic fibres (0.39 NA, 400-µm core diameter) that were manually positioned above the clear skull before the beginning of each behavioural session. These optic fibres were coupled to 470-nm LEDs (M470F3, Thorlabs). The illumination power was externally controlled via WaveSurfer (http://wavesurfer.janelia.org). Each stimulation had a 2-s long 40-Hz sinusoidal waveform with a 0.1-s linearly modulated ramp-down at the end. The peak powers in the main experiments were 16 mW and 8 mW. We used the previously reported 50% transmission efficiency of the clear-skull preparation6 and report the estimated average power in the main text. There was a 10% chance of light delivery triggered at each of the following points in a sequence: cue onset, the middle touch or the first touch after water delivery. To ensure that the light from photostimulation did not affect the performance of the mouse through vision, we set up a masking light with two blue LEDs directed at each eye of the mouse. Each flash of the masking light was 2 s long separated by random intervals of 5–10 s. This masking light was introduced several training sessions in advance of photostimulation to ensure that the light no longer affected the behaviour of the mouse. In addition, the optic fibres were positioned to shine light from approximately 5 to 10 mm above the head of the mouse on these days leading up to photostimulation.
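
The relationship between peak power and the estimated average power can be illustrated with the arithmetic below. It assumes that the sinusoid oscillates between zero and the peak power (so that its mean is half the peak) and ignores the brief ramp-down; both points are assumptions of this sketch.

```python
def estimated_average_power(peak_mw, transmission=0.5):
    """Estimated mean power (mW) reaching the brain for a 0-to-peak sinusoidal waveform."""
    mean_at_fibre = peak_mw / 2  # mean of a sinusoid spanning 0 to peak
    return mean_at_fibre * transmission  # 50% clear-skull transmission (ref. 6)


power_high = estimated_average_power(16)  # 4.0 mW
power_low = estimated_average_power(8)    # 2.0 mW
```

Under these assumptions, a 16-mW peak corresponds to an estimated 4 mW per hemisphere.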

In a subset of silicon probe recording sessions (related to Extended Data Fig. 3c–f), we used an optic fibre (0.3 NA, 400-µm core diameter) to simultaneously photoinhibit the same cortical region (within 1 mm) or a different cortical region (approximately 1.5 or approximately 3 mm away) via a craniotomy. The tip of the fibre was kept approximately 1 mm away from the brain surface. For testing the efficiency of photoinhibition, the same 2-s photostimulation was applied but only at the mid-sequence, with 7.5% probability for each of the four powers (1, 2, 4 and 8 mW). For each isolated unit, the photo-evoked spike rate was normalized to that obtained during the equivalent 2-s time window without photostimulation. To avoid a floor effect, we also excluded units that on average fired less than one spike during the no stimulation windows. We classified units as putative pyramidal neurons if the width of the average spike waveform (defined as time from trough to peak) was greater than 0.5 ms, and as putative fast-spiking interneurons if shorter than 0.4 ms or if units had more than twice the firing rate during 8-mW photostimulations than during periods of no stimulation.

With the light powers we used in the main experiments (4 mW each hemisphere), light within a 1-mm distance reduced the mean spike rate of putative pyramidal cells (Extended Data Fig. 3c–e) by 91%, light at approximately 1.5 mm away by 61%, and at approximately 3 mm away by 19% in behaving animals (Extended Data Fig. 3f). The mean spike rate of putative fast-spiking neurons at approximately 3 mm away was also reduced by 19%, rather than showing an increase due to photoactivation, suggesting that the decreased activity of both pyramidal and fast-spiking neurons was probably due to a reduction of cortical input. By contrast, light shined within 1 mm increased the mean spike rate of fast-spiking neurons by 739% and at approximately 1.5 mm by 140%.

Histology

Mice were perfused transcardially with PBS followed by 4% PFA in 0.1 M PB. The tissue was fixed in 4% PFA at least overnight. The brain was then suspended in 3% agarose in PBS. A vibratome (HM 650V, Thermo Scientific) cut coronal sections of 100 μm that were mounted and subsequently imaged on a fluorescence microscope (BX41, Olympus). Images showing DiI and DiD fluorescence were collected to recover the location of silicon probe recordings. The plotted coordinates of recording sites (Fig. 2d) were randomly jittered by ±0.05 mm to avoid visual overlap.

General data analysis

All analyses were performed in MATLAB (MathWorks) version 2019b unless noted otherwise.

The first trial and the last trial were always removed due to incomplete data acquisition. Trials in which mice did not finish the sequence before video recording stopped were excluded from the analyses that involved kinematic variables of tongue motion.

We assigned mice of appropriate genotypes to experimental groups arbitrarily, without randomization or blinding. We did not use statistical methods to predetermine sample sizes. Sample sizes are similar to those reported in the field.

Behavioural quantifications

The duration of individual licks was variable. To average quantities within single licks (Fig. 1, Extended Data Figs. 1, 2, 6), we first linearly interpolated each quantity using the same 30 time points spanning the lick duration (from the first to the last video frame of a tracked lick). L′ was computed before interpolation. When the tongue was short, the regression network showed greater variability in determining θ and sometimes produced outliers. Thus, we detected and replaced outliers using the MATLAB ‘filloutliers’ function (with ‘nearest’ and ‘quartiles’ options), and only included θ when L was longer than 1 mm. In addition, any ‘lick’ with a duration shorter than 10 ms was excluded.
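
The time normalization of single licks can be sketched as below, assuming each lick is a list of per-frame values. The function name `resample_lick` is hypothetical, and outlier handling is omitted.

```python
def resample_lick(values, n_points=30):
    """Linearly interpolate a within-lick time series onto n_points evenly spaced samples."""
    if len(values) == 1:
        return [values[0]] * n_points
    last = len(values) - 1
    out = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional index into the original frames
        lo = min(int(pos), last - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


# A lick tracked over 6 video frames, resampled to 30 points for averaging across licks
lick_length = [0.5, 1.8, 2.9, 3.1, 2.2, 0.9]
resampled = resample_lick(lick_length)
```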

For licks occurring at the most lateral positions, the tongue would typically ‘shoot’ out and quickly but briefly reach a maximal deviation from midline (|θ|max) (Extended Data Fig. 1g). As a result, the onset of touch mostly occurred around |θ|max. When analysing licks that may or may not have contact, we used θshoot, defined as the θ when L reached 0.84 maximal L (Lmax), to succinctly depict the lick angle (Extended Data Fig. 1g).
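
A minimal sketch of the θshoot definition follows. It takes the first tracked frame at which L reaches 0.84 of Lmax, without interpolating between frames; that choice and the function name are assumptions.

```python
def theta_shoot(lengths, thetas, fraction=0.84):
    """Return theta at the first frame where L reaches `fraction` of its maximum."""
    l_max = max(lengths)
    for length, theta in zip(lengths, thetas):
        if length >= fraction * l_max:
            return theta
    return None


# Lmax is 3.0 mm, so the threshold is 2.52 mm, first reached at the third frame
angle = theta_shoot([0.5, 1.5, 2.6, 3.0, 2.0], [2.0, 10.0, 25.0, 30.0, 20.0])
```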

The instantaneous lick rate was computed as the reciprocal of the inter-lick interval (ILI). The instantaneous sequence speed was defined as the reciprocal of the duration from the touch onset of a previous port position to the touch onset of the next.

Values in the learning curves (Extended Data Figs. 1l, m, 2a, b) were averaged in bins of 100 trials, with 50% overlap of consecutive bins.
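
The overlapping bins can be sketched as follows. Dropping a trailing partial bin is an assumption of this example, as is the function name.

```python
def binned_means(values, bin_size=100, overlap=0.5):
    """Average values in bins of bin_size trials, with consecutive bins overlapping by `overlap`."""
    step = int(bin_size * (1 - overlap))
    return [sum(values[s:s + bin_size]) / bin_size
            for s in range(0, len(values) - bin_size + 1, step)]


# 200 trials give three bins starting at trials 0, 50 and 100
curve = binned_means(list(range(200)))
```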

The behavioural effects of photoinhibition (Extended Data Fig. 3j–m) were quantified in two steps. First, we used 0.2-s time bins to compute θshoot, Lmax, the rate of licks and the rate of touches as functions of time for each trial. The time series of SD(θshoot) was computed from binned θshoot across trials in each experimental condition and each session. Second, bins within a time window during photoinhibition (or equivalent time for trials without inhibition) were averaged to yield a single number. The time window was typically 1 s following the start of photoinhibition. The shorter window helped to minimize the effects ‘bleeding over’ from mid-sequence to initiation, and from consumption to mid-sequence. However, this was not an issue for the consumption period, and we instead used the 2-s window during which light was delivered (Fig. 2c, right; ‘Cons’ in Extended Data Fig. 3k, m). Figure 2b, c presents the same results quantified in Extended Data Fig. 3j, k but directly plotting changes in means between conditions on schematic brain images.

Standardization of ILIs within lick bouts

Owing to individual variability, different mice tended to lick at slightly different rates within lick bouts. The same mouse might also perform a bit faster in one sequence direction than the other. Even in a given direction, a mouse might start faster and then slow down a little, or go slower first and faster later. When aligning trials from heterogeneous sources, a 10% difference in lick rate, for instance, will result in a complete mismatch (reversed phase) of lick cycle after only five licks. Therefore, before the analyses that were sensitive to inconsistent lick rates (Figs. 2e–h, 3, 4, Extended Data Figs. 4–7, except for Extended Data Fig. 4f–h), we linearly stretched or shrunk ILIs within each lick bout to a constant value of 0.154 s (that is, 6.5 licks per second), which is around the overall mean. The lick timestamps used to compute ILIs were the mid-time of the duration of each lick. A lick bout was operationally defined as a series of consecutive licks in which every ILI must be shorter than 1.5× the median of all ILIs in the entire behavioural session. ILIs outside lick bouts were unchanged. For ease of programming, we compensatorily scaled the time between the last lick of a trial and the start of the next trial to maintain an unchanged global trial time. Original time series, including spike rates and L′, were obtained before standardizing ILIs. After standardization, the behavioural and neural time series were resampled uniformly at 400 samples per second.
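
The core of the standardization can be sketched as below. The example only remaps lick timestamps; warping the associated time series and the compensatory scaling at the end of each trial are omitted, and the function name is hypothetical.

```python
from statistics import median


def standardize_lick_times(lick_times, target_ili=0.154, bout_factor=1.5):
    """Set every within-bout ILI to target_ili and leave longer ILIs unchanged."""
    ilis = [b - a for a, b in zip(lick_times, lick_times[1:])]
    threshold = bout_factor * median(ilis)  # ILIs below this belong to a lick bout
    new_times = [lick_times[0]]
    for ili in ilis:
        new_ili = target_ili if ili < threshold else ili
        new_times.append(new_times[-1] + new_ili)
    return new_times


# Four licks in a bout followed by a long pause before the next lick
standardized = standardize_lick_times([0.0, 0.14, 0.30, 0.45, 2.0])
```

A full implementation would map every sample time through the piecewise-linear function defined by the original and standardized lick times before resampling.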

Trial selection for standard and backtracking sequences

After standardizing lick bout ILIs, we used a custom algorithm to select a group of trials with the most similar sequence performance. First, all trials of the same sequence type in a behavioural session were collected and a time window of interest was determined. In Fig. 2e–h and Extended Data Fig. 4, we used 0–0.5 s from cue onset, −1 to 1 s from middle touch, and −0.5 to 0.7 s from last consummatory touch for the respective periods. In Fig. 3, we used −1 to 1 s from middle touch. In Extended Data Fig. 6, we used −0.5 to 1 s from the first lick touching water. Next, for each trial, we created three time histograms (with a 10-ms bin size): one for all licks, one for all touches and one for touches that triggered port movements. The three time histograms were then smoothed by a Gaussian filter (100-ms kernel width, 20-ms s.d.). Concatenating them along time gave a single feature vector that depicts the licking pattern and performance for the trial. Last, pairwise Euclidean distances were computed among feature vectors of all candidate trials and we chose a subset of n trials with the lowest average pairwise distance, that is, those that have the most similar lick and touch patterns. The number n was set to one-third of the available candidate trials with a minimal limit of n = 10 trials. We used this relatively low fraction mainly to handle the greater behavioural variability in sequences with backtracking. To handle trial-to-trial variability in sequence initiation time (defined as the interval from the cue onset to the onset of the first touch), which was not captured in our feature vectors, before clustering we limited trials to those with a sequence initiation time of less than 1 s.
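
The selection step can be approximated by the sketch below. The custom algorithm is not specified in detail, so a greedy heuristic is swapped in here: the trial with the largest summed distance to the remaining trials is dropped until n remain. This heuristic, the function name and the toy feature vectors are assumptions, not the authors' implementation.

```python
import math


def select_similar_trials(features, fraction=1 / 3, min_trials=10):
    """Return indices of a subset of trials with low average pairwise Euclidean distance."""
    n_keep = max(min_trials, round(len(features) * fraction))
    dist = [[math.dist(a, b) for b in features] for a in features]
    keep = list(range(len(features)))
    while len(keep) > n_keep:
        # Drop the trial that is, on average, farthest from the trials still kept
        worst = max(keep, key=lambda i: sum(dist[i][j] for j in keep))
        keep.remove(worst)
    return keep


# Ten similar trials and five dissimilar ones; feature vectors are toy examples
features = [[0.0, 0.1 * i] for i in range(10)] + [[10.0, 10.0]] * 5
selected = select_similar_trials(features)
```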

Trial selection and subsequence matching for zigzag sequences

After standardizing lick bout ILIs, we limited candidate trials to those with perfect sequence execution, that is, no missed licks or breaks. To find the time shift that gave the best match between two subsequences, as illustrated in Fig. 4c, we first computed the median time series of tongue angles (θ) for each of the two sequence types. Next, we identified the best time shifts as those corresponding to the peaks of a cross-correlogram between the two time series.
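
The shift search can be sketched with a plain cross-correlation over a range of lags. This example returns only the single highest peak and uses an unnormalized correlation; both are simplifications, and the function name is hypothetical.

```python
def best_shift(a, b, max_lag):
    """Lag (in samples) by which b trails a, chosen to maximize their cross-correlation."""
    def xcorr(lag):
        return sum(a[i] * b[i + lag]
                   for i in range(len(a)) if 0 <= i + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=xcorr)


# b is a copy of a delayed by three samples
theta_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
theta_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
shift = best_shift(theta_a, theta_b, max_lag=5)
```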

Analysis of zigzag sequences was intended to reveal whether neurons encoded sequence context (that is, identity) during periods with the same subsequence movements. To aid this purpose, we further selected trials whose θ were closest to the median θ computed from trials of either sequence type pooled together, unless the resulting number of trials was less than one-third of all candidate trials.

Hierarchical bootstrap

Directly averaging trials pooled across animals assumes that data from different animals, acquired in different sessions, come from the same distribution. Potentially meaningful animal-to-animal and session-to-session variability is thereby underestimated. To account for this variability, where noted, we performed a hierarchical bootstrap procedure54 when computing confidence intervals and performing statistical tests. In each iteration of this procedure, we first randomly sampled animals with replacement, then, from each of these resampled animals, sampled sessions with replacement, and then trials from each of the resampled sessions. The statistic of interest was then computed from each of these bootstrap replicates.
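
The resampling scheme can be sketched as follows, assuming the data are organized as a nested dictionary of animals, sessions and trial values. The function names, the data layout and the simple percentile interval are assumptions of this example.

```python
import random


def hierarchical_bootstrap(data, statistic, n_boot=1000, seed=0):
    """data: {animal: {session: [trial values]}}. Returns n_boot bootstrap replicates."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        sample = []
        for animal in rng.choices(list(data), k=len(data)):  # resample animals
            sessions = data[animal]
            for session in rng.choices(list(sessions), k=len(sessions)):  # then sessions
                trials = sessions[session]
                sample.extend(rng.choices(trials, k=len(trials)))  # then trials
        replicates.append(statistic(sample))
    return replicates


def percentile_interval(replicates, alpha=0.05):
    """Approximate (1 - alpha) confidence interval from sorted bootstrap replicates."""
    ordered = sorted(replicates)
    lower = ordered[int(len(ordered) * alpha / 2)]
    upper = ordered[int(len(ordered) * (1 - alpha / 2)) - 1]
    return lower, upper


toy = {"m1": {"s1": [1.0, 2.0, 3.0], "s2": [2.0, 2.5]}, "m2": {"s1": [4.0, 5.0]}}
reps = hierarchical_bootstrap(toy, lambda x: sum(x) / len(x), n_boot=200)
ci = percentile_interval(reps)
```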

PETH and NNMF clustering

Spike rates were computed by temporal binning (bin size of 2.5 ms) of spike times followed by smoothing (15-ms s.d. Gaussian kernel). The smooth PETHs were computed by averaging spike rates across trials. Each unit had six PETHs: three time windows (for sequence initiation, mid-sequence and sequence termination) each in two standard sequences (left to right and right to left). We excluded inactive units whose maximal spike rate across the six PETHs was less than 10 spikes per second. For the rest, we normalized PETHs of each unit to this maximal spike rate.

To evaluate the consistency of neuronal spiking across trials, we quantified the uncertainty in PETHs using a variant of bootstrap cross-validation. Specifically, for each neuron and in a given run, we randomly split the trials into two halves and computed PETHs with each half. We then computed the root mean squared error (RMSE) between the two sets of PETHs, producing a single RMSE value. This procedure was performed for every neuron and was repeated 200 times. The mean RMSE value for each neuron across the 200 runs is shown in Extended Data Fig. 4i.
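The split-half procedure for one neuron can be sketched as follows. The array layout (trials × time samples) and the function name are assumptions.

```python
import numpy as np

def split_half_rmse(rates, n_runs=200, seed=0):
    """Mean RMSE between PETHs computed from random halves of the trials.

    rates: (n_trials, n_time) array of smoothed spike rates for one neuron.
    """
    rng = np.random.default_rng(seed)
    n = rates.shape[0]
    rmse = np.empty(n_runs)
    for i in range(n_runs):
        perm = rng.permutation(n)
        peth_a = rates[perm[: n // 2]].mean(axis=0)
        peth_b = rates[perm[n // 2:]].mean(axis=0)
        rmse[i] = np.sqrt(np.mean((peth_a - peth_b) ** 2))
    return rmse.mean()
```

A neuron that fires identically on every trial gives a value of zero, and larger values indicate less consistent PETHs.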

To construct inputs to non-negative matrix factorization (NNMF), the six PETHs of each unit were downsampled from 2.5 ms per sample to 25 ms per sample and were concatenated along time to form a single feature vector.

NNMF is a close relative of principal component analysis (PCA) and has gained increasing popularity for processing neural data55. The algorithm finds a small number of activity patterns (non-negative left factor, analogous to principal components in PCA) along with a set of weights for each neuron (non-negative right factor), so that the original PETHs can be best reconstructed by weighted sums of those activity patterns. As a result, a small number of activity patterns (or dimensions) is usually able to capture the main structure of the original PETHs, and the weights of the neuron quantify the degree to which its activity reflects each pattern. In the context of clustering, each pattern describes representative activity of a cluster, and the pattern with the greatest weight for a neuron determines its cluster membership.

NNMF was performed using the MATLAB function ‘nnmf’ with default options. To find the best number of clusters, we tested a range of numbers with bootstrap cross-validation to see what cluster number produced the most consistent cluster membership. In each bootstrap iteration, NNMF with a given cluster number was applied using 50% of randomly sampled neurons. The extracted activity patterns were used to compute cluster memberships for the other 50% of neurons that were held-out. This process was repeated 1,000 times. The final cluster membership of a neuron was the one that had the highest likelihood of containing that neuron. We ran this method with the number of clusters set to each value from 6 to 20, and found that 13 clusters achieved the best consistency (Extended Data Fig. 4j), quantified as the mean likelihood that a neuron was grouped in the same cluster across all bootstrap iterations.
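To make the held-out membership step concrete, the sketch below factorizes a time × neurons matrix of PETH feature vectors and then assigns held-out neurons to clusters using the extracted patterns. It is written in Python with NumPy and uses plain multiplicative updates, which is a swapped-in solver chosen for brevity and not necessarily what the MATLAB ‘nnmf’ defaults do. All function names are hypothetical.

```python
import numpy as np

def nmf_mu(A, k, n_iter=500, seed=0, W=None):
    """Multiplicative-update NMF: A (time x neurons) ~ W (time x k) @ H (k x neurons).

    If W is supplied it is held fixed and only the weights H are fitted,
    which is how patterns from one half of the neurons are applied to the other.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12
    fixed = W is not None
    if not fixed:
        W = rng.random((A.shape[0], k)) + eps
    H = rng.random((W.shape[1], A.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        if not fixed:
            W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

def cluster_membership(H):
    """Each neuron joins the cluster whose pattern carries its greatest weight."""
    return np.argmax(H, axis=0)

def held_out_membership(A, k, seed=0):
    """One bootstrap iteration: fit patterns on a random half, label the other half."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(A.shape[1])
    train, test = perm[: len(perm) // 2], perm[len(perm) // 2:]
    W, _ = nmf_mu(A[:, train], k, seed=seed)
    _, H_test = nmf_mu(A[:, test], k, seed=seed, W=W)
    return test, cluster_membership(H_test)
```

Because cluster indices are arbitrary in each fit, comparing memberships across iterations additionally requires matching clusters between runs, which is omitted here.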

Quantification of rhythmic licking modulation in spike PETHs

Neuronal responses modulated by rhythmic licking should show a modulation frequency that matches the rate of licks (approximately 6.5 licks per second during sequence execution), with a phase shift that may vary from neuron to neuron. Therefore, we first quantified the rhythmicity by fitting a sinusoidal function, \(f(t)=A\,\sin (2\pi {\omega }_{{\rm{lick}}}t+\Phi )+C\), to each PETH (Extended Data Fig. 4n), where the free parameter Φ shifts the function in phase, A and C scale and offset the function vertically to match the neuronal firing rate, and \({\omega }_{{\rm{lick}}}\) is a constant of 6.5 Hz. Next, a Pearson’s correlation coefficient (r) was computed between each mid-sequence PETH and its best-fitted sinusoid. Every neuron had two r values, one for each sequence direction. The final rhythmicity was represented by the average of the two (\({r}_{{\rm{avg}}}\)).
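Because the frequency is fixed, the sinusoid can be fitted by ordinary least squares after rewriting A sin(2πωt + Φ) as a sin(2πωt) + b cos(2πωt), with A = √(a² + b²) and Φ = atan2(b, a). The sketch below uses this reparameterization, which is an assumption about how the fit could be done and not necessarily the fitting routine used for the paper. The function name is hypothetical.

```python
import numpy as np

def rhythmicity(peth, t, freq=6.5):
    """Fit A*sin(2*pi*freq*t + phi) + C and correlate the fit with the PETH.

    Returns (r, A, phi, C), where r is the Pearson correlation between the
    PETH and its best-fitting sinusoid.
    """
    phase = 2 * np.pi * freq * t
    X = np.column_stack([np.sin(phase), np.cos(phase), np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(X, peth, rcond=None)
    fit = X @ coef
    r = np.corrcoef(peth, fit)[0, 1]
    amplitude = np.hypot(coef[0], coef[1])
    phi = np.arctan2(coef[1], coef[0])
    return r, amplitude, phi, coef[2]

def average_rhythmicity(peth_lr, peth_rl, t, freq=6.5):
    """Average r over the two sequence directions."""
    return 0.5 * (rhythmicity(peth_lr, t, freq)[0] + rhythmicity(peth_rl, t, freq)[0])
```

A PETH with no variance gives an undefined correlation, so such units would need to be handled separately.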

PCA

The input to PCA was the normalized spike rates of simultaneously recorded single units and multi-units (Extended Data Fig. 4d). The original spike rates were first computed by temporal binning (2.5-ms bin size, that is, 400 samples per second) of spike times followed by smoothing (15-ms s.d. Gaussian kernel). To obtain normalized spike rates, we divided the original spike rates by the maximum spike rate or 5 Hz, whichever was greater. We adopted this ‘soft’ normalization technique29 to prevent weakly firing units from contributing as much variance as actively firing units. The percent variance explained by principal components was simply derived from the singular values.
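The soft normalization and the variance calculation can be sketched as below, assuming a samples × units array of spike rates. The function names are hypothetical. Squared singular values of the mean-centred data are proportional to the variance along each principal component, so their normalized values give the percent variance explained.

```python
import numpy as np

def soft_normalize(rates, floor_hz=5.0):
    """Divide each unit by max(its peak rate, floor_hz).

    rates: (n_samples, n_units) array in spikes per second.
    """
    denom = np.maximum(rates.max(axis=0), floor_hz)
    return rates / denom

def pca_variance_explained(X):
    """Percent variance explained by each principal component, from singular values."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return 100.0 * s ** 2 / np.sum(s ** 2)
```

With a 5-Hz floor, a unit peaking at 20 spikes per second is scaled to a peak of 1, whereas a unit peaking at 2 spikes per second is scaled to a peak of only 0.4 and therefore contributes less variance.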

Linear regression and decoding

A linear model can be expressed as

$${y}_{t}={w}^{1}{r}_{t}^{1}+{w}^{2}{r}_{t}^{2}+{w}^{3}{r}_{t}^{3}+\ldots +{w}^{n}{r}_{t}^{n}+c+{\epsilon }_{t}={{\bf{r}}}_{t}^{{\rm{\top }}}{\bf{w}}+c+{\epsilon }_{t}$$

where t is the time in a recording session, n is the number of simultaneously recorded units, \({y}_{t}\) is the behavioural variable at t, \({r}_{t}^{i}\) is the normalized spike rate of the i-th unit at t, \({w}^{i}\) is the regression coefficient for the i-th unit, c is the intercept, \({\epsilon }_{t}\) is the error term, and \({{\bf{r}}}_{t}^{{\rm{\top }}}{\bf{w}}\) is the matrix notation form of the summed multiplications.

The normalized population spike rates were computed in the same way as those for PCA. Note that, although the normalization was only necessary for PCA, it did not affect the goodness of fit, \({R}^{2}\), of linear models. The behavioural variable was either tongue length (L), tongue velocity (L′), tongue angle (θ), sequence identity (I), target position (TP) or relative sequence time (τ) (Fig. 3a, Extended Data Figs. 5, 7). L, L′ and θ were directly available at 400 samples per second. However, these variables had values only when the tongue was outside of the mouth. Therefore, samples without observed values were either set to zero (for L) or excluded from regression (for L′ and θ). I was defined as 1 if the sequence was from right to left and 2 if left to right. τ simply took sample timestamps as its values. TP was the same as I but defined based on the upcoming sequence.

Predicting single responses with dozens of predictors is prone to overfitting. Therefore, we chose the elastic-net56 variant of linear regression (using the MATLAB function ‘lasso’ with ‘Alpha’ set to 0.1), which penalizes big coefficients for redundant or uninformative predictors. A parameter λ controls the strength of this penalty. To find the best λ, we configured the lasso function to compute a tenfold cross-validated mean squared error (cvMSE) of the fit for a series of λ values. The smallest cvMSE indicates the best generalization, that is, the least overfit. We conservatively chose the largest λ value such that the cvMSE was within one standard error of the minimum cvMSE. For each model, we derived the \({R}^{2}\) from this cvMSE and reported it in Fig. 3 and Extended Data Figs. 5, 7.
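The one-standard-error rule for choosing λ can be illustrated as follows, given the cvMSE and its standard error for each candidate λ. The function names are hypothetical, and the conversion from cvMSE to R² as 1 − cvMSE / Var(y) is an assumption, as the exact formula is not stated in the text.

```python
import numpy as np

def pick_lambda_1se(lambdas, cv_mse, cv_se):
    """Largest lambda whose cvMSE is within one standard error of the minimum cvMSE."""
    lambdas, cv_mse, cv_se = (np.asarray(v, dtype=float) for v in (lambdas, cv_mse, cv_se))
    i_min = np.argmin(cv_mse)
    within = cv_mse <= cv_mse[i_min] + cv_se[i_min]
    return lambdas[within].max()

def r2_from_cvmse(cv_mse, y):
    """Assumed conversion: fraction of the variance of y not left in the cvMSE."""
    return 1.0 - cv_mse / np.var(y)

# Example with four candidate penalties.
lam = pick_lambda_1se(
    lambdas=[0.001, 0.01, 0.1, 1.0],
    cv_mse=[1.00, 0.90, 0.95, 1.50],
    cv_se=[0.10, 0.08, 0.10, 0.20],
)
```

In this example the minimum cvMSE is 0.90 with a standard error of 0.08, so any λ with a cvMSE of at most 0.98 qualifies, and the largest such λ is chosen in preference to the one at the minimum.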

Linear decoding can be expressed as

$${\hat{y}}_{t}={w}^{1}{r}_{t}^{1}+{w}^{2}{r}_{t}^{2}+{w}^{3}{r}_{t}^{3}+\ldots +{w}^{n}{r}_{t}^{n}+c={{\bf{r}}}_{t}^{{\rm{\top }}}{\bf{w}}+c$$

where \({\hat{y}}_{t}\) is the decoded behavioural variable at t, w and c are the coefficients obtained from regression, and \({{\bf{r}}}_{t}\) is the vector of normalized population spike rates at t. We did not perform additional cross-validation in decoding because (1) 30% of the decoding for standard sequences (0.5–0.8 s in Fig. 3 and −1.3 to −1 s in Extended Data Fig. 7) was from new data; (2) all decoding in backtracking sequences and during consumption periods was from new data; and (3) λ had already been selected by cross-validation to give the model that generalized best.

The matrix notation form of the equation, \({{\bf{r}}}^{{\rm{\top }}}{\bf{w}}\), shows that the linear decoding can be geometrically interpreted as projecting the vector of population spike rates r onto the axis in the direction of vector w, and reading out the length of the projection (scaled by ||w||, plus the intercept c). We therefore referred to this axis as the coding axis. To compute the variance explained for each coding axis, we first obtained its unit vector and projected population spike rates onto it. The variance of the projected values is Var(explained). The total variance, Var(total), of the population activity is the sum of variance of all units. Finally, variance explained equals Var(explained) / Var(total) × 100%.
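The geometric reading of the decoder and the variance-explained calculation can be sketched as below, assuming a samples × units array R of normalized spike rates and a weight vector w from regression. The function names are hypothetical.

```python
import numpy as np

def decode(R, w, c):
    """Linear readout: projection onto the unit coding axis, scaled by ||w||, plus c."""
    norm = np.linalg.norm(w)
    projection = R @ (w / norm)
    return norm * projection + c  # identical to R @ w + c

def coding_axis_variance_explained(R, w):
    """Percent of total population variance that lies along the coding axis."""
    unit = w / np.linalg.norm(w)
    var_explained = np.var(R @ unit)
    var_total = np.var(R, axis=0).sum()
    return 100.0 * var_explained / var_total

# Two uncorrelated units with variances 4 and 1; the axis points along unit 1.
R = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
w = np.array([2.0, 0.0])
pct = coding_axis_variance_explained(R, w)
```

In this toy example the axis captures the variance of the first unit only, that is, 4 out of a total of 5.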

Support-vector machine classification

First, to prepare a denoised version of the predictors for more robust classification, we performed PCA with normalized population spike rates, and projected the spike rates onto the first 12 principal components. The projected activity was then downsampled from 400 to 66.7 samples per second (Fig. 3f) or 200 samples per second (Fig. 4f, g) to reduce subsequent computation time. Class labels were the sequence identity values, including standard versus backtracking types (Fig. 3f), or the two types of zigzag sequence (Fig. 4).

Classification was performed independently for each time bin with the MATLAB ‘fitcsvm’ function. Linear kernels were used for all classifications. Trials were weighted so that the chance classification accuracy was 0.5 even if the two classes did not have equal numbers of trials. The results were computed with tenfold cross-validation. All other function parameters were kept as the defaults. The null classification results were obtained using the same procedure but with randomly shuffled class labels.
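One way to weight trials so that chance accuracy is 0.5 with unequal class sizes is to give every class the same total weight. The sketch below shows such a scheme and the corresponding weighted accuracy. The exact weighting used with ‘fitcsvm’ is not given in the text, so this scheme and the function names are assumptions.

```python
import numpy as np

def balanced_weights(labels):
    """Per-trial weights such that each class carries the same total weight."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    per_class = {c: 1.0 / (len(classes) * n) for c, n in zip(classes, counts)}
    return np.array([per_class[label] for label in labels])

def weighted_accuracy(y_true, y_pred, weights):
    """Fraction of total weight assigned to correctly classified trials."""
    correct = np.asarray(y_true) == np.asarray(y_pred)
    return np.sum(weights * correct) / np.sum(weights)

# 30 trials of one class and 10 of the other.
y = np.array([0] * 30 + [1] * 10)
wts = balanced_weights(y)
always_zero = weighted_accuracy(y, np.zeros_like(y), wts)
```

Without weighting, a classifier that always predicts the larger class would score 0.75 on this example, whereas with balanced weights it scores at chance.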

Canonical correlation analysis

The canonical correlation analysis seeks linear transformations of two vectors of random variables such that the Pearson’s correlation coefficients between the transformed vectors are maximized:

$$\mathop{{\rm{arg}}\,{\rm{max}}}\limits_{{{\bf{a}}}_{i},{{\bf{b}}}_{i}}\,{\rm{corr}}({{\bf{U}}}_{i},{{\bf{V}}}_{i}),\,{{\bf{U}}}_{i}={{\bf{a}}}_{i}^{{\rm{\top }}}{\bf{X}},\,{{\bf{V}}}_{i}={{\bf{b}}}_{i}^{{\rm{\top }}}{\bf{Y}},\,i=1,2,\ldots ,n$$

where X and Y are vectors of random variables, \({{\bf{a}}}_{i}\) and \({{\bf{b}}}_{i}\) are transformation vectors for the i-th iteration, and n is the number of dimensions in X or Y, whichever is smaller. Matrices A and B will be used to represent the concatenated transformation vectors across all iterations.

In the present analysis, X and Y were matrices of sampled data for each session. X contained the time series of the decoded behavioural variables (L, L′, θ, I, τ; zero centred). Y contained the projection of neural activity onto the top principal components obtained from PCA. We focused our analysis on standard sequences, with a time window of −0.5 to 0.8 s relative to the middle touch. The linearly decoded or principal component-projected data were averaged across trials with the same sequence direction. Averaged data from the two sequence directions were concatenated along time.

Canonical correlations were computed using the MATLAB ‘canoncorr’ function between matrices with a selected subset of dimensions. In Fig. 3d, Y was transformed using \({({{\bf{A}}}^{{\rm{\top }}})}^{-1}{{\bf{B}}}^{{\rm{\top }}}{\bf{Y}}\) so that the pattern could be best aligned with the patterns of X. In Fig. 3e, n correlation coefficients (r) quantified the correlation between each pair of \({{\bf{U}}}_{i}\) and \({{\bf{V}}}_{i}\). The average r across the n values reflected the overall alignment between the two transformed matrices.
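For readers without MATLAB, canonical correlations can be obtained from QR decompositions followed by a singular value decomposition, as in the NumPy sketch below. This is an illustrative re-implementation and not the ‘canoncorr’ source: it assumes that both matrices have full column rank, stores samples in rows (the transpose of the convention in the equation above), and scales the transformation vectors differently from MATLAB.

```python
import numpy as np

def canoncorr(X, Y):
    """Canonical correlation analysis via QR and SVD.

    X: (n_samples, p), Y: (n_samples, q), both assumed to be full column rank.
    Returns A (p x d), B (q x d) and the d canonical correlations, d = min(p, q).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    d = min(X.shape[1], Y.shape[1])
    A = np.linalg.solve(Rx, U[:, :d])     # columns are the vectors a_i
    B = np.linalg.solve(Ry, Vt.T[:, :d])  # columns are the vectors b_i
    return A, B, np.clip(s[:d], 0.0, 1.0)

def mean_canonical_correlation(X, Y):
    """Average r across the canonical pairs, as a summary of overall alignment."""
    return canoncorr(X, Y)[2].mean()
```

The correlations are returned in descending order, so the first pair of transformed variables is the most strongly correlated.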

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.