Introduction

Because of working memory’s limited capacity, we need to flexibly recruit long-term memory to accomplish our everyday goals. For example, when performing a routine grocery shopping trip, it is not feasible to hold 20 different items actively in working memory. However, by taking advantage of associations in long-term memory, we can strategically retrieve “chunks” of items to effectively shop for all 20 items (e.g., “I should buy the ingredients I need to make Caesar salad, lasagna, and tiramisù”) (Bower, 1972; Cowan, 2001; Ebbinghaus, 1885, 1913). Arguably, scenarios like this one are the most common way that we use working memory in the real world. We frequently need to flexibly shuttle information back and forth between working and long-term memory. Doing so allows us to take advantage of the strengths of each memory system: information in working memory is easily accessible and manipulable but capacity-limited (Baddeley & Hitch, 1974; Cowan, 2001; Luck & Vogel, 1997), whereas information in long-term memory is nearly unlimited in capacity but often takes time and effort to retrieve (Beck & van Lamsweerde, 2011; Brady et al., 2008; Mandler & Ritchey, 1977; Squire & Zola, 1996; Standing, 1973; Standing et al., 1970; Wolfe et al., 2023).

Probing interactions of visual working and long-term memory

Although working and long-term memory are typically used in tandem, researchers make an effort to devise working memory experiments that prevent the contribution of long-term memory so that the unique constraints of working memory can be better characterized. For example, visual working memory is often studied by asking people to remember simple shapes or colors. Because these abstract items have low meaningfulness, these tasks help us to estimate the capacity of working memory in the absence of support from long-term memory associations (see Footnote 1). Over many decades, careful work dissociating working and long-term memory has been important for our understanding of these memory systems and their neural correlates (Baddeley & Warrington, 1970; Christophel et al., 2017; Jeneson & Squire, 2011; Luck & Vogel, 1997; Milner & Penfield, 1955; Scoville & Milner, 1957; Serences et al., 2009). However, by focusing primarily on each memory system in isolation, we may miss important insights about how information is flexibly shuttled between both working and long-term memory in everyday life. One approach to closing this gap has been to use more realistic stimuli in working memory tasks, such as photographs of real-world objects and scenes (Brady et al., 2016; Brady & Störmer, 2022; Endress & Potter, 2014; Quirk et al., 2020). One advantage of these stimuli is that they may allow participants to draw on long-term memory via familiarity and meaningful associations (Asp et al., 2021; Jackson & Raymond, 2008; Ngiam et al., 2019b; Reder et al., 2013; Xie & Zhang, 2017, 2022). However, a disadvantage of using real-world objects is that they have associations in long-term memory that are already pre-formed, and the experimenter cannot directly control or observe how the formation of long-term memory associations influences working memory performance.

Rather than changing the memoranda to be more realistic, a second approach for studying interactions of working and long-term memory is to use artificial stimuli, but to introduce controlled opportunities for long-term memory to aid performance. In this vein, prior work has examined how incidental repetitions of memoranda may improve visual working memory performance (i.e., via Hebbian or implicit learning). Surprisingly, initial work on this topic found that visual working memory capacity was stubbornly resistant to improvement (Fukuda & Vogel, 2019; Logie et al., 2009; Olson & Jiang, 2004). For example, Olson and Jiang (2004) found that even after 24 repetitions of the same memory array, participants performed no better than as if they were seeing the array for the first time. The lack of effect of repetitions on visual working memory performance is puzzling, because it contrasts with a rich body of work that shows that memory for verbal memoranda is improved with incidental repetitions (Hebb, 1961; Page et al., 2013; Sukegawa et al., 2019). As such, recent work has begun to systematically explore which factors may prevent versus allow Hebbian learning from incidental repetitions of visual working memory arrays (Musfeld et al., 2023a, 2023b; Souza & Oberauer, 2022). For example, Musfeld et al. (2023b) found that retrieval practice and the expected difficulty of the test both influence whether or not working memory performance improves when arrays are incidentally repeated over time.

Here, we turned our focus from incidental to explicit repetitions of working memory arrays. Explicit repetition of visual working memory arrays has been infrequently examined, so our main goal was to characterize how quickly and to what extent participants can use long-term memory to overcome visual working memory capacity limits when directed to do so intentionally. Indeed, prior work suggests that working memory plays a particularly important role when learning is intentional as opposed to incidental. For example, Unsworth and Engle (2005) found that individual differences in working memory capacity predicted learning in a serial reaction time task in conditions with intentional, but not incidental, learning. To this end, we devised an experimental paradigm to probe the explicit coordination of working and long-term memory. Specifically, we simply instructed participants that the same visual array would repeat for many trials in a row, and that they should use any strategy available to them to improve their performance across repetitions. We predicted that we should initially find that participants’ performance is bound by typical working memory capacity limits (i.e., ~3 items), but after many repetitions participants may begin to use long-term memory to augment performance.

Inter-response times as a measure of chunk formation

In addition to examining how accuracy improves with repetitions, we also planned to examine how response latencies may track the formation of new “chunks” in a visuospatial memory task. Here, we used a “whole-report” visual working memory task, where participants are asked to report the color of all memory items. Because participants report multiple items, we may examine not only the number of correctly reported items, but also how quickly participants make individual responses. In particular, we were inspired by prior work measuring inter-response times, defined as the time between pairs of responses as participants make many responses in a row (Anderson & Matessa, 1997; Broadbent, 1975; Browman & O’Connell, 1976; Chase & Ericsson, 1982; Chase & Simon, 1973; Lovelace & Snodgrass, 1971; Lovelace & Spence, 1972; Murdock & Okada, 1970; Reitman, 1976). During free recall and search through long-term memory, inter-response times tend to increase as the memory set (see Footnote 2) is exhausted (Bousfield & Sedgewick, 1944; Murdock & Okada, 1970; Rohrer, 1996; Wixted & Rohrer, 1994). In addition to this general slowing over time, clustering of inter-response times can be a useful, quantifiable signature of chunk utilization, whereby “intra-chunk” response times are faster than “inter-chunk” response times (McLean & Gregg, 1967). McLean and Gregg (1967) articulated a framework in which chunks may be formed in three key ways: (1) via prior knowledge, (2) via grouping cues during encoding, and (3) via top-down associations (i.e., new associations formed by attention, rehearsal, or some other process; see Footnote 3). This framework has remained important for later theories of the role of chunking in working memory (e.g., Cowan, 2001).

Inter-response time signatures of chunking have previously been observed when chunks are formed via prior knowledge or during encoding. First, studies examining recall of previously learned sets (e.g., countries of Europe) have found slowing of inter-response times in clusters of three to four items (Broadbent, 1975; Graesser & Mandler, 1978). Second, studies introducing grouping cues during encoding of letter and digit sequences have found a slowing of inter-response times when the recall of a new group begins (Anderson & Matessa, 1997; McLean & Gregg, 1967). Few studies, to date, have looked at inter-response time signatures of chunking when groups are formed via new top-down associations, as we plan to do here (Miller & Unsworth, 2018). However, other putative signatures of chunk utilization have been observed when observers repeatedly recall a word list multiple times in a row (i.e., “multitrial free recall”). Rather than response times, prior work has examined response consistency during multitrial free recall. Namely, when participants encode the same word list multiple times in a row (with the words presented in a randomized order during each list presentation), participants begin to recall the items in a consistent order each time they recall the list (Dunlosky & Salthouse, 1996; Miller & Unsworth, 2018; Sternberg & Tulving, 1977; Tulving, 1962, 1966).

In the current study, we were particularly interested in observing how chunks formed via new, top-down associations may benefit performance in a visual memory task. Our whole-report task with explicit repetitions is a novel, visuospatial analogue of the verbal “multitrial free recall” task. The present study will test if classic behavioral signatures of chunk utilization that have been established with verbal memoranda will generalize to visuospatial tasks. Few studies have examined inter-response times in the context of visual memory (with the notable exception of expertise and chess; Chase & Simon, 1973; Reitman, 1976), in part because of the popularity of change detection measures of visual memory that collect only a single response on each trial (e.g., Luck & Vogel, 1997). In the present experiments, we predicted that when participants recruit long-term memory to improve performance beyond typical capacity limits, we would see behavioral signatures of chunking such as “pauses” in the inter-response times and/or increased consistency of response order during recall.

Summary

To preview results, we found a rapid and robust effect of explicit repetitions on performance. Even after only one repetition, participants’ performance exceeded typical capacity limits. By the eighth and final repetition of an array, the modal outcome was perfect performance (six of six items correct in Experiment 1; eight of eight in Experiment 2). In addition, an analysis of inter-response times was consistent with the idea that participants organize their responses by retrieving “chunks” from long-term memory. Together, these findings illustrate how long-term memory may rapidly assist cognition in tasks that overwhelm working memory capacity, and show that inter-response times can be used to track the formation and deployment of chunking strategies in visual memory tasks.

Experiment 1: Six-item arrays

Methods

Participants

Participants were recruited from the University of Chicago and the surrounding community. Participants provided informed consent under procedures approved by the University of Chicago Institutional Review Board. All participants (27 female, 25 male) were 18 years or older (M = 21.88 years, SD = 3.66 years, range = [18,36]), had self-reported normal color vision and normal or corrected-to-normal visual acuity, and received course credit or cash ($10/h) for their participation. A total of 52 participants took part in the study. Data from two participants were excluded for failure to comply with task instructions (i.e., chance-level performance), leaving a total of 50 participants for analysis. The study procedures were not pre-registered, and the sample size was determined by convenience (i.e., data collection up to a conference deadline). With 50 subjects, we would be powered to detect medium within-subjects effects at 90% power (e.g., within-subjects t-test, critical t(49) = 2.01, dz = .47; repeated-measures ANOVA with one within-subjects factor (e.g., repetition, eight levels), critical F(7,343) = 2.04, Cohen’s f = 0.15, η² = 0.02) (Faul et al., 2007; Kim, 2016).

Stimuli

Participants were seated in a dimly lit room and viewed a 24-in. BenQ LCD monitor with a 1,920 x 1,080 resolution from a distance of ~67 cm. Stimuli were presented with MATLAB (The MathWorks, Natick, MA, USA) and the Psychophysics toolbox (Brainard, 1997; Kleiner et al., 2007). A fixed set of nine highly discriminable colors were used for the colored square stimuli in all three memory tasks (red: 255 0 0; green: 0 255 0; blue: 0 0 255; yellow: 255 255 0; magenta: 255 0 255; cyan: 0 255 255; white: 255 255 255; black: 1 1 1; orange: 255 128 0), and colors were always chosen without replacement for each memory array. Throughout each task, a black fixation dot was drawn at the center of the screen (radius = 6 pixels, 0.14°) and stimuli were presented on a medium-gray background (RGB = 85 85 85).

Discrete whole-report task with repetitions

A total of 30 unique arrays were generated by picking six semi-random locations and assigning a unique color (drawn from the set of nine possible colors) to each location. The locations were semi-random in that they were chosen with some constraints, such that items were separated by a minimum distance of 36 pixels (~0.9° of visual angle) and were split evenly across the left and right hemifields. Each colored square had a diameter of 72 pixels (~1.7°) and the possible locations were in a portion of the screen centered on fixation and subtending 1,066 x 600 pixels (7.1° above/below fixation and 12.6° left/right of fixation).
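The constraints described above can be illustrated with a short sketch. This is not the authors' MATLAB code: the function name `make_array`, the use of rejection sampling, the choice to apply the 36-pixel minimum to center-to-center distances, and the 1-pixel offset from the vertical midline are all assumptions made for illustration. Coordinates are in pixels relative to fixation.

```python
import math
import random

# RGB values as listed in the Stimuli section.
COLORS = {
    "red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255),
    "yellow": (255, 255, 0), "magenta": (255, 0, 255), "cyan": (0, 255, 255),
    "white": (255, 255, 255), "black": (1, 1, 1), "orange": (255, 128, 0),
}
REGION_W, REGION_H = 1066, 600  # allowed region, centered on fixation
MIN_DIST = 36                   # assumed here to be center-to-center


def make_array(n_items=6, rng=None, max_tries=10_000):
    """Return a list of ((x, y), color_name) pairs for one memory array."""
    rng = rng or random.Random()
    locations = []
    for side in (-1, 1):  # left hemifield, then right hemifield
        placed, tries = 0, 0
        while placed < n_items // 2:
            tries += 1
            if tries > max_tries:
                raise RuntimeError("could not satisfy the spacing constraint")
            # Assumption: keep centers at least 1 px off the vertical midline.
            x = side * rng.uniform(1, REGION_W / 2)
            y = rng.uniform(-REGION_H / 2, REGION_H / 2)
            # Rejection sampling: keep the candidate only if it is far enough
            # from every location accepted so far.
            if all(math.dist((x, y), loc) >= MIN_DIST for loc in locations):
                locations.append((x, y))
                placed += 1
    # Colors are drawn without replacement from the nine-color set.
    colors = rng.sample(list(COLORS), n_items)
    return list(zip(locations, colors))


example = make_array(6, random.Random(0))
```

Passing a seeded `random.Random` instance makes a given array reproducible, which is convenient when the same array must be shown on several consecutive trials.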

Surprise long-term memory recognition task

For the long-term memory recognition task, we showed participants a total of 60 arrays (30 old, 30 new). The old arrays were identical to those used in the whole-report task. The new arrays were randomly generated with the same size, color, and location constraints as in the working memory task.

Color change detection task

On each trial, a new array containing four, six, or eight colored squares was drawn. The stimuli were the same size and drawn in the same nine colors as the whole-report task, and the same location constraints were used.

Procedures

Participants completed a discrete whole-report task with array repetitions, a surprise long-term memory recognition task for the arrays presented in the whole-report task, and a color change detection task. These three tasks were presented in a fixed order for all participants.

Repeated-arrays working memory task

Participants completed a variant of a discrete whole-report working memory task (Adam et al., 2015; Huang, 2010) in which arrays repeated eight trials in a row. On each trial, participants saw a briefly presented array of six colored squares (150 ms). After a short delay (1 s), participants reported the colors of the squares. A 3 x 3 grid of possible color choices appeared at each location, and participants selected the color that belonged at each response grid location. Participants were required to make a response to all six squares before they could advance to the next trial. After the last response was made, the next trial began after an inter-trial interval of 1 s. Critically, the same array was repeated eight times in a row. On the first trial of a set of repetitions, a new configuration of square colors and locations was randomly chosen. This array was then used for the next seven working memory trials in a row (i.e., trials 1–8 were array #1, trials 9–16 were array #2, etc). Participants were given explicit instructions that each unique array would be repeated for eight trials in a row, and that they should try to improve their performance across the eight repetitions. Participants completed a total of 240 working memory trials (eight trials each of 30 unique arrays).

Old-new recognition task

After completing the repeated arrays working memory task, participants completed an old-new recognition task for the 30 arrays that were used in the repeated arrays working memory task. Participants were not informed beforehand that they would be tested on their long-term memory of the arrays in the previous task. On each trial, participants viewed an array of colored squares. On half of the trials, the participants were shown an old array (an array that was previously seen in the discrete whole-report task). On the other half of the trials, the participants were shown a new, randomly generated array with the same stimulus constraints (i.e., six colored squares drawn at new random locations). Participants reported via keypress if they thought the array was “old” (“Z” key) or “new” (“/” key) and they reported their confidence about the decision on a 5-point scale (using the number keys 1–5 on the keyboard). All responses were unspeeded.

Color change detection task

To assess baseline working memory performance with an independent task, we used a standard color change detection task (Luck & Vogel, 1997). On each trial, participants saw a briefly presented array of four, six, or eight colored squares (150 ms), and remembered the colors of the squares across a blank delay (1 s). At test, a memory probe was shown at one of the squares’ locations. On half of the trials (“same” trials), the probe was the same color as the remembered item at the same location. On the other half of trials (“different” trials), the probe was a different color. Participants responded via keypress whether they thought the probe square was the same (“Z” key) or different (“/” key) from the remembered color of the square presented at the probe’s location. Participants completed a total of 180 trials of the color change detection task (60 trials per set size).

Analysis

Analyses were performed using MATLAB 2018a (The MathWorks) and Python 3.9.7 (conda 4.12.0). Data from the raw .mat files were processed in MATLAB and converted into aggregate .csv files for the main analyses in Python. Key open source packages for Python analyses include Jupyter (Kluyver et al., 2016), pandas (McKinney, 2010), seaborn (Waskom, 2021), pingouin (Vallat, 2018), and pymer4 (Jolly, 2018). The task design is illustrated in Fig. 1.

Results

Performance rapidly increased across array repetitions

To characterize how performance changed as a function of repetition, we first analyzed mean performance (Fig. 2A). In the whole-report task, performance is quantified as the number of locations for which participants correctly recalled the item’s color, and this value ranges from 0 to 6 on each trial. The first time the participants saw an array (Repetition 1), mean performance was in line with typical estimates of working memory capacity (M = 2.79 items correct, SD = 0.45). Mean performance significantly increased across repetitions, F(7,343) = 330.6, p < 1 × 10⁻⁴⁵, ηp² = 0.871 (see Footnote 4). By the final repetition, participants’ performance was near ceiling and had nearly doubled from the first repetition (M = 5.32, SD = 0.91). To quantify the rate of performance improvement on average, we calculated difference scores for adjacent repetitions (e.g., Repetition 2 minus Repetition 1, Repetition 3 minus Repetition 2, etc.). On average, participants’ performance improved by 0.36 items per repetition (SD = .11), with faster learning across the first four repetitions (M = .71, SD = .24) compared to the last four repetitions (M = .10, SD = .09), t(49) = 16.4, p < 1 × 10⁻²⁰. In Experiment 1, the ceiling was six items correct. As such, the slowing of learning at later repetitions may have been driven by participants hitting the performance ceiling for the task.
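The scoring rule and the adjacent-repetition difference scores can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: the function names are hypothetical, and the intermediate means for Repetitions 2 through 7 are made-up placeholders (only the Repetition 1 and Repetition 8 means, 2.79 and 5.32, come from the text). Because adjacent differences telescope, their average depends only on the first and last values.

```python
from statistics import mean


def n_correct(reported, presented):
    """Number of locations where the reported color matches the presented one."""
    return sum(r == p for r, p in zip(reported, presented))


def adjacent_differences(means_by_repetition):
    """Difference scores for adjacent repetitions (Rep 2 - Rep 1, Rep 3 - Rep 2, ...)."""
    return [later - earlier
            for earlier, later in zip(means_by_repetition, means_by_repetition[1:])]


# Endpoints are the reported means; the six values in between are hypothetical.
means = [2.79, 3.80, 4.40, 4.80, 5.00, 5.15, 5.25, 5.32]
diffs = adjacent_differences(means)

# The mean of adjacent differences equals (last - first) / (number of steps).
average_gain = mean(diffs)
print(round(average_gain, 2))  # 0.36, i.e., (5.32 - 2.79) / 7
```

The same average gain of about 0.36 items per repetition is reported in the text, which is what the endpoint means imply regardless of the shape of the learning curve in between.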

Fig. 1

(A) Repeated arrays working memory task. On each trial, participants remembered an array of colored squares across a blank delay. At test, participants used the mouse to report which color was presented at each of the locations. For example, if the upper right square was blue, the participant clicked the blue portion of the mask. Each memory array was repeated for eight trials in a row. Participants were instructed to try to improve their performance across the eight repetitions. (B) Old-new recognition task. After completing the repeated arrays working memory task, participants were given an old-new recognition task (50% old arrays from the previous task, 50% new arrays). Participants reported whether they thought the array was old or new, as well as their confidence in their decision (5-point scale)

Fig. 2

Working memory performance as a function of array repetition in Experiment 1. (A) Mean number correct as a function of repetition. The number of correctly recalled items increased dramatically from the first repetition to the later repetitions. The purple line indicates the mean performance increase; thin gray lines depict individual subjects; shaded error bars represent 68% confidence intervals (approximately equivalent to standard error). (B) Distribution of performance outcomes as a function of repetition. On the first repetition, participants’ performance resembled typical working memory performance (mode = three items correct; few or no responses with six items correct). By the third repetition, the modal tendency was six out of six correct, far exceeding typical working memory capacity limits. Shaded error bars indicate 68% confidence intervals

In addition to looking at mean performance for each repetition, we also looked at the distribution of performance outcomes (Fig. 2B). One notable aspect of the performance distributions is the increase in the number of trials where participants correctly recalled six out of six items. In a typical whole-report working memory task, participants rarely get six items correct, and these rare “perfect” trials can be explained by guessing inflation (i.e., participants never really store six items, but they sometimes get lucky and get six correct by chance because they are required to make a response to every item; see Adam et al., 2015). The first time participants saw an array (Repetition 1), we found a similar pattern of performance. There was a strong modal tendency toward getting three items correct, and participants very rarely got all six items correct (M = 0.6%, SD = 1.87%). As early as the second encounter with the array (Repetition 2), the number of perfect trials increased 25-fold (from 0.6% to 15%). By the final encounter with the array (Repetition 8), the modal tendency was six out of six correct (M = 65.5%, SD = 24.6%).

Inter-response times and chunk utilization

Prior work has hypothesized that inter-response times can be used as a signature of retrieving a new chunk from long-term memory (e.g., Broadbent, 1975). Inter-response times are calculated as the time between successive responses, and a long pause may indicate that a participant is engaging in planning for the next series of responses and/or retrieving information from long-term memory. The inter-response times are shown in Fig. 3A. The response time was longest the first time participants saw an array (Repetition 1), and the successive responses became quicker. Starting on the second repetition of the array (Repetition 2), we observed a marked slowing at the fourth response. A two-way repeated-measures ANOVA with within-subjects factors Response Number and Repetition confirmed that there was a significant interaction of Response Number and Repetition on inter-response times, F(35,1715) = 6.34, p < .001, ηp² = .11. To better understand the effect of repetition on inter-response times, we conducted a separate one-way repeated-measures ANOVA with factor Repetition for each response. To visualize the meaning of these tests, the data for each response are replotted in separate subplots in Fig. 3B.
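As an illustration of the measure described above, inter-response times for a single trial can be derived from response timestamps. The sketch below is hypothetical: the function names, the example timestamps, and the assumption that the first interval is measured from the onset of the response screen are not taken from the original analysis.

```python
def inter_response_times(response_times, screen_onset=0.0):
    """Time between successive responses on one trial.

    Assumption: the first value is the latency from response-screen onset
    to the first response; later values are gaps between responses.
    """
    times = [screen_onset, *response_times]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def slowest_later_response(irts):
    """1-based response number with the longest pause, ignoring Response 1."""
    later = irts[1:]
    return later.index(max(later)) + 2


# Hypothetical timestamps (in seconds) for six responses on one trial.
clicks = [1.2, 1.7, 2.1, 3.4, 3.9, 4.5]
irts = inter_response_times(clicks)
pause_at = slowest_later_response(irts)
```

With these made-up timestamps the longest pause after the first response falls at Response 4, the kind of pattern that would be expected if a second group of three items were retrieved after the first three.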

Fig. 3

Inter-response times as a function of response number and repetition number in Experiment 1. (A) Inter-response times plotted with response number on the X-axis and separate lines for repetition. Asterisks indicate that there was a significant effect of repetition on inter-response times for a one-way ANOVA for that response number (uncorrected; ** p < .01, *** p < .001). (B) Inter-response times re-plotted with repetition on the X-axis and separate subplots for each response. Error bars represent 68% confidence intervals (approximately equivalent to 1 standard error of the mean)

We found significant effects of repetition on inter-response times for the first response, F(7,343) = 8.81, p < .001, ηp² = .15, for the fourth response, F(7,343) = 5.27, p < .001, ηp² = 0.10, for the fifth response, F(7,343) = 8.12, p < .001, ηp² = 0.14, and for the sixth response, F(7,343) = 4.94, p < .001, ηp² = 0.09. In contrast, there was no effect of repetition on inter-response times for Response 2 (p = .92) and Response 3 (p = .74). This pattern of response times would be consistent with a chunking strategy where participants formed an initial chunk of three items on Repetition 1 that they used throughout the subsequent repetitions. However, starting at Repetition 2, participants appear to have become more efficient at using their already formed chunk (faster response times for Response 1 with repetition) and to have devoted extra time during the fourth response to forming and recalling a second chunk of three items (slower response times for Response 4 with repetition).

In addition to response times, we also examined whether participants recalled items in a consistent order by computing transition probabilities between all pairs of items. A transition probability of 100% would indicate that participants reported a pair of items in the same order for all eight repetitions of the array. We found that participants reported items in an order that is more consistent than would be expected by chance (p < .001), with the highest two transition probabilities exceeding 90% (see Online Supplemental Material (OSM) Analysis S2; Fig. S1A). The empirical pattern that we observed is consistent with an account in which participants formed links between the first three items starting on Repetition 1 (i.e., two transition probabilities: Item 1->2 and Item 2->3), and then developed a consistent response order for the remaining items during later repetitions. Furthermore, participants’ response order was more consistent for the later repetitions of the array (Repetitions 5–8) than for the early repetitions of the array (Repetitions 1–4; Fig. S1B (OSM)). This is consistent with the notion that participants first successfully remembered a few items, and then added in more items as the array was repeated (see also Fig. S2A (OSM)).
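One plausible way to operationalize response-order consistency is sketched below. The exact computation used in the study is described in the supplemental material and is not reproduced here, so this sketch is an assumption: it defines the transition probability for an ordered pair of items as the proportion of repetitions on which the second item was reported immediately after the first. The function name and the example report orders are hypothetical.

```python
from collections import Counter


def transition_probabilities(orders):
    """Proportion of repetitions on which item b was reported right after item a.

    `orders` holds one report order per repetition; each order is a sequence
    of item labels (e.g., location indices), with every item reported once.
    """
    counts = Counter()
    for order in orders:
        # Each adjacent pair occurs at most once per repetition.
        counts.update(zip(order, order[1:]))
    n_repetitions = len(orders)
    return {pair: count / n_repetitions for pair, count in counts.items()}


# Hypothetical report orders across eight repetitions of one six-item array.
orders = [
    [0, 1, 2, 3, 4, 5],
    [0, 1, 2, 4, 3, 5],
    [0, 1, 2, 5, 4, 3],
    [0, 1, 2, 3, 5, 4],
    [0, 1, 2, 3, 4, 5],
    [0, 1, 2, 3, 4, 5],
    [0, 1, 2, 3, 4, 5],
    [0, 1, 2, 3, 4, 5],
]
probs = transition_probabilities(orders)
```

In this made-up example the first three items are always reported in the same order (probability 1.0 for the pairs 0 then 1, and 1 then 2), while the order of the remaining items only stabilizes on later repetitions.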

Successful recognition of repeated arrays in an old-new recognition task

We hypothesized that participants were able to exceed typical working memory capacity limits by rapidly recruiting long-term memory. Given this hypothesis, we next tested whether participants could reliably distinguish learned arrays from novel arrays in the old-new recognition task. We quantified long-term memory performance as d-prime (d’; Fig. 4A), calculated as the normalized difference between hit rate and false alarm rate (d’ = z(H) - z(FA), where z() is the z-transform) (e.g., Banks, 1970). Overall recognition performance was d’ = 0.45 (SD = 0.40), and this was significantly above chance, t(49) = 7.94, p < .001 (one-tailed t-test; see Footnote 5). For correlations between individual differences in change detection performance, learning rate in the whole-report task, and recognition memory performance, see Analysis S1 (OSM).
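The d’ formula above can be written out as a short sketch using only the Python standard library. The function name and the example counts are hypothetical. Returning `None` when a rate is exactly 0 or 1 is an illustrative choice, since the z-transform is undefined at those values; other treatments, such as adjusting extreme rates, also exist and are not implied here.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(H) - z(FA), with z the inverse standard normal CDF."""
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    # z() is undefined for rates of exactly 0 or 1.
    if hit_rate in (0.0, 1.0) or fa_rate in (0.0, 1.0):
        return None
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts for a 60-trial test (30 old arrays, 30 new arrays).
chance = d_prime(hits=15, misses=15, false_alarms=15, correct_rejections=15)
better = d_prime(hits=20, misses=10, false_alarms=10, correct_rejections=20)
undefined = d_prime(hits=30, misses=0, false_alarms=10, correct_rejections=20)
```

Equal hit and false alarm rates give d’ = 0 (chance), and a hit rate above the false alarm rate gives a positive d’.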

Fig. 4

Performance in the surprise recognition task. (A) Overall d’: Participants were significantly above chance (d’ > 0) at distinguishing old arrays from new, randomly generated arrays. Purple shaded outline depicts the distribution of the data. White dots show individual subjects’ scores. (B) Histogram of confidence scores (1 = lowest confidence, 5 = highest confidence). The purple line represents the mean proportion that a confidence rating was used, error bars indicate 68% confidence intervals (approximately equivalent to standard error). Gray lines depict individual subject histograms. (C) Performance split by low confidence (< 3) versus high confidence (> 3) for the n = 41 subjects with sufficient trial counts in both confidence bins. Purple shaded outlines depict the distribution of the data. Black line and error bars represent the mean scores and 68% confidence intervals

We next examined whether recognition memory performance varied as a function of confidence. Figure 4B shows the distribution of confidence ratings. Overall, the distribution of confidence scores was fairly even; a one-way repeated-measures ANOVA revealed no difference in the frequency with which participants used each confidence level (p = .12). To examine recognition memory performance as a function of confidence, we divided trials into “low-confidence” (< 3) and “high-confidence” (> 3) bins. Given the total number of trials available for analysis (60), not all participants had sufficient numbers of trials to determine d’ for both the low- and high-confidence bins (i.e., 0 hit or false alarm trials in a given confidence bin). After excluding subjects with insufficient data, there were 41 subjects available for a within-subjects analysis. A paired t-test revealed a significant effect of confidence on memory performance, t(40) = 2.72, p = .009, where memory performance was significantly better for high-confidence trials (M = 1.23, SD = 1.56) compared to low-confidence trials (M = 0.32, SD = 1.24) (Fig. 4C; see Footnote 6). Memory performance was significantly above chance for high-confidence trials (p < .001) but not for low-confidence trials (p = .06).

Experiment 2: Eight-item arrays

To replicate our results, we performed a second experiment that closely parallels Experiment 1. The only key change that we made was to increase the set size to eight items instead of six items for each array. By raising the set size, we increased the potential performance ceiling even further beyond typical capacity limits of three to four items.

Methods

Participants

An additional 60 participants were recruited from the University of Chicago and the surrounding community (32 female, 28 male; mean age = 21.29 years, SD = 3.20, range = [18,35]). A total of three participants were used in pilot sessions to test the task code and the length of the session (e.g., different numbers of unique arrays and repetitions); five participants were excluded because of incomplete datasets (two computer crashes leading to lost data; three participants did not complete all three of the memory tasks), leaving a total of 52 participants for analysis. This study was designed as a close replication of Experiment 1, but the study procedures were not formally pre-registered; the sample size was chosen to approximately match Experiment 1.

Stimuli and procedures

The stimuli and procedures in Experiment 2 were nearly identical to those in Experiment 1 with the following changes. First, in the repeated working memory task and old-new recognition task, each memory array contained eight squares (rather than six squares) and participants saw a total of 27 unique arrays in the repeated working memory task (rather than 30). Second, participants were given a short survey at the end of the task in which they answered a free-response question about any strategies used and made numerical ratings of the number of items they thought they got correct on average and their overall feelings of effort, boredom, drowsiness, enjoyment, frustration, motivation, challenge, and distraction during the experiment.

Results

Replication of key results: Rapid learning and above-chance recognition

We replicated the main findings that working memory performance rapidly improved across repetitions and that participants could reliably distinguish learned arrays from novel arrays in a later test. Participants’ overall performance rapidly increased across the eight repetitions, F(7,357) = 213.8, p < .001, ηp² = .81 (Fig. 5A). After the first encounter with a new array (Repetition 1), performance was in line with typical estimates of working memory capacity (M = 2.77 items, SD = 0.51). By the last repetition (Repetition 8), overall performance had more than doubled (M = 6.03 items, SD = 1.54). Figure 5B shows a histogram of trial outcomes (i.e., the proportion of trials where participants got zero through eight items correct). We again found that performance resembled typical results from whole-report tasks on the first repetition, with a modal tendency of two to three items and no trials with perfect performance (mean proportion of trials with eight out of eight correct on Repetition 1 = 0%, SD = 0%). By the final repetition, the modal tendency was eight out of eight correct (M = 40.6% of trials, SD = 26.7%). We again calculated difference scores for adjacent repetitions (e.g., Repetition 2 minus Repetition 1, Repetition 3 minus Repetition 2, etc.). On average, participants’ performance improved by 0.47 items per repetition (SD = .19), with faster learning across the first four repetitions (M = .73, SD = .39) compared to the last four repetitions (M = .26, SD = .11), t(51) = 8.76, p < 1 × 10⁻¹¹.

Fig. 5

Working and long-term memory performance in Experiment 2. (A) Improvement in mean performance across repetitions. The blue line represents average performance, with shaded error bars indicating 68% confidence intervals (approximately equivalent to standard error). Thin gray lines represent individual participants. (B) Histogram of trial outcomes, from zero to eight items correctly recalled. (C) Overall recognition memory performance (d’). The blue violin shows the shape of the distribution; the white dots show individual participant values. (D) Distribution of confidence ratings from 1 (lowest confidence) to 5 (highest confidence). The blue line represents the average distribution; the thin gray lines show individual participants’ data. (E) Recognition memory performance split by high (> 3) and low (< 3) confidence. The blue violins show the distribution of the data; the black line with error bars shows the mean d’ values with 68% confidence intervals. (We used version 0.11.2 of the plotting package seaborn, which only allows one error bar type, the confidence interval. We wanted to plot approximately 1 standard error of the mean (SEM), as this is the error bar type we typically use in other published work. The 95% confidence intervals are approximately equal to 1.96 SEMs, and 68% confidence intervals are approximately equal to 1 SEM. Note that the error bars are for visualization purposes only and do not directly affect the interpretation of results.)
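The correspondence between confidence level and SEM multiples mentioned in the caption can be checked with a short sketch. It assumes a normal approximation and uses only the Python standard library; it does not reproduce how the plotting package itself estimates its intervals, and the function name is hypothetical.

```python
from statistics import NormalDist


def ci_half_width_in_sems(level):
    """Half-width of a two-sided normal confidence interval, in SEM units."""
    if not 0 < level < 1:
        raise ValueError("level must be between 0 and 1")
    # A two-sided interval at `level` leaves (1 - level) / 2 in each tail.
    return NormalDist().inv_cdf(0.5 + level / 2)


z68 = ci_half_width_in_sems(0.68)  # close to 1 SEM
z95 = ci_half_width_in_sems(0.95)  # close to 1.96 SEMs
```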

In the long-term memory recognition task, participants were above chance at distinguishing old arrays from new arrays (Fig. 5C), d’ = 0.38 (SD = 0.41), t(51) = 6.65, p < .001 (one-tailed t-test). Unlike in Experiment 1, participants in Experiment 2 used the confidence scores at unequal rates, F(4,204) = 20.9, p < .001, ηp² = .29. After excluding participants with insufficient data to quantify d’ for high- (> 3) and low- (< 3) confidence trials, 41 participants remained for the analysis of d’ as a function of confidence. We again found that d’ was higher for high-confidence trials (d’ = 0.96, SD = 2.09) than for low-confidence trials (d’ = 0.09, SD = 1.58), t(40) = 2.13, p = .039Footnote 7. Overall, memory performance was significantly above chance for high-confidence (p = .003) but not for low-confidence (p = .36) trials.
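For readers less familiar with the sensitivity measure, d’ in an old-new recognition task is the difference between the z-transformed hit rate and false-alarm rate. The sketch below is a generic illustration under stated assumptions, not the authors' analysis code: the function name and trial counts are hypothetical, and the log-linear correction (adding 0.5 to each cell) is one common way to avoid infinite values that the text does not say was used here.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity (d') for an old-new recognition test.

    Assumption: a log-linear correction keeps the rates away from 0 and 1.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts: 30 old arrays and 30 new arrays.
chance = d_prime(hits=15, misses=15, false_alarms=15, correct_rejections=15)
above_chance = d_prime(hits=20, misses=10, false_alarms=15, correct_rejections=15)
```

Splitting trials by confidence rating (for example, ratings above 3 versus below 3) and calling the function on each subset would give the kind of high- versus low-confidence comparison described above.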

Flexible chunking strategies for different set sizes

We again found signatures of chunking strategies that changed as a function of repetition when analyzing the inter-response times. Here, however, we found that participants grouped their responses into sets of two rather than into sets of three (Fig. 6A). As in Experiment 1, an initial two-way repeated-measures ANOVA with within-subjects factors Response Number and Repetition confirmed that there was a significant interaction of Response Number and Repetition on inter-response times, F(49,2499) = 8.14, p < .001, ηp² = .14. To understand this interaction, we ran follow-up one-way ANOVAs with factor Repetition for each response, and we replotted the data in Fig. 6B. There was a modest effect of repetition on inter-response times for responses 1, 3, and 4 (p < .05, ηp² = .06), and no effect of repetition at response 2. There was a robust effect of repetition on inter-response times for the last four responses (p < .001), which was particularly pronounced for responses 5 through 7 (ηp² = .22–.34). In sum, it seems that participants flexibly adapted their chunking strategy to use groups of two rather than three. Starting at repetition 2, we observed a marked slowing of response times for the fifth response, suggesting that participants began forming a third chunk of two items after just one encounter with the array. In contrast, we did not see significant slowing of the seventh response until repetition 3, suggesting that participants tended to attempt one new chunk formation with each repetition. We repeated the supplemental transition probability analysis, and found results consistent with Experiment 1 (Figs. S2B and S3 (OSM)).
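To make the inter-response time measure concrete, the sketch below converts the cumulative response timestamps from one trial into inter-response times and flags the slowest responses as candidate chunk boundaries. It is only an illustration: the reported analysis relied on ANOVAs across participants, not on this heuristic. The function names and timestamps are hypothetical, and the first inter-response time is assumed to be measured from the onset of the response screen.

```python
def inter_response_times(response_times_ms):
    """Convert cumulative response timestamps (ms) to inter-response times."""
    previous = 0  # assumption: time zero is the onset of the response screen
    irts = []
    for t in response_times_ms:
        irts.append(t - previous)
        previous = t
    return irts


def slowest_responses(irts, n_chunks):
    """Return the 1-based response numbers with the n_chunks longest IRTs."""
    ranked = sorted(range(len(irts)), key=lambda i: irts[i], reverse=True)
    return sorted(i + 1 for i in ranked[:n_chunks])


# Hypothetical trial with eight responses produced as four pairs.
timestamps = [900, 1200, 2100, 2400, 3400, 3700, 4500, 4800]
irts = inter_response_times(timestamps)
boundaries = slowest_responses(irts, n_chunks=4)
```

In this made-up trial the long pauses fall on responses 1, 3, 5, and 7, which is the pattern expected if items are retrieved in groups of two.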

Fig. 6

Inter-response times as a function of response number and repetition number in Experiment 2. (A) Inter-response times plotted with response number on the X-axis and separate lines for repetition. Asterisks indicate that there was a significant effect of repetition on inter-response times for a one-way ANOVA for that response number (* p < .05, ** p < .01, *** p < .001). (B) Inter-response times replotted with repetition on the X-axis and separate sub-plots for each response. Error bars represent 68% confidence intervals (approximately equivalent to 1 standard error of the mean)

Qualitative survey

In a post-experiment questionnaire, we asked participants to describe the strategy that they used to complete the task, whether their strategy changed across repetitions, metacognitive estimates of the number of items they got correct on the first and last repetition, as well as numerical ratings of effort, boredom, drowsiness, enjoyment, frustration, motivation, challenge, and distraction. Note, the first three participants that were run did not complete the survey, leaving N = 49 participants for the survey analyses.

First, we provide an overview of participants’ open-ended responses about strategy use. A total of three raters (two authors) independently rated each survey response for the presence or absence of seven strategy categories. Raters were allowed to endorse more than one category (i.e., for all seven strategies, the rater decided if it was present or absent). The strategy categories included (1) spatial grouping (e.g., “I would focus on remembering two squares at a time”), (2) overt or covert verbal rehearsal (e.g., “I would say the colors out loud to try and memorize the colors”), (3) forming spatial paths (e.g., “I sought to remember squares from left to right then in a clockwise directed oval”), (4) forming verbal paths (e.g., “I also memorized the word sequence, not really the color sequence”), (5) salience (e.g., “...first two squares which caught my attention first”), (6) semantic (e.g., “I paired colors that were associated with each other. For example, red + white + blue = american flag, orange + black = Halloween”), and (7) random (e.g., “Majority of the time I just randomly picked squares to remember and it did not work”).

Inter-rater agreement was good overall, with mean agreement of 88.4% across all strategies (minimum: 78.23% for spatial grouping; maximum: 98.64% for salience). To quantify how common each strategy was, we coded each strategy as present for an individual when at least two out of three raters agreed. This revealed that the most commonly used strategies were spatial grouping (N = 34) and verbal rehearsal (N = 18), followed by forming a spatial path (N = 11), forming verbal paths (N = 5), salience (N = 4), semantic (N = 3), and random (N = 1). Many participants’ responses consisted of a mixture of two or more of these strategies. For example, some participants reported using a mixture of spatial grouping and overt or covert verbal rehearsal (N = 10, e.g., “I split the arrays into pairs of squares; I also repeated the first four squares' colors in my head verbally many times”). Task performance is plotted separately as a function of strategy in Fig. S4 (OSM).
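The two-out-of-three coding rule and a simple agreement score can be sketched as below. This is an illustration under stated assumptions: the text does not give the agreement formula, so the sketch assumes percent agreement pooled over all rater pairs, and the function names and toy ratings are hypothetical.

```python
from itertools import combinations


def majority_present(ratings):
    """Code a strategy as present when at least two of three raters endorse it."""
    return sum(ratings) >= 2


def pairwise_agreement(ratings_by_participant):
    """Percent of rater pairs that agree, pooled over participants.

    Assumption: agreement is defined pairwise; the source does not specify.
    """
    agree = 0
    total = 0
    for ratings in ratings_by_participant:
        for a, b in combinations(ratings, 2):
            agree += int(a == b)
            total += 1
    return 100.0 * agree / total


# Toy ratings for one strategy: one tuple of three rater judgments per participant.
toy = [
    (True, True, True),
    (True, True, False),
    (False, False, False),
    (False, True, False),
]
n_present = sum(majority_present(r) for r in toy)
agreement = pairwise_agreement(toy)
```

Running the same two functions once per strategy category would yield a count of participants coded as using that strategy and an agreement percentage for it.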

Approximately equal numbers of participants said their strategy changed (N = 25) versus did not change (N = 24) from the first to the last repetition. Note, we had intended for this question to reflect changes in strategy across the eight repetitions for each individual array (i.e., whether participants used a different strategy the first time versus the last time they saw a particular array). However, almost all the participants who answered “yes” seem to have interpreted this question as asking whether they changed their strategy from early in the session to later in the session. For example, some participants reported global changes to their strategy over the session, whereby they initially did not have a good strategy for performing the task: “developed strategy more as I went along”; “I tried visualizing the entire screen in the beginning but it was too hard to take in all the squares at one time”; and “strategy varied with motivation.” Other participants reported some fine-grained change to their strategy, or a shift from a more visual to a more verbal strategy as the task progressed: “towards halfway point I changed light blue to teal in my mantra to avoid confusing it with blue square”; “I started to use words instead of trying to memorize colors directly.”

As a group, participants’ metacognitive estimates of the number of items they correctly reported were well-calibrated. Participants estimated that they stored on average 2.31 items (SD = .89) on the first repetition and 6.18 items (SD = 1.76) on the last repetition. In comparison, the ground truth numbers were 2.79 on the first repetition and 6.10 on the last repetition. Note, however, this group-level similarity in estimated and actual performance does not guarantee metacognitive accuracy at the level of individuals (e.g., it may be that no participants were accurate, and those that over- and under-estimated performance canceled each other out). To test the specificity of these estimates for each individual, we also computed correlations and a difference metric. For the correlation, we correlated each participant’s mean self-estimate with their mean performance. For Repetition 1, participants’ average performance correlated with their self-estimate, r = .55, p < 1 × 10⁻⁴, slope = .94, intercept = -.32. For Repetition 8, participants’ average performance also correlated with their self-estimate, r = .76, p < 1 × 10⁻⁹, slope = .89, intercept = 0.79.Footnote 8 For the difference metric, we calculated the absolute value of the difference between each participant’s performance and their self-estimate. For Repetition 1, the mean absolute difference between actual and self-estimated performance was 0.69 items (SD = .56). For Repetition 8, the mean absolute difference between actual and self-estimated performance was 0.82 items (SD = .81).
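The distinction between group-level and individual-level calibration can be illustrated with a small sketch of the two individual-level metrics described above, a Pearson correlation and a mean absolute difference. The function names and values are hypothetical and chosen so that the group means match exactly even though every individual is off by one item.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_absolute_difference(actual, estimated):
    """Average size of each participant's estimation error, ignoring sign."""
    return mean(abs(a - e) for a, e in zip(actual, estimated))


# Hypothetical participants (made-up values, not data from the study).
actual = [2.0, 3.0, 4.0, 5.0]
estimated = [3.0, 2.0, 5.0, 4.0]

r = pearson_r(actual, estimated)
mad = mean_absolute_difference(actual, estimated)
```

Here the group means are identical (3.5 items each), yet the mean absolute difference is a full item, which is why both individual-level metrics are informative alongside the group averages.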

Finally, we quantified the subjective ratings obtained from participants about their state of mind during the task. With 1 being “minimum level of X” and 5 being “maximum level of X”, participants reported having a level of effort of 3.94 (SD = 0.75), a level of boredom of 2.92 (SD = 0.93), a level of drowsiness of 2.63 (SD = 1.24), a level of enjoyment of 2.49 (SD = 0.87), and a level of frustration of 2.31 (SD = 1.23). When asked how motivated they felt to do their best (1 = not at all motivated, 5 = extremely motivated), participants reported a rating of 3.63 (SD = 0.76)Footnote 9. When asked how challenging they found the experiment (1 = not at all challenging, 5 = extremely challenging), participants reported a rating of 3.63 (SD = 1.05). Finally, when asked how distracted they felt by thoughts about their own life while doing the experiment (1 = almost never, 5 = extremely frequently), participants reported a rating of 2.63 (SD = 1.11).

Discussion

To accomplish our goals, we frequently need to use working and long-term memory in tandem. However, in scientific studies of memory, we typically try to study memory systems in isolation. In two behavioral experiments, we used explicit repetitions of memory arrays to study the interaction of working and long-term memory. Like many studies of working memory, here we started with stimuli that were abstract and devoid of any pre-existing long-term memory associations (i.e., unique, arbitrary pairings of colors and locations). To allow for the recruitment of long-term memory, we explicitly repeated each array eight times in a row. Using this method, we were able to watch the interaction of visual working- and long-term memory unfold over time. We found rapid improvement of working memory performance to levels far beyond typical visual working memory capacity limits. On average, we found that working memory performance increased at a rate of around 0.4 items per repetition. However, this improvement in working memory performance over time was non-linear, with participants showing a rapid increase in performance over the first few repetitions (~0.72 items per repetition for Repetitions 1–4) followed by a much slower increase (~0.18 items per repetition for Repetitions 5–8). After only a few repetitions, modal performance was perfect for six and eight item arrays that are typically far beyond working memory capacity (Adam et al., 2015). The rapid recruitment of long-term memory observed here is consistent with prior EEG results demonstrating that participants can flexibly hand off a visual search template from working to long-term memory, and that they can flexibly recruit one or both memory systems depending on task demands (Carlisle et al., 2011; Reinhart & Woodman, 2014).

Maximum achieved performance across the two experiments

In both experiments, we found that participants approximately doubled their initial performance as the array repeated, from 2.8 items on the first repetition to 5.3+ items on the eighth repetition (Exp. 1 = 5.32, Exp. 2 = 6.03). Notably, however, the ceiling was also different in the two experiments, with a maximum number of six correct possible in Experiment 1 and a maximum number of eight correct possible in Experiment 2. In Experiment 1, we speculated that we may not have been able to observe further improvement to performance because of a ceiling effect. This potential ceiling effect may have artificially slowed the observed learning as participants approached the ceiling. However, when we raised the set size to eight items in Experiment 2, group-averaged learning rates were not similarly constrained by the ceiling, since average maximum performance was still around six items. Given this finding, the maximum performance that we observed in Experiment 1 may not have been determined entirely by the ceiling, but may instead reflect the rate of learning that is possible across eight repetitions when using highly similar visuospatial arrays as memoranda.

Although some participants achieved ceiling performance in Experiment 2, there was a wide spread of individual differences such that many participants failed to reach the ceiling. In particular, one factor that may have slowed learning in both experiments was the build-up of interference. Specifically, given a limited set of only nine possible colors, there would have been a high degree of perceptual similarity for the ~30 learned arrays. For example, in Experiment 2 (27 unique arrays), eight out of nine possible colors were used for every array and only the specific color-location pairings distinguished the arrays from each other. Indeed, although participants performed above chance on the old-new recognition task in both experiments, recognition memory performance was fairly low (Exp 1 d’ = .45, Exp 2 d’ = .38). This recognition memory performance is lower than has been observed when participants learn a single critical repeated array (Musfeld et al., 2023a, 2023b; Souza & Oberauer, 2022). For example, Musfeld et al. (2023b) found that recognition memory performance for the single learned array was very high (d’ ≈ 1.2 for the “recall one” learning condition [Rec(1)-Rec(1)], d’ ≈ 2.4 for the “recall all” learning condition [Rec(6)-Rec(6)]).Footnote 10 Thus, we think that the high degree of overlap between the ~30 learned arrays in our study contributed to the fairly low recognition memory performance we observed. Future work manipulating the degree of similarity between learned arrays (e.g., by adding context) will be useful for characterizing how the buildup of interference manifests during the interaction of working- and long-term memory.

Inter-response times show clustering consistent with chunk formation

Finally, we found support for the idea that inter-item response times track the formation of new chunks in visuospatial memory tasks. The inter-response time data support an account whereby participants initially encode a single chunk of two to three items, and then encode new chunks with subsequent repetitions. In Experiment 1, participants initially showed a notable slowing only on the first response; starting on the second repetition, we observed a second slowing at the fourth response, consistent with the creation and retrieval of a second chunk of three items. In Experiment 2, participants instead grouped pairs of items, but they again formed additional chunks only on later repetitions. Note that because the stimuli and the responses were at the same spatial locations on each repetition, it is also possible that rote learning of the motor sequence performed while making responses contributed to our learning effects (Carlson et al., 1993). Future work is needed to quantify the relative contributions of “real” long-term memory chunks versus motor-planning chunks to inter-response time data like ours.

Limitations and future directions

Future work is needed to separate the specific role of explicit repetitions from other aspects of the task that may aid the recruitment of long-term memory. For example, here we chose to use a whole-report task, which requires participants to make a response for every item in the array. Whole-report tasks provide a lot of retrieval practice, and they may also encourage participants to effortfully encode items because they expect a difficult test. Prior work has found that these particular task factors are key for observing significant learning across incidental repetitions of visual arrays (Musfeld et al., 2023a, 2023b; Souza & Oberauer, 2022). With both explicit repetitions and a whole-report task, we found rapid improvement in performance. Interestingly, this improvement mirrors analyses of individuals’ learning curves conducted in work by Musfeld and colleagues (2023a) using incidental repetitions of a single critical repeated array (the “Hebb array”) and a whole-report task. When analyzed at a group level, incidental learning of the Hebb array appeared to be gradual. However, when analyzed at the individual level, learning appeared to follow a two-step process, with a period of no learning followed by a period of rapid improvement. Critically, the onset of the rapid learning period was related to participants explicitly recognizing that the critical Hebb array was being repeated. In sum, prior results nicely mirror our findings about the ability of explicit repetitions to shape learning (Huang & Awh, 2018; Musfeld et al., 2023a, 2023b; Ngiam, Brissenden, et al., 2019a, 2019b). Future work will be useful to directly compare (1) the onset of learning in an incidental versus explicitly instructed learning context, and (2) the effect of making one response versus multiple responses during incidental and explicit learning (Heinen et al., 2022; Musfeld et al., 2023b).

A number of factors could lead to the performance improvement that we observed, including encoding time, elaborative encoding, and retrieval practice. By the end of the eight repetitions, participants had viewed the stimuli for a total of 1,200 ms (150 ms x 8), and they encoded, remembered, and retrieved the items eight different times. Based on prior work, we think that the amount of encoding time, alone, is unlikely to explain the improvements that we observed. For example, in prior behavioral and EEG work using visuospatial working memory tasks, performance was no better when participants were given 200 versus 2,000 ms to encode a visuospatial array (Brady et al., 2016; Tsubomi et al., 2013)Footnote 11. In contrast, there is ample evidence that retrieval practice and testing robustly improve performance in typical free recall tasks (e.g., Rowland, 2014). Based on a qualitative examination of the survey that we administered, we found that while participants reported using many different strategies, the majority of participants used a spatial grouping strategy (remembering pairs of items), a verbal rehearsal strategy (overtly or covertly repeating color names) or a combination of these two strategies. However, few participants reported using a strategy that relies on elaborative encoding of the semantic associations of the colored squares (e.g., flag colors). Taken together, we would speculate that retrieval practice and chunking, as opposed to encoding time and semantic associations, most strongly contributed to the performance improvements in our data. In future work, it would be useful to directly manipulate the strategy assigned to participants, to quantify how semantic associations may further boost the rate of learning when participants are instructed to use elaborative versus rote rehearsal strategies (e.g., Craik & Lockhart, 1972).

Summary

In sum, disentangling how working and long-term memory interact is difficult because they so readily collaborate with one another. Yet, carefully characterizing this interaction is key for understanding how memory functions in realistic settings. Here, we introduced a controlled opportunity for long-term memory to assist working memory by combining abstracted stimuli with explicit repetitions of memory arrays. Participants were successful at rapidly recruiting long-term memory to assist performance – they reached perfect performance for supra-capacity arrays after only a few repetitions. This rapid learning for even highly similar and arbitrary arrays is illustrative of how difficult it is to get a “process pure” measure of working memory in the absence of any support from long-term memory. Looking forward, we think that leaning in to the natural collaboration of working- and long-term memory will be key for furthering our understanding of each memory system, and there is much future work to be done to build this understanding (e.g., probing neural measures, strategy manipulations, and task constraints).