Introduction

The human brain has evolved to learn in information-rich environments in which integration of information from multiple sensory inputs is ubiquitous. While perception and cognition are often studied within one sensory modality at a time, there is growing evidence that much of the neocortex is inherently multisensory (Ghazanfar & Schroeder, 2006; Murray et al., 2016) and that multisensory processing is behaviorally advantageous, particularly when the sources are temporally and semantically related (Diaconescu et al., 2011). Multisensory facilitation can benefit subsequent unisensory perceptual processing (Kim et al., 2008a; Shams et al., 2011; von Kriegstein & Giraud, 2006) and memory retrieval of unisensory information (Heikkilä et al., 2017; Lehmann & Murray, 2005; Murray et al., 2004; Thelen et al., 2015). Models suggest that multisensory facilitation can arise through cross-sensory connections between unisensory representations and/or feedback to unisensory representations supporting stronger encoding in those regions as well as by modification or formation of multisensory representations, so that later presentation of unisensory stimuli activates an expanded, multisensory network of brain regions (Shams & Seitz, 2008).

Working memory (WM) is an example of a system that is thought to be multisensory in nature (Quak et al., 2015). Indeed, parts of the WM circuitry such as the intraparietal and dorsolateral prefrontal cortices have been linked to multimodal maintenance of information (Cowan et al., 2011). WM training has been shown to improve performance on tasks that rely on short-term and WM components, particularly when the untrained tasks share similar processes with the trained task (Holmes et al., 2019). It has been suggested that transfer effects only occur if the training and transfer tasks engage specific overlapping brain regions and processes (Dahlin et al., 2008) and that paradigm-specific effects may reflect changes in strategies developed throughout training, rather than an improvement in the efficiency of WM (Forsberg et al., 2020). It is worth noting that the mechanisms leading to transfer, and the extent of transfer beyond the domain of WM, are topics of substantial controversy, with even meta-analyses reaching inconsistent conclusions (Au et al., 2015; Melby-Lervåg et al., 2016; Schwaighofer et al., 2015; Soveri et al., 2017; Weicker et al., 2016); however, this goes beyond the scope of the current paper.

In the present paper, we address issues of how to promote learning and (near) transfer within the same cognitive domain, namely, WM. A WM training protocol that incorporates multisensory objects would not only provide redundancy of information but might further enable stimuli to be encoded into richer, multisensory representations (Kim et al., 2008b; Shams & Seitz, 2008), which may facilitate learning and transfer (Deveau et al., 2014). However, WM training protocols tend to be restricted to the visual domain and/or are not specifically designed to induce multisensory facilitation. For example, many extant procedures, such as the dual n-back, require each modality stream to be processed separately, rendering multisensory integration disadvantageous to task performance. While this type of dual-task performance is certainly demanding, it was later shown that dual n-back training does not lead to greater learning and transfer compared to the single n-back training (Jaeggi et al., 2010; Jaeggi et al., 2014). To date, there are no procedures that specifically implement multisensory facilitation, and thus, the assumption that multisensory training can benefit WM training outcomes is yet to be tested.

Here we test the hypothesis that multisensory WM training will facilitate transfer to untrained tasks within the WM domain compared to visual WM training alone or training that contains visual and auditory stimuli but is presented as separate WM tasks. To address this hypothesis, we randomly assigned participants to one of 3 n-back training conditions and compared their learning and transfer to untrained WM tasks. The conditions consisted of vision-only training, training that alternated between auditory and visual stimuli, or multisensory training where auditory and visual stimuli were presented concurrently. In the latter condition, pairs of visual-auditory stimuli were constant and congruent; for example, a picture of a bell was always accompanied by the sound of a bell. The performance of the three interventions was further compared with that of a passive control group. These conditions were chosen because they represent commonly used WM training protocols, with visual-only being the most common one, and very few studies include auditory-only WM training (Pergher et al., 2019).

Methods

Participants

Undergraduates from UC Riverside (UCR) and UC Irvine (UCI) were recruited to participate in the study via flyers, e-mails, and word of mouth between Fall 2018 and Fall 2019. A total of 306 students completed a consent form to participate in the study, 240 of whom were subsequently enrolled based on their availability (5 times a week for 1 h). Participants were randomly assigned to one of 4 groups: 3 active training groups and a passive control group (see Table 1 for demographics). Forty-two participants were dropped due to attrition or technical errors (see Fig. 1): 6 from the Multisensory group, 10 from the Visual Only group, 15 from the Alternating group, and 11 from the Passive control group. All procedures were approved by the UCR and UCI Institutional Review Boards. Participants provided informed consent and received monetary compensation.

Table 1 Demographics of the four groups
Fig. 1 Enrollment

Procedure

Training and assessments were administered on tablet computers via software developed in-house (“Recollect the Study”; available on Google Play and Apple App Store). In order to prevent fatigue associated with long assessment sessions, participants completed 3 sessions of pretest and 3 sessions of posttest separated by at least 8 working days. The active groups conducted training tutorials and practice (day 1), short 20-min training sessions (days 2, 3, 12, and 13) to ease into training, as well as full 40-min training sessions on days 4–11, for a total of 400 min of training (excluding day 1 practice), whereas the passive control group only took part in the assessments (see Fig. 2). An added benefit of administering testing sessions over 6 days is management of expectations of the passive control group (Green et al., 2019). To further probe expectations, at the end of the study (after the final post-test session), participants responded to the following question using a 5-point Likert scale: “Do you think that the sessions you completed during the study helped you perform better on any of the tasks you completed in the last 3 days?”, wherein 1 = not at all, 2 = not really, 3 = cannot say, 4 = quite a bit, and 5 = very much. If the answer selected was 3 or higher, they were also asked to select which task(s), if any, they thought they had performed better on because of earlier sessions.
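The training dose stated above can be checked with a short calculation. The following is a minimal sketch rather than part of the study software; the names `SHORT_DAYS`, `FULL_DAYS`, and `training_minutes` are illustrative assumptions.

```python
# Study days with short (20-min) and full (40-min) training sessions, as
# described in the Procedure. Day 1 (tutorial and practice) is excluded.
SHORT_DAYS = (2, 3, 12, 13)
FULL_DAYS = range(4, 12)  # days 4 through 11


def training_minutes() -> int:
    """Total scheduled training time in minutes across days 2-13."""
    schedule = {day: 20 for day in SHORT_DAYS}
    schedule.update({day: 40 for day in FULL_DAYS})
    return sum(schedule.values())


print(training_minutes())  # 4 * 20 + 8 * 40 = 400
```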

Fig. 2 Timeline of training and testing for passive and active groups. A 5- to 7-min break was given between testing and training blocks and between two 20-min training blocks.

Training

The adaptive n-back training game was designed based on principles thought to promote learning (Deveau et al., 2014; Mohammed et al., 2017). In the “Visual Only” condition, all visual stimuli were paired with an identical auditory cue, representing a single n-back training protocol commonly used in the literature (Heinzel et al., 2014; Jaeggi et al., 2010; Küper & Karbach, 2016; Miró-Padilla et al., 2019). An “Alternating” visual/auditory condition consisted of 2 unisensory n-back blocks per day: visual n-back with a placeholder sound and auditory n-back with a visual cue, the order of which was counterbalanced across days (Fig. 2). In the “Multisensory” training condition, each n-back visual stimulus type was paired with a different matching sound.

Transfer Tasks

To assess near transfer, untrained n-back tasks and three other WM tasks were administered at pre- and post-test. Table 2 shows a correlation matrix of baseline performance on transfer tasks. While this dataset is part of a larger study in which a number of other cognitive tasks were administered, here we test the specific hypothesis that multisensory facilitation improves WM training outcomes. For all assessments, the app provided instructions, examples, and performance-gated practice trials with feedback.

Table 2 Spearman’s rank correlation coefficients of performance on working memory measures at pre-test (N = 172)

Untrained n-back

A visual n-back task with two versions, featuring pictures of animals or pictures of vehicles (counterbalanced across participants), was administered to test generalization to untrained stimuli in a non-game setting. On a given trial, a picture appeared on a black screen for 2500 ms with a 500-ms ISI, and participants were asked to tap the picture if it matched the n-back rule. Each block consisted of 30 + N trials with 9 targets (30% target and 30% lure rates). All participants completed N-levels 1–3 but could progress to 4-back (and beyond) if no more than 2 errors were made on the previous level. The main dependent measure was the highest N-level reached at pretest and posttest.
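The n-back rule and the level-progression criterion described above can be sketched as follows. This is an illustration under stated assumptions, not the app's implementation: the function names are hypothetical, the "+ N" in the block length is interpreted as N lead-in trials that cannot be targets, and the source does not specify how errors were counted.

```python
from typing import List, Sequence


def target_positions(stimuli: Sequence[str], n: int) -> List[int]:
    """Indices of trials whose stimulus matches the one shown n trials earlier."""
    return [i for i in range(n, len(stimuli)) if stimuli[i] == stimuli[i - n]]


def block_length(n: int, scored_trials: int = 30) -> int:
    """Block length of 30 + N trials (assumed: N lead-in trials plus 30 scored)."""
    return scored_trials + n


def may_advance(errors_on_level: int, max_errors: int = 2) -> bool:
    """Progression beyond the mandatory levels: no more than 2 errors."""
    return errors_on_level <= max_errors


# Illustrative 2-back sequence (hypothetical stimuli).
sequence = ["cat", "dog", "cat", "dog", "owl", "dog"]
print(target_positions(sequence, n=2))  # [2, 3, 5]
print(block_length(3))                  # 33
```

With 9 targets among the 30 scored trials, the target rate is 9/30 = 30%, which matches the rate reported in the text.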

WM Measures

Three WM tasks were used to assess transfer of n-back training. A simple span task, Corsi Blocks, was used to assess WM storage, and two complex span tasks, Sequencing and Symmetry Span, were used to measure WM storage-and-processing (Cowan, 2008). In Corsi Blocks, participants viewed a sequence of 12 gray squares turning blue and were asked to reproduce that sequence by tapping on the displayed squares in the same order in which they appeared. Sequencing is a tablet version of the WMS-III Letter-Number Sequencing test (Wechsler, 1997), in which participants see a mixed sequence of letters and numbers appearing one by one (e.g., “5K8G2”), and are then prompted to enter the numbers in numerical order (“258”) followed by letters in alphabetical order (“GK”). A given sequence did not include any of the characters in the previous trial or consecutive numbers and letters. Symmetry Span is a tablet-based version of Automated Symmetry Span (Unsworth et al., 2009), in which participants are required to recall sequences of red squares in a 4 × 4 matrix while performing a symmetry judgment task.

In all three span tasks, the ISI was 500 ms while stimulus duration was 1000 ms for Corsi Blocks and Sequencing, and 650 ms in Symmetry Span. All three tasks had 2 trials per set size and started at the lowest set size of 2 items. The next set size (e.g., 3 items) was displayed if at least 1 trial was correct, and the task ended when both trials in a set size were incorrect. The highest possible set size was 10 for Corsi Blocks and 15 for the Symmetry Span and Sequencing tasks, ensuring that high performing individuals and/or training-related improvements were not masked by ceiling effects. An individual’s span was defined as the highest set size at which at least 1 of the 2 trials was correct.
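The stopping rule and span definition above can be expressed as a short procedure. This is a minimal sketch, assuming a hypothetical callback `trial_correct(set_size, trial_index)` that reports whether a trial was answered correctly. The source does not state how a participant who fails both trials at the lowest set size was scored, so the sketch returns `None` in that case.

```python
from typing import Callable, Optional


def run_span_task(
    trial_correct: Callable[[int, int], bool],
    min_size: int = 2,
    max_size: int = 10,
    trials_per_size: int = 2,
) -> Optional[int]:
    """Return the highest set size at which at least one trial was correct."""
    span = None
    for size in range(min_size, max_size + 1):
        results = [trial_correct(size, t) for t in range(trials_per_size)]
        if not any(results):
            break  # both trials incorrect: the task ends
        span = size  # at least 1 of 2 correct: advance to the next set size
    return span


# Hypothetical participant: reliable up to 5 items, one success at 6 items.
def simulated(size: int, trial: int) -> bool:
    return size <= 5 or (size == 6 and trial == 0)


print(run_span_task(simulated))  # 6
```

Setting `max_size=15` corresponds to the ceiling reported for the Symmetry Span and Sequencing tasks.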

Data Analysis

IBM SPSS Version 24 (IBM Corp.) and JASP 0.9.2 (JASP Team, 2019) were used for statistical analyses. For untrained n-back, data were available for all 198 participants and no outliers were removed based on |z| > 3. For the three span tasks, task data were removed for participants whose individual span was at the minimum (span = 2) at either pretest or posttest, indicating poor understanding of task demands or inattention. For Corsi Blocks, 1 participant had missing data, but no data points were removed. For Sequencing and Symmetry Span, data were available for 196 and 191 participants, respectively, 15 of which were removed from each group (see Table 3 for sample sizes per group). The highest span achieved at pretest was 10 for Corsi Blocks, 9 for Sequencing, and 8 for Symmetry Span.

Table 3 Descriptive statistics and within-group Pearson’s r, Cohen’s d and Bayes Factor for performance on pretest and posttest on transfer tests

Gain scores were obtained by subtracting pretest from posttest scores. Since the gain scores were not normally distributed, nonparametric tests were conducted on all transfer tasks. As a first step, we investigated whether the gain itself (pre- vs. posttest change) was significant in any of the groups. If at least one group showed significant gain, Kruskal-Wallis H tests were used to investigate how the four groups, on average, differed in gains, followed by Mann-Whitney U tests.
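The gain-score computation and the Kruskal-Wallis H statistic can be illustrated with a small, self-contained sketch. The analyses reported here were run in SPSS and JASP; the code below is not that implementation, the function names are illustrative assumptions, and the example data are made up. It uses the standard formula H = 12 / (N(N + 1)) · Σ(R_i² / n_i) − 3(N + 1), divided by the usual tie correction.

```python
from typing import Dict, List, Sequence


def gain_scores(pre: Sequence[float], post: Sequence[float]) -> List[float]:
    """Posttest minus pretest, per participant."""
    return [b - a for a, b in zip(pre, post)]


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks 1..N, with tied values receiving the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Tie-corrected Kruskal-Wallis H for two or more independent groups."""
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    counts: Dict[float, int] = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1 - ties / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Made-up data for illustration only.
print(gain_scores([3, 4, 5], [4, 4, 7]))                      # [1, 0, 2]
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))    # approximately 7.2
```

H is referred to a chi-square distribution with k − 1 degrees of freedom to obtain a p value; a library routine such as `scipy.stats.kruskal` provides both the statistic and the p value.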

Results

Training

Participants in the Visual Only and Multisensory groups showed similar training progress, outperforming the Alternating group towards the end of training. In Fig. 3, visual and auditory n-back blocks of the Alternating group are shown separately. The highest N-level gain, calculated as the average N-level in the last session minus the average N-level in the first session, was observed for the Multisensory group (M = 1.95, SD = 0.21), followed by the Visual Only (M = 1.75, SD = 0.18) and Alternating groups (M = 0.99, SD = 0.12). An independent-samples Kruskal-Wallis test showed that the 3 groups differed significantly in N-level gain (H(2) = 15.82, p < 0.001). Pairwise comparisons with adjusted p values showed no difference in training gain between the Multisensory and Visual Only groups (p = 1.0, r = 0.03). In contrast, the Multisensory group achieved significantly higher N-level gain than the Alternating group (p = 0.001, r = 0.36), and the same pattern was observed for the Visual Only group (p = 0.003, r = 0.33).

Fig. 3 Mean N-level achieved on a given day, split by visual and auditory blocks for the Alternating group. A dip on day 8 reflects transition from less abstract stimuli to more abstract stimuli (equivalent across groups).

Transfer

There is no evidence that the 4 groups differed significantly at baseline on any of the measures as indicated by Kruskal-Wallis one-way tests, which was expected given the random assignment to groups (see Table 4).

Table 4 Kruskal-Wallis H tests to assess potential group differences at pretest

Untrained n-back

Wilcoxon Signed Rank Tests demonstrated that, as expected, all groups except the control group reached significantly higher N-levels on the untrained tasks at posttest relative to pretest (Visual Only: z = − 4.89, p < 0.001; Alternating: z = − 4.36, p < 0.001; Multisensory: z = − 4.12, p < 0.001; Control: z = − 1.00, p = 0.32). The groups differed significantly in N-level gain as shown by a Kruskal-Wallis test (H(3) = 31.07, p < 0.001; see Fig. 4a), and each active group gained more than the Control group: Visual Only vs. Control (U = 568.50, z = − 5.42, p < 0.001, r = − 0.54), Alternating vs. Control (U = 711.00, z = − 4.51, p < 0.001, r = − 0.45), and Multisensory vs. Control (U = 730.00, z = − 4.26, p < 0.001, r = − 0.43). In the active sample (N = 148), N-level gain throughout training showed a significant positive correlation with pre-/post-N-level gain on untrained N-back tasks (rho = 0.45, p < 0.001), which was also observed separately for each group (Visual Only: rho = 0.41, p = 0.003; Alternating: rho = 0.30, p = 0.03; Multisensory: rho = 0.59, p < 0.001).
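For readers who wish to relate the reported z and r values, the sketch below assumes the common convention r = z / √N for nonparametric tests, where N is the total number of observations in the comparison. Whether this is the exact definition used in the present analyses is an assumption, and the N in the example is a hypothetical round number, not a reported sample size.

```python
import math


def effect_size_r(z: float, n_observations: int) -> float:
    """Effect size r = z / sqrt(N) (assumed convention)."""
    return z / math.sqrt(n_observations)


# Hypothetical N of 100 observations, paired with a z value from the text.
print(round(effect_size_r(-5.42, 100), 2))  # -0.54
```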

Fig. 4 Difference scores for each group: V = Visual Only, A = Alternating Audio-visual, M = Multisensory, and C = Control (Passive); color coded as red, blue, green, and gray, respectively. (a) Difference between highest N-level achieved at post-test relative to pre-test on an untrained visual N-back task. (b–d) Change in span on Forward Corsi Blocks, Sequencing, and Symmetry Span, respectively. Error bars represent S.E.M. * p < 0.05, ** p < 0.01. Stars without lines indicate significant difference at post-test relative to pre-test in each group (Wilcoxon signed-rank test), whereas stars with lines indicate significant differences in change scores in one group relative to another (Kruskal-Wallis test).

Corsi Blocks

We next examined transfer to Corsi Blocks Forward, a measure of WM capacity. The change in span from pre- to posttest was not significant in any of the groups as indicated by Wilcoxon Signed Rank Tests (Visual Only: z = − 0.30, p = 0.77; Alternating: z = − 1.76, p = 0.08; Multisensory: z = − 1.17, p = 0.24; Control: z = − 1.15, p = 0.25); hence, no further analyses were conducted (Fig. 4b).

Sequencing

Wilcoxon Signed Rank Tests demonstrated significant gain in sequencing span for the Alternating (z = − 2.21, p = 0.03) and Multisensory (z = − 2.85, p = 0.004) groups, whereas the Visual Only (z = − 1.37, p = 0.17) and Control groups (z = − 1.84, p = 0.07) did not show significant gain (Fig. 4c). Overall, the groups did not differ significantly in span gain (H(3) = 4.27, p = 0.23); however, there was a trend for the Multisensory group to show larger span gain compared to the Visual Only group (U = 759.50, z = − 1.78, p = 0.075, r = − 0.19) and to the Control group (U = 778.50, z = − 1.77, p = 0.077, r = − 0.19). Training gain did not predict span gain (rho = 0.01, p = 0.96) in the active sample (N = 134).

Symmetry Span

The only group that showed significant improvement in symmetry span was the Multisensory group (z = − 3.31, p = 0.001), whereas the other groups did not improve (Visual Only: z = − 0.16, p = 0.88; Alternating: z = − 0.58, p = 0.56; Control: z = − 1.34, p = 0.18). While a Kruskal-Wallis test did not show significant differences in span gain among the four groups (H(3) = 6.33, p = 0.10), a comparison of gain in the three active groups was marginally significant (H(2) = 5.96, p = 0.05), prompting comparisons between pairs of active groups. As can be seen in Fig. 4d, the Multisensory group showed higher span gain than the Visual Only group (U = 667.50, z = − 1.96, p = 0.05, r = − 0.25) and the Alternating group (U = 634.00, z = − 2.26, p = 0.03, r = − 0.21), although the effect sizes were small. Moreover, training N-level gain showed a significant positive correlation with symmetry span gain in the active sample (rho = 0.23, p = 0.01, N = 128); however, this was not observed for the individual groups (Visual Only: rho = 0.11, p = 0.47; Alternating: rho = 0.20, p = 0.20; Multisensory: rho = 0.25, p = 0.13).

Self-Reported Transfer

Out of 182 participants who answered the question “Do you think that the sessions you completed during the study helped you perform better on any of the tasks you completed in the last 3 days?”, the most common answer across all four groups was 4 (“Quite a bit”). The highest average score was reported by the Multisensory group (M = 3.74, SD = 0.93, N = 46), followed closely by Visual Only (M = 3.71, SD = 0.76, N = 45), Alternating (M = 3.52, SD = 0.85, N = 48), and the Passive control group (M = 3.47, SD = 0.80, N = 43). A Kruskal-Wallis H test indicated that the four groups did not differ significantly in their answers to this question (H(3) = 5.33, p = 0.15). These results provide tentative evidence that all four groups, including the passive control group, perceived a similar rate of change in post-test performance as a result of participating in the study. Figure 5 shows the percentage of participants per group who selected a given task as one on which they thought they had improved as a result of participating in the study. Note that this question was only presented to participants who answered 3 (“Cannot say”) or higher on the previous question, i.e., to 163 participants (89.6%), and that they were not required to select any tasks.

Fig. 5 Self-reported improvement on tasks at post-test as a result of earlier sessions, shown as percentage per group; V = Visual Only, A = Alternating Audio-visual, M = Multisensory, and C = Control (Passive).

Discussion

Based on the rich literature on benefits of multisensory exposure to learning and memory (Matusz et al., 2017; Quak et al., 2015; Shams & Seitz, 2008), it was hypothesized that WM training could be facilitated by introducing multisensory objects in an n-back training task. In order to test this, performance on multisensory n-back training was contrasted against Visual Only and Alternating audio-visual n-back training, as well as a passive control group. Results showed that participants in the Multisensory group had equal N-level gain compared to Visual Only training yet showed significant transfer to some untrained WM tasks, whereas the Visual Only condition did not. This is in line with studies suggesting that multisensory experience can facilitate memory (Thelen et al., 2014) and learning (Shams & Seitz, 2008). In fact, it has been demonstrated that even a single multisensory exposure can influence memory for visual and auditory objects (Matusz et al., 2017). Multisensory cues can also improve delayed retention of incidental learning (Broadbent et al., 2019) and aid perceptual learning (Seitz et al., 2006). The main question of interest here, however, is whether multisensory training can boost near transfer.

As expected, all three active training groups improved on an untrained N-back task compared to the control group. Training N-level gain predicted transfer to untrained N-back tasks in the active sample irrespective of condition. There was no evidence of training-related enhancement of WM capacity as measured by a simple span task for any of the groups, possibly due to the higher capacity used in this task compared to that experienced during training (e.g., typical N-back span was 4 whereas typical Corsi span was 7). However, the Multisensory group showed significant performance improvement on complex span tasks, particularly symmetry span, and to some extent, sequencing. Training on a task that requires rapid updating of temporary bindings in WM may have strengthened the processes involved in manipulation of information that are crucial for performance on complex span tasks. Indeed, research indicates that updating tasks, complex span tasks, and binding tasks share a large proportion of variance (Wilhelm et al., 2013).

We note that a key component of the Multisensory training employed here is that the auditory and visual stimuli at each time-point supported the same task response. This is a key element of multisensory facilitation (Shams & Seitz, 2008) and is in contrast to dual WM tasks in which participants have to perform a secondary task while maintaining information in WM (Fougnie & Marois, 2011), or dual N-back tasks that require updating of two separate streams of stimuli, one in the visual and the other in the auditory domain. We suggest that while these dual tasks may give rise to some benefits of multi-tasking (Anguera et al., 2013), they put the senses in competition with each other and do not give rise to facilitation (Deveau et al., 2014).

While the current study examined transfer to a variety of visual working memory tasks, we note that the question of transfer to auditory or multisensory working memory tasks has not been addressed. Further work will be required to test the hypothesis that the multisensory condition would also be advantageous for transfer to working memory tasks that involve auditory, and perhaps other, stimulus modalities. We also acknowledge a limitation in that we did not directly assess placebo effects and/or enjoyment; however, posttest self-report results indicated that the four groups perceived a similar rate of transfer overall, meaning that even the passive control group felt some improvement, likely due to retest effects. When participants were asked whether they observed improvement on specific tasks, the pattern of results mirrored the measured transfer effects, indicating good insight into individual outcomes of the study (Tsai et al., 2018). At present, the data on whether expectation-based effects can drive transfer effects are extremely limited and mixed (Green et al., 2019); hence, this should be addressed by future research. A further limitation of the study is the use of a passive control group. While passive control groups only control for retest effects, active control groups can help manage participants’ expectancies, equalize participant-experimenter contact time, and reduce demand characteristics (Boot et al., 2013). On the other hand, designing an appropriate control training protocol that does not rely on WM is no trivial task, and certainly not as easy as administering a placebo drug. Furthermore, a recent meta-analysis did not find any evidence that control group type meaningfully impacts effect sizes from cognitive measures (Au et al., 2020). Even though the use of active control groups is still strongly recommended, research involving passive control groups may still be informative and useful, especially in early, proof-of-concept studies such as this one (cf. Green et al., 2019).

In summary, the present results suggest that incorporating multisensory objects in a WM training protocol can benefit performance on the training task compared to training WM in each sense separately and that multisensory training can potentially facilitate transfer to complex WM span tasks. To gain a better understanding of multisensory facilitation of WM training, future research should include an auditory-only training group, auditory WM transfer tests, and explore whether multisensory facilitation affects far transfer tasks as well. Evidence of multisensory facilitation of WM training could inform development of training protocols, particularly those targeting older adults, where supporting learning via multiple senses may be advantageous to those who experience transient deficits in either sense.