Introduction

In the context of theories in psychology, Mischel referred to the presence of a ‘toothbrush problem’—i.e., that “no self-respecting person wants to use anyone else’s” (Mischel 2008). In this review, we show that experimental paradigms in motor learning suffer from a similar toothbrush problem; tasks used are extremely fragmented, with little to no overlap between different studies. We argue that this extreme fragmentation has negative consequences for the field from both theoretical and methodological perspectives. Finally, we propose the use of common ‘model task paradigms’ to address these issues.

How fragmented is motor learning?

Review of motor learning studies

To quantify the degree of fragmentation in the motor learning literature, we selected publications from 2017 and 2018 in the following five journals: Experimental Brain Research, Human Movement Science, Journal of Motor Behavior, Journal of Motor Learning and Development, and Journal of Neurophysiology. We included a study in our analysis if (i) there was a change in motor performance arising from practice, (ii) there were at least two a priori defined groups that were compared on the same task in a between-subject design, and (iii) the population studied was unimpaired adults. Overall, 64 papers were included based on these criteria.

These criteria were based on the following rationale. Our focus was on motor learning studies with a behavioral emphasis, which led us to select journals that publish regularly on this theme. The 2-year timespan was chosen to provide a reasonable sample of studies (our goal was at least 50). Because one of our particular emphases was to examine issues related to design and sample size, the two-group minimum requirement was used so that our analysis could focus on the more common ‘between-group’ designs (rather than ‘within-subject’ designs). Moreover, studies with within-subject designs in motor learning typically describe changes associated with learning (e.g., changes in EMG or the use of redundancy) rather than compare different practice strategies, which makes them less relevant to the central arguments about fragmentation being advanced here. Finally, because tasks often have to be modified for children or for adults with movement impairments, we focused on studies with unimpaired adults to estimate fragmentation in cases where the tasks do not necessarily have to be modified.

For each study, we examined the task used and classified it into one of six categories—adaptation, applied, coordination, sequence, tracking, and variability. This classification was based on prior work that has highlighted such distinctions (e.g., scaling vs. coordination (Newell 1991), or adaptation vs. skill learning (Krakauer and Mazzoni 2011)); however, these prior distinctions have typically been dichotomous, so we expanded them into six categories to more accurately group the types of tasks in the studies reviewed. The broad rules for each of these categories are specified in Table 1. Five of these categories were based on the type of learning involved in the task (adaptation, coordination, sequence, tracking, and variability), whereas the ‘applied’ category was used for studies in which the task itself was central to the research question or was used without modification from its implementation in real-world settings. For example, a golf putting task in a lab was categorized under ‘variability’, but a study examining professional golf players putting on an actual green was categorized as ‘applied’. We adopted this category because we felt that these studies, where the task is integral to the research question, are less likely to benefit from the development of a common task paradigm. Although these categories were not always mutually exclusive, for the sake of this review we classified each study into only one category; conflicts in categorization were resolved by the authors through discussion.

Table 1 Definition of task categories

We then examined the actual tasks used in each category to determine whether these studies used the same experimental paradigm. For each task, we extracted relevant parameters reported in the Methods section regarding the implementation of the task. For example, when reaching was used as a task in the ‘adaptation’ category, we compared experimental parameters such as the type of perturbation, the amplitude of the reach, the number of targets, and the instructions to the participant (such as whether they had to stop inside the target or shoot through it). Similarly, for a sequence learning task, we examined the number of elements in the sequence, how many fingers were used, and the instructions to the participant (such as “do a fixed number of sequences” vs. “do as many sequences as possible in a fixed time”). It is important to note that these coded features were based only on the task itself, and did not include differences in protocol information (such as the amount or duration of practice). Based on this information, a task paradigm was classified as ‘unique’ if it did not match any other task paradigm in its category (a simple sketch of this classification logic follows). In addition to the task paradigm, the sample size per group was also calculated; if there were multiple experiments, it was computed as the total sample size divided by the total number of groups. This information for each task, along with the coded features, is summarized in Table 2.
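To make this ‘unique vs. matching’ classification concrete, the following sketch is a hypothetical illustration of the coding logic (not the actual script or data used for this review): each study’s coded task features are treated as a tuple, and a paradigm is flagged as unique only if no other study in the same category shares an identical feature set.

```python
# Hypothetical sketch: classifying task paradigms as 'unique' within a category.
# Feature values below are illustrative placeholders, not data from the reviewed studies.
from collections import Counter

studies = [
    {"id": "S01", "category": "adaptation",
     "features": ("visuomotor rotation", "30 deg", "8 targets", "stop in target")},
    {"id": "S02", "category": "adaptation",
     "features": ("visuomotor rotation", "45 deg", "1 target", "shoot through")},
    {"id": "S03", "category": "sequence",
     "features": ("key presses", "5 elements", "4 fingers", "fixed number of sequences")},
]

# Count identical feature sets within each category
counts = Counter((s["category"], s["features"]) for s in studies)

for s in studies:
    is_unique = counts[(s["category"], s["features"])] == 1
    print(s["id"], "unique paradigm" if is_unique else "matches another study")
```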

Table 2 Review of motor learning studies from selected journals in 2017–2018

Overall, we found that the majority of motor learning studies were on adaptation (36%), followed by coordination (17%), sequence learning (16%), variability (12%), tracking (11%), and applied tasks (8%) (Fig. 1a). Group sample sizes were typically in the range of 10–16 participants (25th–75th percentile) (Fig. 1b). Critically, we found that out of the 64 studies reviewed, there were 62 unique task paradigms, a fragmentation of ~ 97% (Fig. 1c). In fact, of the two matches found, one was from the same group of authors, and the other used a standard commercially available device. Although many of the studies used task paradigms belonging to what could be considered the same task category (e.g., visuomotor adaptation in reaching), they varied in the implementation of the task in terms of relevant task parameters (see Table 2). These results highlight the high degree of fragmentation in task paradigms in the field.

Fig. 1
figure 1

Fragmentation of task paradigms in motor learning. a Distribution of task categories across all studies—studies were divided into six categories based on the criteria shown in Table 1. b Distribution of sample size/group across all studies. The majority of studies had between 10 and 16 participants per group. c Plot of the number of studies reviewed in the current paper on the x-axis against the number of uniquely different tasks on the y-axis. The two dots highlight the two studies that were not classified as unique. The almost perfect diagonal line indicates that there was little to no overlap in tasks between studies across all six categories. d Variability in putt distance across golf putting tasks shows fragmentation of task paradigms even within the same task. The peak at 3 m is almost exclusively driven by a single author group

Review of task fragmentation within the same task—golf putting

One possible reason for the high level of task fragmentation could be that motor learning is a diverse field and tasks are tailored to specific research questions. Therefore, to examine whether task fragmentation exists even when researchers seemingly choose the same task, we performed a second analysis focused on studies that all used a relatively common task—golf putting. We used the search phrases ‘golf’ and ‘motor learning’ in Web of Science, and examined studies from 2013–2018 (we widened the timeframe to increase the number of articles included). The same inclusion criteria were used as before, and we focused only on putting studies (e.g., studies on the golf swing or chip shots were not included). The relevant parameters extracted for this task were the putt distance, target type, target size, surface type, and scoring system (Table 3). The scoring system, while not strictly part of the task itself, was included as a factor because it has a direct influence on how results from different studies can be interpreted relative to each other. Overall, 22 studies were selected.

Table 3 Review of selected motor learning studies using golf putting as a task in 2013–2018

Once again, the results showed that even within the context of the same task, studies used a variety of putt distances (Fig. 1d). Even though several studies used a putting distance of 3 m, this was almost exclusively driven by a single group of authors. In addition, there were also variations in the hole type, diameter, and scoring system (Table 3). As emphasized earlier, even though such differences may seem trivial at first glance, they can create important differences between studies. For example, the use of an actual hole (as opposed to a circle) increases ecological validity but adversely affects estimation of error and variability, because a range of ball velocities will land the ball in the hole and thus be ‘compressed’ to zero error. Similarly, from a scoring standpoint, discretizing a continuous error measure (such as the distance of the ball from the center of the target) into measures such as the number of successful putts or a points system has the potential to significantly distort learning curves (Bahrick et al. 1957). Finally, other information was incomplete to the extent that a direct replication would be difficult. For example, regarding the putting surface, the term ‘synthetic’ or ‘artificial’ green was used in several papers; however, only two papers mentioned the ‘speed’ of the green, which is a critical variable for replication. These results highlight that even when the same general task is chosen, there is still a high degree of fragmentation in task paradigms across studies.
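As a minimal illustration of how discretizing a continuous error measure can distort the shape of a learning curve, the simulation below uses assumed rather than empirical parameter values and compares mean radial putt error against the proportion of ‘holed’ putts for the same simulated learners; in this example the continuous measure declines smoothly while the success rate stays near floor for much of practice.

```python
# Minimal simulation (assumed parameters): continuous error vs. discretized success rate.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_blocks = 30, 20
hole_radius_cm = 5.4          # ~half the regulation hole diameter of 10.8 cm
true_error = 40 * np.exp(-np.arange(n_blocks) / 5) + 10   # mean error (cm) per block

# Each subject's error per block: exponential learning curve plus noise
errors = rng.gamma(shape=4, scale=true_error / 4, size=(n_subjects, n_blocks))

# Treat a putt as 'holed' if its radial error is below the hole radius (a simplification)
continuous_curve = errors.mean(axis=0)                       # mean error in cm
discretized_curve = (errors < hole_radius_cm).mean(axis=0)   # proportion of 'holed' putts

for block in range(0, n_blocks, 5):
    print(f"block {block:2d}: mean error = {continuous_curve[block]:5.1f} cm, "
          f"success rate = {discretized_curve[block]:.2f}")
```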

Consequences of task fragmentation

Given the evidence for fragmentation, one argument could be that these variations in parameters are trivial and do not alter our understanding of motor learning in any meaningful way. We wish to highlight two points in this regard—(i) there is evidence that at least some of these ‘trivial’ parameters can have a significant influence on the dependent variables, which in turn may affect inferences about learning; for example, in visuomotor rotation, the choice of the number of targets and the rotation angle has been shown to influence the rate of learning and the magnitudes of implicit and explicit learning (Bond and Taylor 2015). (ii) In rare cases, these parameter changes have the potential to completely alter the conclusions drawn; for example, the presence or absence of catch trials during training in force field adaptation has been shown to influence whether motor memory consolidation is observed (Overduin et al. 2006). Given these possibilities, we describe both the theoretical and methodological consequences of task fragmentation in detail.

Theoretical consequences

From a theoretical standpoint, the fragmentation of tasks across studies makes every finding an ‘island’ and hampers the cumulative progress of science, in which researchers replicate findings and then build on previous research (Zwaan et al. 2018). This issue is critical in light of the recent ‘replication crisis’, where well-known effects in many fields have either failed to replicate or have had smaller effects than originally assumed (Camerer et al. 2018; Open Science Collaboration 2015). Although one argument for using different tasks is that they can be useful in testing the generality of theories and hypotheses in different contexts, the utility of relying exclusively on such ‘conceptual replications’ has been challenged because it is subject to confirmation bias (Chambers 2017). In other words, if the results of the original study and the conceptual replication agree, this is taken as evidence of the generality of the effect; however, if they differ (i.e., the replication ‘failed’), the differences are attributed to the changed task parameters in the conceptual replication. This leads to a situation where the robustness of the original finding can never be questioned. Moreover, it has been argued that even when such conceptual replications fail, current incentive structures at many journals make the likelihood of publishing these results low, leading to a biased literature (Pashler and Harris 2012).

This ability to exactly replicate a study (i.e., a direct replication) is especially important in the context of motor learning because of the recognized role of tasks and task constraints in determining behavior (Newell 1986, 1989). Given that tasks can vary along several dimensions, it is perhaps not surprising that many of these dimensions have been used as explanations for discrepancies in results across studies—e.g., practice spacing effects are affected by whether tasks are discrete or continuous (Lee and Genovese 1989), the effects of augmented feedback frequency are affected by whether tasks are simple or complex (Wulf and Shea 2002), and sleep consolidation effects are affected by whether the tasks involve sequence learning or adaptation (Doyon et al. 2009). Although these task dimensions certainly play an important role, their true role cannot be fully understood without ensuring the robustness of the original findings through direct replications.

Methodological consequences

In addition to theoretical consequences, there are also methodological consequences of task fragmentation. Here, we focus on three primary consequences of such fragmentation—(i) arbitrariness in choice of task parameters, (ii) arbitrariness in choice of sample size, and (iii) inability to compare magnitudes of effects across different manipulations.

Arbitrariness in task parameters

At the experimental design stage, the use of a new task poses a challenge for the experimenter because it requires making choices about several task parameters that are critical to the experiment without sufficient information. For example, in a motor learning study, a critical parameter is the practice duration, yet this choice is rarely explicitly justified. Instead, researchers are likely to choose these values through a combination of unpublished pilot testing, rules of thumb based on other published studies, and convenience (e.g., choosing the shortest duration possible). These arbitrary choices can greatly limit the generalization of motor learning findings to the real world—for example, in spite of the seemingly diverse range of tasks used, very few studies focus on the period after extended practice, when an effective performance plateau has been reached (Hasson et al. 2016).

An even bigger challenge is the choice of the experimental manipulation itself. Typically, studies in motor learning involve a between-group manipulation of an independent variable, with the experimenter having to decide on the values of this variable. For example, a study on variable practice may use a throwing task and compare two groups—a ‘constant’ group that always practices throws to a target from the same distance, and a ‘variable’ group that practices throws to different distances (Kerr and Booth 1978). However, the critical choice of how much variability the variable group experiences can have a major influence on the observed results (Cardis et al. 2018). This is because most practice strategies (e.g., practice variability, practice spacing, feedback frequency) are likely to be ‘non-monotonic’ with respect to their influence on learning—i.e., there is an optimal level that maximizes learning, and the amount of learning decreases both above and below this optimal level (Guadagnoli and Lee 2004).

When the choice of this task parameter is made without sufficient information, the experimenter does not know where the groups lie on this non-monotonic function (Fig. 2). As a result, even when the underlying effect of a manipulation is consistent across studies, different studies may get seemingly ‘contradictory’ results simply because they are sampling at different points of this function (Fig. 2a–c; a simple simulation of this sampling problem is sketched after Fig. 2). To further compound this problem, when studies use different tasks, these discrepancies caused by variations in sampling may get incorrectly attributed to the task itself. One potential solution is to characterize a full ‘dose–response curve’ for each task and task manipulation by increasing the number of groups and sampling across the full parameter range (Fig. 2d). However, given the effort needed to establish such a dose–response curve, doing so for every new task would likely be impractical. These considerations highlight the need for fewer tasks, and more data on those tasks, to make more informed decisions about task parameters.

Fig. 2
figure 2

Arbitrary selection of task parameters can lead to seemingly ‘contradictory’ results. Considering that the effects of many manipulations in motor learning (e.g., variability, spacing, feedback frequency) are likely non-monotonic, task parameter selection becomes critical. For example, in a simple two-group design (the ‘low’ group in blue, and the ‘high’ group in red), even though the underlying relation is the same in all cases, arbitrary selection of the task parameter can lead to (a) the high group learning more than the low group, (b) no difference between groups, or (c) the low group learning more than the high group. d A dose–response curve with more groups avoids this issue by providing a complete description of the non-monotonic relation; however, given the much larger sample size required, establishing such dose–response curves becomes impractical when each study uses a different task
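To make the sampling problem illustrated in Fig. 2 concrete, the following sketch (all function shapes and parameter values are assumptions chosen purely for illustration) simulates three two-group studies that sample different regions of the same inverted-U dose–response relation; the sign of the group difference flips across the three studies even though the underlying relation never changes.

```python
# Sketch of the sampling problem in Fig. 2 (all values are illustrative assumptions).
# Learning benefit is an inverted-U function of a practice parameter (e.g., variability).
import numpy as np

rng = np.random.default_rng(1)

def expected_learning(x):
    """Hypothetical non-monotonic (inverted-U) dose-response relation."""
    return 10 * np.exp(-((x - 0.5) ** 2) / 0.05)

def run_two_group_study(low, high, n_per_group=12, noise_sd=2.0):
    """Simulate one 'low' vs. 'high' comparison at two points on the curve."""
    low_scores = expected_learning(low) + rng.normal(0, noise_sd, n_per_group)
    high_scores = expected_learning(high) + rng.normal(0, noise_sd, n_per_group)
    return low_scores.mean() - high_scores.mean()

# Three studies sample different regions of the same underlying curve (cf. Fig. 2a-c)
for low, high in [(0.1, 0.4), (0.3, 0.7), (0.6, 0.9)]:
    diff = run_two_group_study(low, high)
    print(f"low={low}, high={high}: mean(low) - mean(high) = {diff:+.1f}")
```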

Arbitrariness in sample size planning

Another key choice in experimental design is the sample size. Several reviews have emphasized the need for a priori sample size planning, because the lack of adequate power stemming from small sample sizes can greatly reduce the reliability of findings in the literature (Button et al. 2013). However, just like the task parameters, sample size planning tends to become arbitrary when a new task is used. Consistent with a prior review of studies in motor learning (Lohse et al. 2016), sample sizes in our current review were around 10–16 per group (25th–75th percentile), regardless of the effect being studied. This consistency suggests that sample sizes are likely driven by heuristics for meeting a ‘publication threshold’ for journals.

The major barrier to doing a priori sample size planning in a new task is the lack of information on the expected effect size, or alternatively the ‘smallest effect size of interest’ (Lakens et al. 2018). Effect sizes estimated from small samples of pilot data are not reliable (Albers and Lakens 2018), and even meta-analytic estimates of effect size in motor learning seem to suffer from issues of small sample size and publication bias (Lohse et al. 2016). Moreover, as mentioned in the task parameters section, heterogeneity in the effect size across different studies could also be driven by factors such as variation in the tasks and task parameters. These issues again point to the need for more data on fewer tasks to obtain reliable estimates of effect sizes, and thereby determine an appropriate sample size.
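As a concrete illustration of how strongly a priori sample size planning depends on the assumed effect size, the short sketch below uses the power routines in the Python package statsmodels; the effect size values are arbitrary benchmarks (Cohen's conventional small, medium, and large), not estimates from the motor learning literature.

```python
# Sketch: required sample size per group for a two-group comparison (independent t-test)
# at 80% power and alpha = .05, for several assumed effect sizes (Cohen's d).
# Effect size values here are arbitrary illustrations, not literature-based estimates.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in [0.2, 0.5, 0.8]:
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80, ratio=1.0,
                             alternative="two-sided")
    print(f"Cohen's d = {d:.1f}: ~{int(round(n))} participants per group")
```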

Uninformative magnitude of effects

Finally, if motor learning studies are to be relevant to the real world (and not just mere demonstrations of effects), the goal is not only to detect whether there is an effect of an intervention, but also to estimate the size of the effect—i.e., what causes the ‘biggest’ effect. A literature with fragmented tasks is detrimental to this goal because it prevents any relative comparison of the magnitude of effects across different experimental manipulations. For example, how do we compare the relative benefits of manipulations like practice spacing, variable practice, or self-controlled practice if each of these manipulations uses a different task with different dependent variables? Although this might seem like an ‘apples to oranges’ comparison, knowing at least the effective range of performance improvements for each of these manipulations is critical to determining an effective training paradigm in the real world. For example, in rehabilitation studies, comparisons between different types of therapies (e.g., robotics vs. conventional therapy) are routinely made by comparing the improvement in movement impairment measured on a common scale (e.g., the Fugl-Meyer score) (Lo et al. 2010). In motor learning studies, however, such a common scale requires the use of the same task because (i) unlike measures of movement impairment, measures of motor learning are, by definition, specific to the task, and (ii) using standardized effect sizes (e.g., Cohen’s d) to make comparisons across tasks can be problematic because factors other than the mean difference, such as sample variability, influence these effect sizes (Baguley 2009). These issues highlight that for comparing magnitudes of effects between different manipulations, there is a need for common tasks (and associated dependent variables), in which improvements in performance can be directly compared across studies in raw units of measurement (e.g., error measured in centimeters).
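The following worked example (with made-up numbers) illustrates the second point: two hypothetical tasks showing the same 2 cm raw improvement yield very different values of Cohen's d simply because between-subject variability differs across the tasks.

```python
# Illustration with made-up numbers: the same 2 cm raw difference yields
# different values of Cohen's d depending on the task's between-subject variability.

def cohens_d(mean_diff, sd1, sd2, n1, n2):
    """Cohen's d using the pooled standard deviation."""
    pooled_sd = (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)) ** 0.5
    return mean_diff / pooled_sd

# Task A: low variability (e.g., a constrained lab task); Task B: high variability
print("Task A:", round(cohens_d(mean_diff=2.0, sd1=2.0, sd2=2.0, n1=12, n2=12), 2))  # d = 1.0
print("Task B:", round(cohens_d(mean_diff=2.0, sd1=6.0, sd2=6.0, n1=12, n2=12), 2))  # d ~ 0.33
```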

A call for ‘model’ task paradigms

In light of these issues, we feel there is a necessity to create a few ‘model’ task paradigms for motor learning studies. These model task paradigms could serve as a common testbed for several studies on motor learning and labs all over the world could run identical (or nearly identical) experiments using these paradigms. This proposal is analogous to the study of model organisms in related fields like biology and neuroscience. In such model systems, there is a recognition that it is not fully possible to study all possible variations, but that the knowledge gained from systematic study of a few carefully chosen representative systems can provide important insights for the field.

Characteristics of model task paradigms

What would be the characteristics of model task paradigms for motor learning? Although a number of factors may be involved (e.g., whether the timescale of learning is feasible, whether the difficulty is appropriate so that a majority of participants can learn the task), our view is that a task chosen for a model paradigm would have to score high on at least two dimensions: (a) relevance and (b) replication.

The ‘Relevance’ dimension measures how well the paradigm addresses the scientific question of interest. Tasks that score high on relevance would have high face validity (i.e., how well the task represents the motor learning process of interest), high internal validity (the ability to tightly control extraneous factors in the experiment), and high external validity (the ability to generalize to other contexts), including ecological validity (the ability to generalize to learning of real-world tasks). It is important to note that many of these factors may be unknown at the time of task development, so there is a need for domain expertise and qualitative judgment in determining relevance. Given that motor learning likely involves distinct types of processes in the different categories specified in Table 1, model tasks for each of these processes would also have to be distinct to directly address them (i.e., a model task for adaptation would differ from a model task for sequence learning or one for variability). Importantly though, by specifying these model tasks at the level of the processes associated with learning (which are relatively few in number), researchers could reuse the tasks to address multiple research questions. For example, a model task of sequence learning could be used to address varied research questions such as the role of sleep and consolidation (i.e., whether sleep enhances learning), contextual interference (i.e., whether random practice is superior to blocked practice), or the effect of self-controlled practice (i.e., whether having control of the practice sequence is superior to randomly practicing sequences).

The ‘Replication’ dimension measures how easy it is to replicate the paradigm in different labs with access to potentially different resources. Tasks that score high on replication rely little on specialized equipment while still allowing high measurement precision. For example, for a task involving control of variability, an underhand throw to a target would score higher on the replication dimension than a golf putt because it does not require a specific putter, nor is it affected by environmental factors such as the friction between the ball and the surface. Similarly, a model task paradigm using inexpensive tools like webcam-based marker tracking (Krishnan et al. 2015) or markerless video tracking (Mathis et al. 2018) would score higher on the replication dimension than tasks requiring specialized, expensive equipment because it could more easily be replicated in labs with access to fewer resources.

Once an appropriate task is identified, the next step in making it a ‘model task paradigm’ is to ensure sufficient transparency so that other researchers can replicate and build on the results. This involves two major steps: (a) the tasks are specified in enough detail that other groups can replicate them as closely as possible, and (b) the data and analyses from these experiments are shared in a public repository so that results from prior experiments can be combined and compared with results from future experiments. Practical guidelines for sharing methods and data have been extensively reviewed in other domains (Gorgolewski and Poldrack 2016; Klein et al. 2018). In motor learning specifically, because of the richness and complexity of the behavior involved, a particularly relevant solution is the use of video to help other researchers replicate the procedure more closely (Gilmore and Adolph 2017).

Advantages of using model task paradigms

The use of model task paradigms directly addresses the challenges raised in the previous section. First, from a theoretical standpoint, model task paradigms permit direct replications, which increase the likelihood of finding effects that are robust. Second, by adopting a ‘replicate and extend’ strategy (i.e., an experiment directly replicates a previous experiment but also collects data on some new parameter values), data from the first few studies would effectively yield ‘dose–response curves’ (such as those shown in Fig. 2d) that can inform the choice of task parameter values for experimental manipulations. In fact, the use of model task paradigms opens the door for large-scale studies on which multiple labs across the globe can collaborate—see, for example, the Psychological Science Accelerator (Moshontz et al. 2018). Such approaches may allow questions that require large sample sizes and are currently not being investigated (e.g., individual differences) to be answered, because they are beyond the scope of a single lab. Third, the presence of openly available data on a single task paradigm can produce more reliable estimates of effect sizes, and also facilitate discussion of what theoretically meaningful effect sizes are. This information, analogous to the minimal clinically important difference (MCID) used in rehabilitation studies, is critical to distinguishing between ‘statistically significant’ and ‘meaningful’ results. Finally, using model task paradigms across different types of manipulations will allow direct comparisons of raw effect sizes between different practice strategies, providing practitioners with a good understanding of the relative utility of these strategies.

Beyond addressing these challenges, another feature of model task paradigms is that they can effectively constrain ‘researcher degrees of freedom’ (Simmons et al. 2011). Although this term has been used to describe how undisclosed flexibility in data collection and analyses (such as flexibility in sample size or choosing among dependent variables) can make anything look ‘statistically significant’, the same issue also arises in the context of flexibility in task paradigms. For example, early studies on the role of augmented information in motor learning often chose task paradigms with extremely poor intrinsic information, such as drawing a line of a specified length. As a result, the role of augmented information was overrated in these tasks because it was often the only way participants could know what the task goal was (Swinnen 1996). Relatedly, the measurement of learning in these contexts has also relied on somewhat artificial situations such as No-KR tests, which often involve blindfolding participants so that they cannot see the natural outcome of their actions (Russell and Newell 2007). Such criticisms have raised an important question of how relevant principles derived from simple tasks are to real-world learning (Wulf and Shea 2002; Russell and Newell 2007). Because model task paradigms are common to broad themes in motor learning, rather than to individual research questions, they can effectively constrain (intentional or unintentional) ‘tweaking’ of the task, as the task and analyses are largely fixed in advance.

Last but not least, the use of model task paradigms also allows ‘data-driven’ discovery that could complement the dominant ‘hypothesis-driven’ approach in motor learning. The availability of relatively large data sets on a few standardized tasks could yield answers to questions that were not originally the focus of the work. An example of this in motor learning is the DREAM database (Walker and Kording 2013): originally established as a collection of different experiments on reaching, data from these experiments were subsequently used to address a question about variability and the rate of motor learning (He et al. 2016). In addition, these large data sets can also help generate and test new theories or models of learning, as any newly proposed theory or model needs at least to adequately accommodate these data before making other testable predictions for future experiments.

The key steps involved in developing a model task are illustrated in Fig. 3. To demonstrate this with an example, consider the underarm ball toss to a target as a model task for learning to control variability (Rossum and Bootsma 1989). First, this task scores relatively high on relevance: it has good face validity because throwing the ball accurately to the target requires control of motor variability; internal validity can be increased through control of extraneous factors; and it likely has ecological validity, given that several real-world motor learning tasks, like the basketball free throw or the golf putt, require control of variability. Second, this task also scores relatively high on replication: because the only implement used is a ball, the paradigm is easy to replicate in any lab without the need for expensive or specialized equipment. Third, to make data from this task useful for other researchers, the dependent variable of task performance would have to be measured in ‘real-world units’—for example, the error from the target would need to be measured in centimeters (rather than, say, on a points scale). Fourth, initial studies using the task would aim to establish learning under a range of conditions involving variations of experimental parameters—for example, varying target distances or the amount of practice. Sample sizes for these initial studies may rely on broad rules of thumb for effect sizes (such as a Cohen’s d of 0.5). Fifth, the data then have to be examined for how well they support inferences about the underlying learning—e.g., does task performance plateau too quickly, is the learning too variable between subjects, is the learning retained over a period of time? These questions depend on the underlying question of interest—for example, high between-subject variability may actually be desirable if the goal is to examine individual differences. Sixth, the protocol and data are then deposited in a public repository (e.g., the Open Science Framework) that is available for other researchers to use. The final step concerns how other scientists in the field perceive the proposed task—if the community is convinced of its utility for examining the motor learning process of interest, the task is adopted for further experiments and becomes a ‘model task’.

Fig. 3
figure 3

A roadmap for constructing model tasks. At each stage of research (experimental design, data collection and analysis, and dissemination), the key steps and examples of associated questions that may need to be addressed are highlighted. The final step “acceptance by broader scientific community” is not undertaken directly by the experimenter but indicates how a proposed task eventually becomes a model task. It is important to note that these processes and questions are not intended to be an exhaustive list for every context but rather provide a guide for such decisions. Ultimately, the goal of these model tasks is to enable work to be combined across multiple studies and labs in a way that can establish robust findings

Once a model task is established, subsequent studies could leverage this information in different ways. For example, one could begin manipulating practice strategies through dose–response studies. Using the underhand throw example, a study on variable practice could use multiple groups with a wide range of practice variations (instead of the conventional two-group design) to examine in what parameter range the ‘strongest’ effect of variable practice occurs in this task. Moreover, the learning curves established in the initial studies would also help make magnitudes of effects more interpretable in terms of the time scale of learning (Day et al. 2018)—for example, if two groups differ by a throwing error of 2 cm at the end of practice, that could potentially be translated into a statement like ‘using this practice strategy produced an improvement that would normally take 100 additional trials of practice’. Finally, because data are collected under the same standard conditions and shared across studies, it would become easier for future studies to determine effect sizes more precisely, which would then lead to more efficient sample sizes and a robust base of evidence for findings.
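As a sketch of how such a translation could work in practice, the example below fits an exponential learning curve to simulated baseline data (all parameter values are assumptions; the curve-fitting approach shown is one possible method, not necessarily that of Day et al. 2018) and expresses a hypothetical 2 cm advantage at a given trial as the number of additional baseline practice trials needed to reach the same error.

```python
# Sketch (assumed values): express a 2 cm error advantage as the number of additional
# practice trials it would take to reach the same error on a baseline learning curve.
import numpy as np
from scipy.optimize import curve_fit

def exp_curve(trial, a, tau, b):
    """Exponential learning curve: error decays from a + b toward the asymptote b."""
    return a * np.exp(-trial / tau) + b

# Simulated 'baseline' data, standing in for openly shared data from a model task
rng = np.random.default_rng(2)
trials = np.arange(300)
errors = exp_curve(trials, a=20.0, tau=80.0, b=5.0) + rng.normal(0, 1.0, trials.size)

(a, tau, b), _ = curve_fit(exp_curve, trials, errors, p0=(15.0, 50.0, 3.0))

# Suppose a practice strategy lowers throwing error at trial 100 by 2 cm vs. baseline
t_ref = 100
target_error = exp_curve(t_ref, a, tau, b) - 2.0
# Invert the fitted curve to find when the baseline curve reaches that error level
t_equivalent = -tau * np.log((target_error - b) / a)
print(f"A 2 cm advantage at trial {t_ref} is worth ~{t_equivalent - t_ref:.0f} extra trials")
```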

Costs to using model task paradigms

From a practical standpoint, there is a cost in terms of the initial effort involved in developing model task paradigms compared to the status quo. Potential factors that drive this effort are (a) more careful consideration of tasks, (b) precise specification of the associated measurement, (c) the use of larger sample sizes with more groups, and (d) the effort involved in making data and resources openly available. However, we think that the benefits are cumulative: subsequent studies become much easier to plan and execute because investigators can skip repeated pilot-testing phases and use previously published data to make informed estimates about new experiments. A second cost is the potential risk of duplication if two research groups run the same experiment using the same model task. However, with the rising popularity of preprints and of formats such as registered reports, which allow for ‘results-blind’ acceptance (Caldwell et al. 2019; Chambers 2013), we believe that such concerns can be overcome.

More broadly for the field, there is a potential concern that model task paradigms may narrow the impact or generality of the field by decreasing the diversity of phenomena being studied. This concern has been expressed in the context of model organisms in biology (Yartsev 2017), and it is important not to fall back on the ‘easy’ route of only studying questions that can be answered using existing model task paradigms. At the level of the individual researcher, an overreliance on model tasks may hamper creativity and limit new discoveries. However, given that motor learning is currently at the other extreme, with excessive fragmentation, we think that this concern is, at least for the moment, not a major one. In fact, as model task paradigms emerge for different themes, they may actually help increase the diversity of problems studied by more clearly revealing which issues have received less attention, providing opportunities to address such gaps through creative discovery. Moreover, model tasks themselves are not fixed but are shaped by the scientific community—as some tasks reach a point of diminishing returns in terms of their utility, they could be replaced by other model tasks.

It is also perhaps worth re-iterating that model task paradigms are not meant to be a requirement for every experiment. Research questions at either extreme of the theoretical-applied spectrum are likely to continue to use customized tasks that suit their purpose. On the theoretical side, studies may involve a very specific manipulation (e.g., using a robotic exoskeleton to perturb a single joint) that requires the use of a task which does not fall into one of the model tasks. Similarly, on the applied side, there will always be a need for applied studies where the task itself is critical to the research question being answered (e.g., improving surgical technique). However, for the vast majority of studies in the middle of this spectrum, which have some flexibility in the choice of tasks, model task paradigms may provide a solution to the current level of fragmentation. These paradigms will also continue to evolve with greater theoretical understanding and improvements in measurement tools. Ultimately the success of any model task paradigm will depend on how other researchers in the field see its value, both in terms of the insights it generates, and in terms of how these insights generalize to the real world.

Conclusion

In his highly influential paper on motor learning, Jack Adams (Adams 1971) criticized the use of real-world tasks, arguing that they resulted in ‘disconnected pockets of data’ that were unsuitable for the development of general scientific principles. However, we have shown that, with the current level of fragmentation, the same problem now exists even with laboratory tasks in motor learning. We therefore believe that addressing this issue is vital for the field. Many of the key steps outlined in Fig. 3 (standardizing protocols and dependent variables, and open data-sharing practices) have recently been discussed in other behavioral sciences for conducting large-scale multi-lab studies and clinical trials (Open Science Collaboration 2015; Adolph et al. 2012; Kwakkel et al. 2017; Frank et al. 2017), and it is our view that the field of motor learning may also benefit from such an effort.

Two related questions remain—(i) has this fragmentation always been the case in motor learning, and (ii) what are the underlying reasons for it? For the first question, we note that an early attempt to standardize tasks was undertaken in the “Learning Strategies Project” (Donchin 1989), which developed a computer game called Space Fortress (Mané and Donchin 1989) to allow direct comparisons between different learning strategies. Strikingly, the primary rationale stated for building a common computer game was task fragmentation, as evidenced by the following quote: “…it was quite evident that the diversity of paradigms and theoretical approaches within which the phenomena were studied, and the models tested, made it very difficult to compare results across studies” (Donchin 1989). We therefore do not think that the problem of task fragmentation is recent, although it has likely worsened in recent years as experimenters gained the tools to build their own hardware and software. For the second question of why such fragmentation exists, we can only speculate that a number of factors may drive it—incentives for novelty over replication, experimental ‘traditions’ handed down from mentors to graduate students, or even seemingly mundane issues like the limited availability of space or equipment, which increases the likelihood that new paradigms are created to fit existing resources. Regardless of the underlying reasons, we suggest that it is time for research efforts to coalesce around a few model task paradigms for a more robust science that researchers in the field can build upon.