1 Introduction

Although a clear relationship between skill and success exists in sport, there is currently a paucity of literature reviewing the characteristics of existing tests examining skill [1], with the majority of the literature to date focusing on physical determinants of performance. Although tests of specific skill outcomes date back over 50 years [27], outdated methodology and undefined measurement properties (i.e., reliability, validity and responsiveness) often limit their usefulness. Tests of skill outcomes have widespread utility in research, in particular for the purpose of assessing the effect of coaching or scientific interventions on performance [810]. Recent studies have also utilised these tests to investigate the effects of nutrition [1114], game-specific fatigue [15], performer focus of attention [16] and pre-skill execution routine [17] on participant performance. Further, a body of work exists in team-based field sports such as football in assessing participant skill (amongst other factors) within simulated match-play environments [1820].

The prevalence of skill outcome tests being used in the field is also widespread. For example, the use of data or scores obtained from appropriately designed assessments can potentially eliminate the need to collect longitudinal information on an athlete, for the purposes of rating or ranking them either individually or against their peers. Further, these tests can also been used to assist in identifying relative strengths and weaknesses of the performer [2123], monitor progress of an athlete within a structured training programme [2224], provide information on predictive performance potential [8, 23], inform improved practice and training complexity/specificity [25] as well as provide a time-efficient method of defining participant ability levels [26].

Recently, skill outcome tests have been considerably useful for the purposes of identifying talent in sport [8, 21, 22, 2729]. For example, team-based competitions such as the Australian Football League in Australia and the National Football League in the United States employ multidisciplinary testing ‘combines’ in their player drafting processes that involve each participant receiving a score based on an outcome of a specific test. Although these events have traditionally focused on physiological assessments, in an attempt to account for additional attributes associated with producing a high level of performance in these sports, tests examining skill outcomes such as kicking, passing and throwing accuracy have also been assessed in recent times.

However, the use of skill outcome tests, used either in isolation or as part of a multi-disciplinary assessment protocol, has also been the topic of considerable discussion recently [28, 3035]. This debate appears to centre predominantly on (a) the representative design of currently utilised testing methods and (b) the ideal level of specificity and detail included in such assessments. In particular, the latter consideration has focused on whether designed tests should assess participants on a series of technical-based actions or indicators, as opposed to scoring the relevant skill outcome alone (although a combination of both has been used). The decision made by test designers to utilise either approach may have contrasting advantages with relation to reliability, validity, feasibility as well as the intended purpose for undertaking the test. For example, it is evident that the processes that contribute to skilled outcomes in sporting scenarios exhibit considerable inter- and intra-individual variability [3639], potentially rendering assessments of such components inherently unreliable [40]. This can also be a consideration in the test design of skill outcomes, with recent work showing differences in the reliability of soccer passing versus shooting in testing scenarios [24, 29]. Additionally, tests assessing outcomes of skill in isolation can also face issues in displaying adequate validity, at least in part due to the context in which they are undertaken; often the situational, task-strategizing and decision-making components of undertaking the particular action are not able to be considered [41, 42]. Irrespective of this discussion, tests examining skill outcomes are used for a range of purposes; however, there appears to be no formal system in place with which to evaluate their measurement properties.

Regardless of whether a test has been developed for research or practical purposes, it is well established that it should display acceptable measurement properties; this has, in particular, been well addressed in medical and health-related fields [4348]. However, despite widespread use, studies investigating such tests in sport may not consistently report these properties. Although tests of physiological performance have been the subject of review in recent times [1], to our knowledge, only three specific studies specifically examining sport performance assessments have been published. Of these reviews, two exclusively addressed football (soccer) [10, 27] whilst also discussing in some depth the physiological and technical contributors to performance [1, 20, 27].

Therefore, in considering the suitability of sporting skill outcome tests, a number of rating items should be considered. Firstly, detailed descriptions of methodological quality and study characteristics are important so that results can be considered with relevance to the population being examined. For example, the properties displayed by a skill test when undertaken by elite participants should not be assumed to be similar when being utilised with participants of lesser ability level or, for example, the opposite sex. Additionally, the provision of this information allows for accurate reproduction and comparison of studies by future researchers or coaches implementing the test in the field. Such descriptions should therefore be inclusive of a number of components including specific details on the participants themselves [44, 49, 50], inclusion and exclusion criteria [44, 50], consideration of sample size [44, 49, 51], reporting of floor and ceiling effects [44], stability of test conditions and participants between retest periods [44, 49] and the test–retest interval duration [44, 51].

As multiple trials often form part of a testing protocol scoring system [8, 52] and may be necessary in order to gain a better representation of a participant’s actual ability [22, 53], studies should also be examined for evidence of reporting reliability. Further, three main types of validity are typically stated as being important characteristics to the investigation of the quality assessment of a test, and therefore also warrant reporting. These are content validity, construct validity and criterion-related validity [12, 5457].

Feasibility is another test property commonly examined in the health/medical literature [46, 49, 58, 59]. In the context of this review, it can be defined as the ease in which a test can be undertaken, administered and scored or rated [4958]. Feasibility is of particular importance to sport, where tests need to be practical for the environment they are intended to be used within, or will be likely to experience limited use by athletes, coaches and researchers. It could be reasoned that skill outcome tests have been particularly popular in their use as they are relatively easy to score and can often be undertaken without the use of potentially expensive high-end equipment.

The primary aim of this systematic review was to examine the methodological quality of sporting skill outcome tests reported in the peer-review literature as well as report the types and level of measurement properties investigated in these tests. A secondary aim of this review was to examine factors related to the feasibility and limitations of the identified tests.

2 Methods

A number of methods for reporting items in systematic reviews exist in the literature [43, 44, 6062]. However, due to their lack of specificity for use in systematically assessing measurement properties of variables/tests and feasibility-related issues, a customised framework based on previous literature was required to be developed for use in this review. A similar approach has been undertaken in previous systematic reviews examining test measurement properties in other disciplines [50, 6365], although wherever possible the COSMIN (Consensus-based Standards for the selection of health Measurement Instruments) framework [44] was in particular deferred to where possible. Additional considerations relating to the design of this framework (as well as the rating items contained within) were informed by a number of additional sources, including studies assessing similar domains [1, 24], validated systematic review guidelines and checklists [54, 61, 62, 66, 67] as well as other reviews which have utilised a customised model [50, 64]. This process is described in greater detail in Sect. 2.4.

2.1 Search Strategy

The literature search for this review was undertaken between July 2012 and March 2013 by the first author (SR) using the SPORTDiscus and MEDLINE databases. Key words utilised in the search using multiple combinations of AND/OR phrases included ‘skill’, ‘measurement’, ‘test’, ‘assessment’, ‘reliability’, ‘validity’, ‘testing’, ‘elite’, ‘sport’, ‘instrument’, ‘sporting’, ‘practical’, ‘outcome’, ‘reproducibility’, ‘task’ and ‘feasibility’. Further studies were collected following examination of citations present within the collected publications (‘snowballing’).

2.2 Inclusion Criteria

Initial pilot testing of the search strategy in February 2012 revealed multiple studies relating to the design of skill tests as far back as 1958. However, no studies prior to 1990 were found to have met the inclusion criteria described below; therefore, in facilitating the search process, articles were required to be published after 1990 and up to and including March 2013. Additional inclusion criteria for studies examining skill outcome tests in this review were as follows: (a) each publication addressing a skills test collated from the abovementioned search strategy must have been peer reviewed and written in English; (b) abstracts of each article were required to be present in the database search; (c) articles describing the use of a multidisciplinary testing battery could be included provided the skill outcome testing component could be extracted and reviewed separately to other assessment items.

2.3 Exclusion Criteria

The following criteria resulted in exclusion of studies for this review: (a) articles not reporting at least one component of either reliability or validity of the developed test; (b) articles that reported physiological function or specific motor skills not directly relevant to the sport investigated or assessing a skill outcome; (c) articles utilising tests that had their measurement properties investigated previously elsewhere; (d) articles that stated utilising minor adaptations of tests investigated previously; and (e) any articles that had been withdrawn from publication. Further, (f) studies examining tests rating or scoring participants on technical processes in isolation from recordable skilled outcomes were excluded. For example, tests that rated combinations of technical criteria in order to produce a score were excluded as they were not assessing the skill outcome per se. Studies that examined both processes in addition to a skill product or outcome had the latter components extracted for review wherever possible.

2.4 Data Extraction

As the validity of using customised scored review templates for systematically reviewing measurement properties and feasibility of skill outcome tests is yet to be defined [52], quantitative ratings for each of the reviewed items were not provided. The assessment items used in this review were based on study quality, test measurement properties (reliability, validity and responsiveness) and feasibility. Wherever possible, data pertaining to the measurement properties of each instrument were recorded.

A total of seven items were used to rate study quality and the operational definitions have been reported in Table 1. These items were the level of detail provided on study participants, whether participant inclusion/exclusion criteria were reported, the size of the participant sample, whether floor and ceiling effects were reported, whether familiarisation was undertaken with the participants prior to testing, whether the stability of both participants and testing conditions was accounted for, and lastly the reporting of the length of the test–retest interval. Although a variety of methods can be used to determine appropriate sample size [6870], absolute sample size values were used to allow direct comparison across studies [44].

Table 1 Details of review items relating to study methodological quality

Information relating to test–retest reliability and inter/intra-rater reliability were also retrieved, with the type and level of reliability both assessed (operational definitions provided in Table 2). Additionally, due to the large variety of statistical analyses in studies, reliability statistics for only the six most commonly reported approaches were reported. These were: coefficient of variation (CV%), intraclass correlation coefficient (ICC), correlation coefficients (r), 95 % limits of agreement (inclusive of ratio limits of agreement) (LoA and RLoA, respectively), typical error of measurement (TEM) and generalisability theory. Although specific ratings were provided for studies that reported ICC and r values, no published guidelines were found relating to what constituted an acceptable level of reliability for the remaining four statistical approaches. Consequently, ratings of numerical results were not provided in studies that reported reliability using solely these methods.

Table 2 Details of review items relating to measurement properties and feasibility

Operational definitions relating to validity are reported in Table 2. Although some evidence exists supporting the use of both the kappa statistic and the content validity index (the proportion of a small group of experts that agree on a certain item being included in the assessment of a domain) to determine content validity [55, 71, 72], these have not been widely reported in the sport literature. A more common method has been the use of ‘expert’ panels or coaching groups to develop test items. Whilst there are limitations to this process [73], it is nonetheless used substantially in the relevant literature. Therefore, for the purposes of this review, content validity was rated according to whether a study gained concession by an expert panel for the items assessed in the test. Construct validity was considered as inclusive of both discriminative and convergent validity [54, 55, 74, 75], whilst criterion-related validity included a consideration of both the concurrent and predictive properties of the test [54, 55, 74]. In assessing these types of validity, some research has defined correlation coefficients in excess of 0.65 [48] or 0.70 [76] as appropriate; however, support also exists for values of between 0.30 and 0.50 as being acceptable [49, 74, 76, 77]. Although such correlation data was reported in some of the reviewed studies, due to the variety of statistical approaches utilised, studies were assessed on whether these measurement properties were investigated by the authors, as opposed to reporting results. However, the statistical approach used was reported wherever possible.

Operational definitions for responsiveness and feasibility characteristics are also reported in Table 2. Test responsiveness can be assessed by calculating the ratio of the clinically relevant change to the standard deviation of the intra-participant test–retest differences [78, 79], or by referring to the test’s effect size [58, 74]. Other common methods include obtaining the minimum clinically important difference (MCID) [80] or comparing median test scores from multiple rounds of testing [81]. In this review, studies were rated on whether data relating to the undertaking of any of these approaches were reported, with the length of the interval observed between these two (or more) rounds of testing also obtained. As studies should also focus on interpretability; they were also rated on whether they provided information relating to the minimum important change or difference. Finally, components relating to test feasibility and limitations were also recorded. As such, information relating to practicality, test duration, intended context, the presence of a familiarisation session/s and consideration of test limitations were all also extracted for the purposes of rating [46, 58]. No appropriate published quantitative values of feasibility item types for the kind of tests investigated in this review were found; therefore, studies were rated on whether each of these areas were included.

A customised Microsoft Excel™ spreadsheet was developed to record the abovementioned extracted data from each of the studies reviewed. All data from each study was extracted by two authors independently. Prior to undertaking this assessment it was stipulated that any instance where the two reviewers provided conflicting scores for any of the criteria, the paper would be re-assessed. However, this did not occur at any stage throughout the review process.

3 Results

A total of 604 articles were found as a result of the initial search strategy and snowballing processes. An outline of the search results and reasons for exclusion has been provided in Fig. 1. It should be noted that 34 studies were excluded from the review as they examined tests of motor skills not directly relating to a performance outcome. Further, 10 studies were also excluded as they detailed only minor revisions of existing, original versions of tests already included in the review. As a result of applying the inclusion and exclusion criteria, a total of 22 studies remained for inclusion in the review. Of these 22 studies, five described skill outcome tests designed for use in football, three each for volleyball and golf, two for hockey, with one each for tennis, rugby league, squash, water polo, netball, rock climbing, racquetball, wheelchair basketball and quad rugby. Table 3 provides a description of the characteristics of the reviewed studies.

Fig. 1
figure 1

PRISMA flow diagram

Table 3 Study characteristics of the 22 articles included in the review

3.1 Study Methodological Quality

Table 4 displays the results of the study quality assessment undertaken for the skills tests. Of the studies reviewed, 59 % were shown to have adequately stated participant characteristics, with 36 % receiving a partial score. Only 14 % of the reviewed studies stated both inclusion and exclusion criteria adequately with a further 18 % of the total studies providing inclusion criteria only. A range of participant sample sizes were noted across the studies (n = 11–313) with 18 % utilising a sample size of n > 50 and just 14 % recruiting n >100. Floor and ceiling effects of participant scores were only reported in a small number (14 %) of cases. A total of 64 % of studies also implemented familiarisation sessions as part of their tests. In 68 % of studies, the stability of both the participants and test conditions were adequately reported, with a further 14 % receiving a partial rating. Test–retest intervals ranged from 10 min to 28 days, with 77 % of studies reporting this detail. Same-day retesting was undertaken in 18 % of these studies, whereas 68 % implemented retesting sessions that were undertaken within 1 week of the initial assessment.

Table 4 Study methodological quality items of the reviewed skill tests

3.2 Reliability

Table 5 displays results relating to the rating of the measurement properties and feasibility characteristics of the reviewed skills tests. Of the six statistical approaches used to assess level of reliability, 64 % of studies reported ICCs, 27 % used CVs, 32 % utilised Pearson or Spearman product moment correlations, 18 % reported 95 % LoA (or RLoA), with 14 and 5 % reporting TEM% and generalisability theory, respectively. In just under half (41 %) the studies reviewed, a good to excellent level of test–retest reliability was reported, whereas in the majority of the remaining studies (55 %), a partial rating for reliability was given. Inter-rater reliability was investigated in the three studies that involved testers undertaking assessments of participants and then provided scores on their observations [22, 82, 83]. Inter-rater reliability was assessed using similar techniques as for test–retest reliability, with all studies in this case reporting a form of correlation coefficient (i.e., ICC or an r value). Intra-rater reliability was examined in only 9 % of studies most likely due to it not being considered relevant for investigation in the majority of cases [22, 84].

Table 5 Measurement properties and feasibility of the reviewed skill tests

3.3 Validity and Responsiveness

Content validity was assessed in 68 % of the studies reviewed and was determined (at least in part) through consultation with a panel of experts or coaches in 27 % of cases [8, 22, 26, 84, 85]. Only one study generated and reduced test items through mail-based Delphi rounds [82]. The remaining studies (36 % of the total number reviewed) used a combination of review of literature and an assessment of actual game/competition demands.

Construct validity was determined in 64 % of these studies with most utilising the existing status of the participant (professional competing, high-level amateur or amateur) as the construct for categorisations of ability. Of these studies, 71 % used between-group comparisons of test scores (i.e., via t tests or ANOVA) to determine whether differences existed between ability level, whereas the remaining 29 % used minimum clinically important differences (MCID) values or correlational or factor analysis. A total of 36 % of studies also reviewed investigated criterion-related validity in their skill outcome tests. All of these determined the level of association with a concurrent measure including comparisons with expert/coach rankings provided prior to testing [24, 82, 84, 86, 87], or comparisons of observed scores with expected participant rankings (based on external scales) [13, 29, 88].

Only 14 % of studies examined a test’s ability to predict future performance, with all utilising correlational analysis to determine the relationship of participant score with rankings and/or performance in subsequent tournaments or competitions [26, 84, 86]. Further, only 9 % investigated the responsiveness of their testing protocol. These studies reported MCID [22, 25] and utilised data taken from a post-testing session undertaken 4 weeks later [48] to assess this measurement property. Additionally, 32 % of studies reported the minimum important change or difference as part of their investigation.

3.4 Feasibility and Limitations

Test feasibility considerations and test limitations were addressed in 50 % of the studies reviewed. A further 36 % received a partial score, with the reduction in rating predominantly due to the lack of information provided regarding the limitations of the test. Of the 22 studies, 55 % also reported the intended context or use for their designed skill test, or it was implied due to the purpose of the study. Of the studies providing this information, 42 % stated the related protocols may be of use for the purposes of evaluating the success of interventions [8, 9, 21, 24, 26, 89, 90], with 17 % specifically developing their instrument to examine the effects of nutritional or ergogenic aid supplementation [12, 88]. A further 17 % stated a use for their protocol in talent identification [21, 22, 91], with other reasons including a time-efficient manner of defining and monitoring participant development [22, 87], a method of benchmarking participants [85, 92] and a process in which to inform an increase in practice schedule design or complexity [25, 91]. Time to complete the tests was reported in 41 % of studies with values ranging from 20 to 90 min, although it is worth noting that the longest test was part of a multidisciplinary testing battery assessing other non-skill domains.

4 Discussion

The overarching objective of this study was to (a) identify sporting skill outcome tests reported in the peer-reviewed literature and (b) systematically review these studies based on their methodological quality and measurement properties reported. Considerations relating to test feasibility were also examined. Findings from the search strategy revealed there were a relatively small number of studies assessing all measurement properties (i.e., reliability, validity and responsiveness) with just over half adequately investigating some aspect of feasibility.

Despite the reporting of participant characteristics being important for the purposes of test reproducibility, they were not fully described in the majority of cases. In particular, information relating to the specific ability level of participants as well as their anthropometric characteristics was lacking. The external reproducibility of many of the reviewed studies was also potentially compromised due to a lack of clear inclusion and exclusion criteria. Authors should be encouraged to show greater transparency by reporting these criteria in future work. Participant sample considerations in this review related to the size of the cohort(s) investigated. However, as a number of studies recruited professional or elite level participants as part of their investigation, access to a larger population of these cohorts is likely to be more difficult than in many other disciplines [24]. In ensuring sample size is adequate, authors should ideally recruit participants from a range of ability levels, which in turn can also allow for a more thorough investigation of construct validity. Whilst not a rating item in this particular review, it should also be noted that the need for implementation of familiarisation sessions was addressed in the majority of studies where relevant. As results stemming from these preliminary sessions typically noted a retest improvement for lower-level participants in particular [24, 29, 88], these authors should be commended for including such an undertaking as part of the investigation of their tests. The attention provided by many authors to ensuring both testing and participant conditions remained stable between retesting sessions should also be noted.

Whilst a range of test–retest interval durations were reported in the studies reviewed, it is difficult to provide an objective rating on what the exact duration of this test characteristic should be, as it is dependent on the nature of the test itself (i.e., the number and complexity of skilled actions being performed). Regardless, it is important for test–retest intervals to not be too short in duration as (a) this may not allow for adequate examination of the assessments’ temporal stability [54], and (b) often performers may still be fatigued from previous trials [68, 93] (although this is likely to be more of a concern in physiologically exertive assessments). Conversely, excessively long retest intervals can result in large variation of results (thereby affecting reliability); this may be due to seemingly innocuous factors, (i.e., participant circadian variations) [94] or notable skill improvements in participants between the two trials.

An inclusion criterion for this review was that either reliability or validity of each skill test was reported in the reviewed study. Test–retest reliability was the most commonly addressed measurement property reported across the tests reviewed, with all but one of the reviewed studies investigating this property. Of those studies that investigated test–retest reliability, just under half displayed good to excellent repeatability. In the rare circumstances where inter-rater reliability was assessed, good to excellent levels of agreement were found. For ease of reader interpretation, this review reported only the six most commonly used methods in assessing reliability and as such is not a comprehensive representation of the statistical methods available on which to assess this measurement property. Existing systematic review frameworks have recommended rating studies on whether a particular statistical technique is utilised [44]; however, a discussion on this area is beyond the scope of this review and the reader is directed elsewhere for a comprehensive discourse on the pros and cons of available techniques used to assess reliability in this context [68, 93].

It is also worth noting that any investigation of test reliability should include some consideration of the amount of error present in any measurement tools used to assist in the scoring of the assessment. For example, a number of technologies such as radar measurement devices [21, 81], radar speed guns [29, 88, 90], and video cameras [22, 24, 25, 82, 90] were all utilised to obtain data that was directly used in either the scoring or administering of the reviewed tests. In some circumstances, information relating to digitisation techniques and analysis errors were reported; in these cases the authors should be commended for providing such detailed descriptions [22, 29, 82, 90]. Authors are recommended to do likewise when developing future tests where such technologies are integral to the scoring of the protocol.

Due to a lack of widely reported techniques in assessing content validity for sporting skill tests, it was not surprising that for the majority of studies reviewed, no statistical techniques were used to assess this form of validity. It is recommended that wherever possible researchers use a formal process and/or quantitative measure to assess this form of validity, such as the Delphi rounds seen in previous studies [82] or those commonly used in other disciplines (i.e., a content validity index) [54, 72]. The argument for this more transparent approach is supported by the consideration that although in some cases determining the content of a particular testing protocol may seem a relatively simple task, many (in fact, most) sports require multiple skills to be executed. This may mean that one individual test does not assess the entire content of skill and multiple tests may be needed to define a construct more completely [8, 21, 24, 27]. Therefore, sports involving complex and multiple skill domains can pose a particularly difficult problem for researchers. This may be due to multiple or different skills being required within competition (i.e., passing, shooting, catching). Further, and specifically in team sports, both the type of skill requirement and their relative importance may differ between players depending on their role or position within the team. Further still, certain participants may display a high level of aptitude in one domain yet be relatively mediocre or poor in another.

When considering these factors, it is not surprising that there has been some recent debate regarding the appropriateness of assessing different components of skill in isolation from each other, particularly in the football codes [30, 31]. Whilst the approach of concurrently assessing multiple components has precedent in two of the five football-specific studies reviewed here [12, 52], a decision on which skills to include in a test design is likely to depend on the intended use of the protocol. For example, some sports may be better disposed to isolated extraction and testing of items better than others (such as golf, which requires clearly differentiated skills performed in relatively ‘closed’ environments). As shown in Table 3, skill outcomes/domains such as ‘accuracy’, ‘placement’, ‘passing’, ‘shooting’ and ‘time to complete’ tasks were commonly assessed within the studies included in this review. Some authors also implemented minimum skill execution speed [29, 88] or temporal [12] constraints to the design of the protocol with others including the use of dual-task methodology to more accurately assess participant skill [25]. An obvious benefit of the addition of these types of environmental constraints to test protocols can be the improvement of the external validity and/or representative design of the test. With particular reference to skill tests, the latter term is perhaps best described as “how the (test) design…may allow for the maintenance of coupled perception and action processes that reflect the functional behaviour of athletes in specific performance contexts” [35].

Despite the undoubted importance of these methodological considerations, ensuring there is a balance between improving the representative design of a test and maintaining or improving its measurement properties (in particular, protecting against a loss of reliability) can be a quandary for researchers when designing protocols. The development of a test displaying good measurement properties should ideally allow for more specific, concurrent evaluation of the technical processes and actions contributing to the skill outcome. Such an approach can also then allow better investigation of the ‘how’ and ‘why’ of the performance achieved (if relevant to the specific study). However, the initial goal of the researcher should be to develop appropriate measurement properties as a priority. For example, evolution of and amendments to tests occur in other disciplines, with some having undergone considerable changes from initial versions in attempts to increase time efficiency and/or representativeness [95]. Future research and discussion may seek to include better representative task design; however, a lack of a clear definition in this context makes this difficult at present.

With reference to construct validity, although discriminative test characteristics were typically investigated by studies in this review, limited evidence of the investigation of convergent validity was noted. This is can be a particularly perplexing form of validity for investigators in sports performance to assess, as often one of the defining motivations for development of a new test may be because of a gap in the literature and therefore, there may be no similar test to compare the new method to [54]. This may at least partially explain why there were only a small number of cases noted in this review. However, as the number of skill tests reported in the literature continues to increase, such investigations may become both more useful and relevant to researchers. For example, examination of convergent validity may inform the development of a more comprehensive testing assessment than in existing versions and/or help to reduce the length of such protocols (i.e., thereby also increasing test feasibility) [54, 75]. Particularly, if a test requires expensive equipment or is of a particularly long duration, it is unlikely to experience continued use by those working in the field. Whilst the ability of a test to relate to a concurrent measure of the same construct is important for its criterion-related validity, a test displaying a proven ability to predict actual performance (predictive validity) could be considered an even more important characteristic of a test. However, as shown in this review, very few studies of sporting skill outcomes have examined this property.

Similarly, the evaluation of a test’s responsiveness was rarely investigated in the studies included in this review. This is despite the fact that responsiveness is routinely investigated in other fields of research such as epidemiology [78], or when examining quality of life [79] or rehabilitation outcomes [81, 96, 97]. Similarly to test–retest reliability, investigation of a test’s responsiveness requires access to the same group of participants for repeat assessments and therefore can be difficult when examining samples such as elite athletes who may have competition and/or training schedules that conflict with the ideals of test designers. In particular, when using these populations, investigators need to consider the ethical implications of excessive testing whilst ensuring the benefits from the testing outweigh any potential athlete burden. Ongoing, mutually beneficial collaborations with sporting bodies can potentially present researchers with suitable opportunities to investigate this particular measurement property of their tests.

Whilst the need for a test to display acceptable measurement properties is clearly important, its usefulness as a tool for researchers and coaches is reduced if it is not feasible or practical. Whilst less than half of the studies in this review stated the potential use of their tests as well as their limitations, a number of practical considerations went largely undiscussed. For example, other considerations such as the availability and cost of equipment [59, 98], the ease of incorporating the test with participants of different ability levels [59], level of participant enjoyment, number of participants to be tested [59], and the availability of skilled examiners [98] were not routinely reported. Some investigations into test feasibility in other fields have utilised standardised expert or coach interviewing to rate some of the test components post-testing. This included the perceived value of the assessment (by the rater, participant and coach), ease of scoring [59, 98], time taken to explain and set up the test [54, 59, 96], as well as the availability of equipment provided [58, 59, 96]. Therefore, it is evident that feasibility requires further consideration in studies of the nature reviewed here.

Whilst the duration of a test may be dependent on both the sport and the skill itself, it is logical to suggest that implementation of the test should be shorter than the actual competition itself. Tests of excessive duration may have the potential to induce fatigue [68] and/or cause the performer (or their coach/coaches) to lose interest or motivation in undertaking the assessment. This may be of particular concern when undertaking tests with younger participants, where increased pressure may also cause poor and unrepresentative performance of participants.

Duration of a test, however, will also be highly dependent upon the number of trials undertaken, which in turn is influenced by the number of trials required to gain a true representation of a participant’s ability. In many sports, a single trial may suffice and may actually be representative of the task being assessed; however, there may be a need for multiple trials in some skill tests. This may particularly be the case in sports of a continuous nature. This consideration, most likely combined with an intention to produce adequate reliability (termed the Spearman-Brown prophecy) was noted in almost all of the tests reviewed. However, although quite likely well justified in these cases, in most studies the number of repeated trials utilised appeared to be decided arbitrarily. Test designers should look to base the optimal number of trials on objective evidence. For example, in other disciplines, particular testing items may have their weightings adjusted according to their importance to the testing construct [95, 99]. Further, item reduction techniques such as Rasch analysis and item concept retention can also be used to reduce the number of items within an instrument while also maintaining high levels of test–retest reliability [95, 99].

4.1 Limitations

A limitation of this review was the inability to undertake any form of meta-analysis. This was due to the considerable variety of statistical procedures used to determine test measurement properties. Additionally, it should be noted that findings from this review may not be generalisable due to the relatively small number of sports examined in the studies contained therein. As different sports will contain contrasting skill components and expressions of performance, the sports investigated here provide only an overview of the sports contained within. Further, it is likely that tests currently exist in use within practical environments that have not been reviewed here due to not being reported in the literature.

5 Conclusions

This review assessed the methodological quality, measurement properties and feasibility of 22 studies reporting tests of sporting skill. Methodological quality of the studies was mixed, with minimal information provided on inclusion and exclusion criteria and optimising sample size. Implementation of familiarisation sessions and a consideration of participant and testing condition stability were present in the majority of studies. A range of methods and statistical procedures have been used by researchers to determine the measurement properties of their skill outcome tests, thereby making direct comparison of studies difficult. Test–retest reliability was determined in all but one of the reviewed studies, whilst most investigated at least two aspects of validity (i.e., content, construct or criterion-related validity). However, a distinct lack of specific investigation into both the predictive validity and responsiveness of skill outcome tests was noted. While some aspect of feasibility was addressed in just under half of the studies, considerations relating to test practicality were not formally investigated in any of the studies. As the items for this review were extracted from a number of existing models reported in other disciplines, future work may look to develop a specific framework for use in the sports sciences. Until then, a consideration of the study quality characteristics, measurement properties and feasibility items outlined in this review can assist future researchers in the development and or modification of skill tests in sport.